An endpoint-coordinate lens on transformers. Read the residual stream as a trajectory, not a stack
of snapshots, and a decision is often recoverable layers before it becomes readable at the output.
Erik Quintanilla · June 2026 · experiments on Qwen3.5-2B
A language model builds its answer gradually, layer by layer. If you stop treating each
layer as a static snapshot and instead read the residual stream as a flow toward where it
is heading, one lens buys you two things: an early-warning jailbreak detector and a clean
picture of when a model makes up its mind. The experiments here are
on open models, primarily Qwen3.5-2B.
Endpoint coordinates. The residual stream runs horizontally across layers. A small
endpoint probe (an MLP) at a middle layer ℓ* sends a dashed arch to the final node L,
predicting the model’s answer (blue star) several layers before that answer becomes readable from
the output (violet, at L).
Across layers: MLP refines in place, attention reads the sentence
Follow one token (the output position) as its residual is built across depth. Most of that work is
MLP refinement: the token gets polished in place, reshaped
without moving closer to the answer. The one operation that reaches across to the other tokens
is attention. A real network runs attention once per layer, but collapse
those many small reads into the single decisive one and the cartoon is clean: MLP refines the output
token over and over, and attention reads the sentence once to fetch the answer and snap the
residual onto it.
“Distance to the answer” is not hand-waving here. Define the answer as the
model’s final residual state h_L, and measure how far the current state
still has to travel to reach it: the length of the chord from where the residual is now to where it ends
up. Decompose progress along that chord by sublayer and a clean pattern falls out: per unit of movement,
attention covers far more of the chord toward h_L
than MLP does. That remaining distance is what the vertical axis below tracks.
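One way to make "chord covered per unit of movement" concrete is sketched below. It is a minimal illustration, not the measurement code behind the figures: the function name `chord_efficiency`, the choice to divide total distance closed by total update length, and the toy 2-D path are all assumptions made for the example.

```python
import numpy as np

def chord_efficiency(states, kinds):
    """Distance closed toward the final state per unit of movement, by sublayer kind.

    states: (T + 1, d) residual after each sublayer write; states[-1] plays the role of h_L.
    kinds:  length-T labels, one per update, e.g. "attn" or "mlp".
    """
    states = np.asarray(states, dtype=float)
    kinds = np.asarray(kinds)
    dist = np.linalg.norm(states[-1] - states, axis=1)        # chord length at each point
    closed = dist[:-1] - dist[1:]                              # distance closed by each update
    moved = np.linalg.norm(np.diff(states, axis=0), axis=1)    # length of each update
    return {
        str(k): float(closed[kinds == k].sum() / moved[kinds == k].sum())
        for k in np.unique(kinds)
    }

# Toy 2-D path with made-up numbers: "mlp" steps move sideways,
# "attn" steps move along the chord toward the endpoint (4, 0).
toy_states = [[0, 0], [0, 1], [2, 1], [2, 0], [4, 0]]
toy_kinds = ["mlp", "attn", "mlp", "attn"]
efficiency = chord_efficiency(toy_states, toy_kinds)
print(efficiency)
```

By the triangle inequality the ratio can never exceed 1, so a value near 1 means an update points almost straight at the endpoint, and a value near 0 means it reshapes the state without approaching it.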
One caveat on the cartoon: the example attention matrix is the standard-transformer picture, but Qwen3.5-2B is a
Gated DeltaNet hybrid in which only about a quarter of its 24 layers run a softmax read, while the
rest mix tokens through a recurrent state with no attention matrix to draw. So “attention
transports” is really “token mixing transports,” softmax or linear: the matrix
is one readable instance of the move, not what every layer is doing.
How one token’s residual reaches the answer (schematic). Reading left to right is depth;
reading top to bottom is distance to the answer. MLP sublayers (violet, the ↻ loops) refine
the token in place, getting no closer. Attention (blue) is the one cross-token
operation: it reads the earlier tokens of the sentence (the example attention matrix, top, where the
query row “is” reads the keys “capital” and “France”) and snaps
the residual down toward the answer. A standard transformer interleaves a softmax read at every layer; Qwen3.5-2B does so
only in its sparse attention blocks, and here those reads are collapsed into the single decisive one. The vertical axis is a real quantity, not just a picture: it
is how far the current state still is from the final state h_L (the chord the
residual has left to cover). Measured per sublayer, attention closes more of that gap per step than
MLP; the layout is stylized, but the ordering it shows is the measured one.
That is the cartoon. Here is the real thing: ask Qwen3.5-2B the same kind of question many
ways (the capital of France, of Japan, of Egypt, of Brazil; eight countries in all) and
capture its last-token residual at every one of its 24 layers, then project the whole stack into 3D.
Color runs from the embedding (cyan) to the final layer (violet). Whether you reduce it linearly (PCA)
or nonlinearly (UMAP), the same picture holds. Because the questions share a routine, their
trajectories run nearly on top of one another: the model is executing one
“look up the capital” computation regardless of which country you ask about.
Real Qwen3.5-2B residual-stream trajectories for eight country-to-capital prompts, one path per
prompt, colored by layer depth (cyan to violet). Drag to rotate, scroll to zoom,
click a prompt to isolate it. The paths nearly coincide: same relation, same routine.
Left: PCA on the unit-normalized states (linear). Right: UMAP
(nonlinear). Violet diamonds mark the final-layer endpoint each trajectory lands on. The UMAP
layout is illustrative, not metric: nonlinear embeddings can exaggerate structure, so the
quantitative claims below rest on the linear readouts, not on these geometries.
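The linear panel can be approximated with a few lines of NumPy. The sketch below assumes the captured last-token residuals are already stacked into an array of shape (prompts, layers, hidden size). It uses random stand-in data rather than real Qwen3.5-2B activations, and `pca_3d` is a hypothetical helper name.

```python
import numpy as np

def pca_3d(states):
    """Project unit-normalized residual states onto their top three principal axes.

    states: (n_prompts, n_layers, d). Returns (n_prompts, n_layers, 3).
    """
    states = np.asarray(states, dtype=float)
    n_prompts, n_layers, d = states.shape
    flat = states.reshape(-1, d)
    flat = flat / np.linalg.norm(flat, axis=1, keepdims=True)  # unit-normalize each state
    flat = flat - flat.mean(axis=0)                            # center before PCA
    _, _, vt = np.linalg.svd(flat, full_matrices=False)        # rows of vt are principal axes
    return (flat @ vt[:3].T).reshape(n_prompts, n_layers, 3)

# Stand-in data: eight "prompts" sharing one random-walk trajectory plus small per-prompt noise.
rng = np.random.default_rng(0)
shared = np.cumsum(rng.normal(size=(24, 64)), axis=0)
toy = shared[None] + 0.05 * rng.normal(size=(8, 24, 64))

coords = pca_3d(toy)
across_prompts = coords.std(axis=0).mean()  # spread between prompts at a fixed layer
across_layers = coords.std(axis=1).mean()   # spread along depth for a fixed prompt
print(across_prompts, across_layers)
```

When trajectories share a routine, the spread between prompts at a given layer should be small relative to the spread along depth, which is the comparison the last two lines make.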
1. The model decides to refuse before it shows it
Give the model 200 prompts, a mix of harmful and benign, and let its own behavior label each one
refuse-vs-comply. A linear monitor reads that decision straight off the residual stream at
layer 9, where its AUC crosses 0.9. The same decision only becomes visible through the
output (the logit lens [1]) at layer 16: the answer is
linearly present in the residual stream about seven layers (nearly a third of the depth) before the
model writes it into output space.
One honest asterisk: the “recoverable” reader is a supervised probe while the logit
lens is untrained, so a head start of a few layers is partly expected (essentially why the
tuned lens [4] exists) and the gap shrinks against a learned lens.
Refuse-vs-comply AUC by layer. Recoverable (blue), a linear read of the
residual stream, crosses 0.9 at layer 9; readable (violet), the same
decision via the logit lens, only catches up at layer 16. The shaded band is the gap between
deciding and showing.
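The gap in this figure comes down to two small computations: an AUC per layer for each reader, and the first layer at which each curve reaches the threshold. The sketch below shows both under stated assumptions. The helper names are hypothetical, and the two curves are invented toy numbers, not the measurements reported above.

```python
import numpy as np

def auc(scores, labels):
    """Rank-based AUC (Mann-Whitney U). Assumes no tied scores."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return float((ranks[labels].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg))

def first_layer_at_or_above(auc_by_layer, threshold=0.9):
    """Index of the first layer whose AUC reaches the threshold, or None."""
    hits = np.flatnonzero(np.asarray(auc_by_layer) >= threshold)
    return int(hits[0]) if hits.size else None

# Sanity check of the AUC helper on four scored examples.
example_auc = auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])

# Invented per-layer curves for illustration only.
recoverable = [0.55, 0.70, 0.86, 0.93, 0.96, 0.97]  # linear probe on the residual stream
readable = [0.50, 0.52, 0.60, 0.75, 0.88, 0.94]     # same decision via the logit lens
gap = first_layer_at_or_above(readable) - first_layer_at_or_above(recoverable)
print(example_auc, gap)
```

On real data, `recoverable` would hold the held-out AUC of a probe fit at each layer and `readable` the AUC of the logit-lens score at each layer; the shaded band is then `gap` layers wide.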
2. Catch a jailbreak before it lands
Train the endpoint reader on clean text only, so it learns where a normal prompt’s residual lands.
Then score a live activation by how far it sits from that clean destination. Because the reader only
ever saw well-behaved prompts, a successful jailbreak, whose true endpoint lies off the clean
manifold, leaves a large reconstruction residual. That signal separates successful jailbreaks [2]
from refused ones at ~0.93 family-stratified AUC essentially from the first layers,
with no labelled jailbreak examples. The Mahalanobis-to-clean-manifold baseline is near chance until the
back half of the network and only crosses AUC 0.85 around layer 14.
The two peaks are comparable (~0.95 either way), so the advantage is timing, not ceiling: a
usable signal arrives layers earlier, early enough to abort generation before the harmful span is produced.
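A minimal sketch of the reconstruction-residual idea follows. Two things in it are assumptions rather than the article's method: the reader is a ridge-regularized linear map instead of the MLP probe described earlier, and the score is taken to be the norm of the difference between the observed final state and the reader's prediction. The data is synthetic.

```python
import numpy as np

def fit_endpoint_reader(x_clean, y_clean, ridge=1e-2):
    """Fit a linear map from early-layer states x to final states y on clean data only."""
    mu_x, mu_y = x_clean.mean(axis=0), y_clean.mean(axis=0)
    xc, yc = x_clean - mu_x, y_clean - mu_y
    w = np.linalg.solve(xc.T @ xc + ridge * np.eye(xc.shape[1]), xc.T @ yc)
    return mu_x, mu_y, w

def reconstruction_residual(reader, x, y):
    """Distance between the observed endpoint y and the endpoint predicted from x."""
    mu_x, mu_y, w = reader
    predicted = (x - mu_x) @ w + mu_y
    return np.linalg.norm(y - predicted, axis=1)

# Synthetic stand-in: clean endpoints follow one map; "off-manifold" endpoints are shifted.
rng = np.random.default_rng(0)
d = 16
true_map = rng.normal(size=(d, d)) / np.sqrt(d)

def simulate(n, shift=0.0):
    x = rng.normal(size=(n, d))
    y = x @ true_map + 0.05 * rng.normal(size=(n, d)) + shift
    return x, y

reader = fit_endpoint_reader(*simulate(200))  # trained on clean examples only
clean_scores = reconstruction_residual(reader, *simulate(50))
off_scores = reconstruction_residual(reader, *simulate(50, shift=1.0))
print(clean_scores.max(), off_scores.min())
```

No shifted example is used in fitting, which mirrors the "no labelled positives" setup: the score is large simply because the reader has never seen an endpoint of that kind.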
Family-stratified jailbreak-success AUC. The endpoint residual (blue) is informative from the first
layers; Mahalanobis (violet) is near chance through the early and middle layers and only reaches
AUC 0.85 around layer 14. The peaks are comparable (~0.95 either way); the win is far
earlier warning, on 600 prompts with no labelled positives.
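For comparison, the Mahalanobis baseline can be sketched in a few lines: fit a mean and covariance to clean activations at a given layer, then score new activations by their Mahalanobis distance from that fit. The shrinkage term and the helper names are assumptions added for numerical stability and readability, and the data is synthetic.

```python
import numpy as np

def fit_clean_gaussian(clean, shrink=1e-3):
    """Mean and precision matrix of clean activations (covariance lightly regularized)."""
    mu = clean.mean(axis=0)
    cov = np.cov(clean, rowvar=False) + shrink * np.eye(clean.shape[1])
    return mu, np.linalg.inv(cov)

def mahalanobis(x, mu, precision):
    """Mahalanobis distance of each row of x from the clean distribution."""
    diff = x - mu
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, precision, diff))

# Synthetic stand-in for clean activations at one layer.
rng = np.random.default_rng(0)
clean = rng.normal(size=(500, 8))
mu, precision = fit_clean_gaussian(clean)

d_clean = mahalanobis(clean, mu, precision)
far_point = mu + np.array([10.0] + [0.0] * 7)  # a point well outside the clean cloud
d_far = mahalanobis(far_point[None, :], mu, precision)
print(d_clean.max(), d_far[0])
```

Like the endpoint residual, this score needs only clean data to fit. The two differ in what they measure: distance from the clean distribution at the current layer, versus mismatch with where a clean prompt would be expected to end up.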
What we showed
Endpoint coordinates. Read the residual stream as a flow toward the model’s
final state h_L, rather than as a stack of static snapshots.
Token mixing transports, MLP refines (a framing). Read this way, the cross-token
step does most of the moving toward the answer while MLP sublayers refine in place; the real
Qwen3.5-2B trajectories show related prompts tracing nearly coincident paths. Two hedges: the emphasis
is contested by the factual-recall literature [3], and Qwen3.5-2B is a hybrid where only about a quarter
of the mixing layers use the softmax attention drawn in the cartoon. Treat it as a lens, not a verdict.
Recoverable before readable. A refusal is linearly decodable at layer 9 but
only readable through the logit lens at layer 16: the answer is linearly present in the residual
stream roughly a third of the depth before the model writes it out (with the supervised-vs-untrained
caveat noted in Section 1).
Early jailbreak warning. The endpoint reconstruction residual flags
successful jailbreaks from layer 0 (~0.93 AUC, no labelled positives), about
14 layers earlier than the Mahalanobis baseline: early enough to abort generation.
Data. The safety results use harmful and benign goals from
JailbreakBench [2], a public benchmark, each wrapped in five
jailbreak families; refuse-vs-comply labels come from the model’s own greedy behavior. The
trajectory demos use matched country-to-capital prompts.
Conclusion. Read a transformer in endpoint coordinates (where the
computation is heading, not just where it is) and a decision is often legible earlier than the output
reveals it. That single move powers both results here: the refusal gap and the early jailbreak alarm.
Take the asterisks seriously (a supervised reader against an untrained lens, a layer-0 jailbreak signal
that is partly lexical, and a hybrid model only partly described by the softmax-attention cartoon) and
the effects are more modest than any one-liner. But the direction, recoverable before readable, is real
and early enough to act on, whether to audit a decision or to stop it.