
Recoverable Before Readable

An endpoint-coordinate lens on transformers. Read the residual stream as a trajectory, not a stack of snapshots, and a decision is often recoverable layers before it becomes readable at the output.

Erik Quintanilla · June 2026 · experiments on Qwen3.5-2B

A language model builds its answer gradually, layer by layer. If you stop treating each layer as a static snapshot and instead read the residual stream as a flow toward where it is heading, one lens buys you two things: an early-warning jailbreak detector and a clean picture of when a model makes up its mind. The experiments here are on open models, primarily Qwen3.5-2B.

[Figure: An endpoint probe predicts the final node from a middle layer. Horizontal axis: layer (0, 1, ℓ*, L). Labels: prompt; Attn and MLP sublayers; residual update hₗ₊₁ = hₗ + f(hₗ); endpoint probe; predicted endpoint ĥL(ℓ*); answer token; commit, then readable; decided (recoverable) at ℓ*, shown (readable) at L.]
Endpoint coordinates. The residual stream runs horizontally across layers. A small endpoint probe (an MLP) at a middle layer ℓ* sends a dashed arch to the final node L, predicting the model’s answer (blue star) several layers before that answer becomes readable from the output (violet, at L).

Across layers: MLP refines in place, attention reads the sentence

Follow one token (the output position) as its residual is built across depth. Most of that work is MLP refinement: the token gets polished in place, reshaped without moving closer to the answer. The one operation that reaches across to the other tokens is attention. A real network runs attention once per layer, but collapse those many small reads into the single decisive one and the cartoon is clean: MLP refines the output token over and over, and attention reads the sentence once to fetch the answer and snap the residual onto it.

“Distance to the answer” is not hand-waving here. Define the answer as the model’s final residual state hL, and measure how far the current state still has to travel to reach it: the length of the chord from where the residual is now to where it ends up. Decompose that progress by sublayer and a clean pattern falls out: per unit of movement, attention covers far more of the chord toward hL than MLP does. That is what the vertical axis below tracks.
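One way to make the chord measurement concrete is sketched below. It is a minimal NumPy illustration, not the code behind the experiments: the function names, the `"mix"`/`"mlp"` labels, and the toy 2D trajectory are all made up for the example. It assumes you have the residual before and after every sublayer, and it scores each update by how much of its own length went into shortening the remaining distance to hL.

```python
import numpy as np

def chord_efficiency(h_before, h_after, h_final):
    """Share of one update's length that shortens the chord to h_final.

    By the triangle inequality the value lies in [-1, 1]: 1 means the update
    moved straight at the endpoint, 0 means it moved without getting closer,
    and a negative value means it moved away.
    """
    step = np.linalg.norm(h_after - h_before)
    if step == 0.0:
        return 0.0
    closed = np.linalg.norm(h_final - h_before) - np.linalg.norm(h_final - h_after)
    return float(closed / step)

def efficiency_by_sublayer(states, kinds):
    """Average chord efficiency per sublayer kind.

    states: array (n_sublayers + 1, d); states[i] and states[i + 1] are the
            residual before and after sublayer i, and states[-1] is taken
            as the endpoint h_L.
    kinds:  list of n_sublayers labels (hypothetical names: "mix", "mlp").
    """
    h_final = states[-1]
    per_kind = {}
    for i, kind in enumerate(kinds):
        value = chord_efficiency(states[i], states[i + 1], h_final)
        per_kind.setdefault(kind, []).append(value)
    return {kind: float(np.mean(values)) for kind, values in per_kind.items()}

# Toy 2D trajectory (illustrative only): the "mlp" step moves sideways,
# the "mix" step heads straight for the endpoint.
states = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 1.0]])
result = efficiency_by_sublayer(states, ["mlp", "mix"])
print(result)
```

Dividing by the step length is what makes this a "per unit of movement" quantity: a sublayer that moves the residual a long way sideways scores near zero even if its update is large.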

One caveat on the cartoon: the example matrix is the standard-transformer picture, but Qwen3.5-2B is a Gated DeltaNet hybrid in which only about a quarter of its 24 layers run that softmax read, while the rest mix tokens through a recurrent state with no attention matrix to draw. So “attention transports” is really “token mixing transports,” softmax or linear: the matrix is one readable instance of the move, not what every layer is doing.

[Figure: MLP refines the output token in place; one attention read fetches the answer. Horizontal axis: depth (layers across the network). Vertical axis: distance to the answer hL, closer toward the bottom. Inset: an example attention matrix over “The capital of France is” (queries in rows, keys in columns), where the row “is” reads “capital” and “France”.]
How one token’s residual reaches the answer (schematic). Reading left to right is depth; reading top to bottom is distance to the answer. MLP sublayers (violet, the ↻ loops) refine the token in place, getting no closer. Attention (blue) is the one cross-token operation: it reads the earlier tokens of the sentence (the example attention matrix, top, where the query row “is” reads the keys “capital” and “France”) and snaps the residual down toward the answer. A standard transformer interleaves a softmax read at every layer; Qwen3.5-2B does so only in its sparse attention blocks, and here those reads are collapsed into the single decisive one. The vertical axis is a real quantity, not just a picture: it is how far the current state still is from the final state hL (the chord the residual has left to cover). Measured per sublayer, attention closes more of that gap per step than MLP; the layout is stylized, but the ordering it shows is the measured one.

That is the cartoon. Here is the real thing: ask Qwen3.5-2B the same kind of question many ways (the capital of France, of Japan, of Egypt, of Brazil; eight countries in all) and capture its last-token residual at every one of its 24 layers, then project the whole stack into 3D. Color runs from the embedding (cyan) to the final layer (violet). Whether you reduce it linearly (PCA) or nonlinearly (UMAP), the same picture holds. Because the questions share a routine, their trajectories run nearly on top of one another: the model is executing one “look up the capital” computation regardless of which country you ask about.

Real Qwen3.5-2B residual-stream trajectories for eight country-to-capital prompts, one path per prompt, colored by layer depth (cyan to violet). Drag to rotate, scroll to zoom, click a prompt to isolate it. The paths nearly coincide: same relation, same routine. Left: PCA on the unit-normalized states (linear). Right: UMAP (nonlinear). Violet diamonds mark the final-layer endpoint each trajectory lands on. The UMAP layout is illustrative, not metric: nonlinear embeddings can exaggerate structure, so the quantitative claims below rest on the linear readouts, not on these geometries.
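The linear panel can be approximated with a few lines of NumPy. This is a sketch under assumptions, not the plotting code for the figure: `pca_project` is a hypothetical helper, and the input is random stand-in data shaped like eight prompts by 24 layers, built as a shared path plus small per-prompt noise so that it mimics "same routine, different country."

```python
import numpy as np

def pca_project(states, n_components=3):
    """Project residual states onto their top principal components.

    states: array (n_prompts, n_layers, d) of last-token residuals.
    Returns an array (n_prompts, n_layers, n_components).
    """
    n_prompts, n_layers, d = states.shape
    flat = states.reshape(-1, d)
    # Unit-normalize each state so direction, not norm, drives the layout.
    flat = flat / np.linalg.norm(flat, axis=1, keepdims=True)
    centered = flat - flat.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ vt[:n_components].T
    return coords.reshape(n_prompts, n_layers, n_components)

# Stand-in data (not real activations): one shared random-walk path across
# 24 layers, plus a small perturbation per prompt.
rng = np.random.default_rng(0)
shared_path = rng.normal(size=(24, 64)).cumsum(axis=0)
states = shared_path[None, :, :] + 0.05 * rng.normal(size=(8, 24, 64))

coords = pca_project(states)
spread_across_prompts = coords.std(axis=0).mean()  # same layer, different prompts
spread_across_layers = coords.std(axis=1).mean()   # same prompt, different layers
print(coords.shape, spread_across_prompts, spread_across_layers)
```

On data like this the spread across prompts at a fixed layer is much smaller than the spread across layers for a fixed prompt, which is the numerical version of "the paths nearly coincide."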

1. The model decides to refuse before it shows it

Give the model a mix of 200 harmful and benign prompts and let its own behavior label each one refuse-vs-comply. A linear monitor reads that decision straight off the residual stream at layer 9. The same decision only becomes visible through the output (the logit lens [1]) at layer 16: the answer is linearly present in the residual stream about seven layers (nearly a third of the depth) before the model writes it into output space.

One honest asterisk: the “recoverable” reader is a supervised probe while the logit lens is untrained, so a head start of a few layers is partly expected (that is essentially why the tuned lens [4] exists), and the gap shrinks against a learned lens.
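For readers who have not met the logit lens, the "untrained" reader amounts to decoding a middle-layer residual as if it were the final one. The sketch below is a generic NumPy illustration with made-up matrices; it assumes an RMSNorm-style final norm followed by an unembedding matrix, and the post does not say how lens logits were turned into a refuse-vs-comply score, so none is shown.

```python
import numpy as np

def rms_norm(h, weight, eps=1e-6):
    """RMSNorm over the last axis (assumed form of the model's final norm)."""
    scale = np.sqrt(np.mean(h * h, axis=-1, keepdims=True) + eps)
    return h / scale * weight

def logit_lens(h, norm_weight, unembed):
    """Decode an intermediate residual with the model's own output head.

    h:           residual state(s), shape (..., d)
    norm_weight: final-norm gain, shape (d,)
    unembed:     unembedding matrix, shape (vocab, d)
    Nothing is trained here: the final norm and unembedding are reused as-is.
    """
    return rms_norm(h, norm_weight) @ unembed.T

# Toy setup (hypothetical values): 4-dim residual, 3-token vocabulary.
norm_weight = np.ones(4)
unembed = np.eye(3, 4)
h_mid = np.array([0.1, 0.2, 2.0, 0.0])

logits = logit_lens(h_mid, norm_weight, unembed)
top_token = int(np.argmax(logits))
print(logits, top_token)
```

A tuned lens replaces the fixed decode with a small learned map per layer, which is why it narrows the head start a supervised probe enjoys.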

Recoverable-before-readable: refusal decodable at layer 9, readable at layer 16.
Refuse-vs-comply AUC by layer. Recoverable (blue), a linear read of the residual stream, crosses 0.9 at layer 9; readable (violet), the same decision via the logit lens, only catches up at layer 16. The shaded band is the gap between deciding and showing.
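The gap in the figure is just the difference between two threshold crossings. A minimal sketch of that bookkeeping follows; the helper names are hypothetical and the two short AUC curves are invented numbers chosen to exercise the code, not the measured curves.

```python
import numpy as np

def auc(scores, labels):
    """ROC AUC: probability a positive outscores a negative (ties count half)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

def first_layer_above(auc_by_layer, threshold=0.9):
    """Index of the first layer whose AUC reaches the threshold, or None."""
    for layer, value in enumerate(auc_by_layer):
        if value >= threshold:
            return layer
    return None

# Per-layer AUC comes from scoring every prompt with that layer's reader.
perfect = auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])

# Invented 5-layer curves, for illustration only.
recoverable_curve = [0.50, 0.70, 0.92, 0.95, 0.97]  # supervised linear probe
readable_curve = [0.50, 0.55, 0.60, 0.91, 0.96]     # logit-lens readout

recoverable_layer = first_layer_above(recoverable_curve)
readable_layer = first_layer_above(readable_curve)
gap = readable_layer - recoverable_layer
print(perfect, recoverable_layer, readable_layer, gap)
```

The 0.9 threshold matches the one used in the caption; a different bar would move both crossings and can change the size of the gap.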

2. Catch a jailbreak before it lands

Train the endpoint reader on clean text only, so it learns where a normal prompt’s residual lands. Then score a live activation by how far it sits from that clean destination: because the reader only ever saw well-behaved prompts, a successful jailbreak, whose true endpoint is off that manifold, leaves a large reconstruction residual. That signal separates successful jailbreaks [2] from refused ones at ~0.93 family-stratified AUC essentially from the first layers, with no labelled jailbreak examples. The Mahalanobis-to-clean-manifold baseline is near chance until the back half of the network and only reaches an AUC of 0.85 around layer 14. The two peaks are comparable (~0.95 either way), so the advantage is timing, not ceiling: a usable signal layers earlier, early enough to abort generation before the harmful span is produced.

Unsupervised jailbreak-success detector fires ~14 layers earlier than Mahalanobis.
Family-stratified jailbreak-success AUC. The endpoint residual (blue) is informative from the first layers; Mahalanobis (violet) is near chance through the early and middle layers and only reaches AUC 0.85 around layer 14. The peaks are comparable (~0.95 either way); the win is far earlier warning, on 600 prompts with no labelled positives.
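The mechanics of the two scores can be sketched on synthetic data. Several things here are assumptions rather than the method used above: the reader is a ridge-regression linear map swapped in for the MLP probe, the "reconstruction residual" is read as the distance between the predicted endpoint and the observed final state, and the data are random vectors constructed so that off-manifold examples look normal at the middle layer. It shows how the scores are computed, not the reported result.

```python
import numpy as np

def with_bias(h):
    """Append a constant column so the linear reader can learn an offset."""
    return np.hstack([h, np.ones((len(h), 1))])

def fit_endpoint_reader(h_mid, h_final, ridge=1e-2):
    """Ridge-regress final states on mid-layer states, using clean data only."""
    x = with_bias(h_mid)
    gram = x.T @ x + ridge * np.eye(x.shape[1])
    return np.linalg.solve(gram, x.T @ h_final)

def endpoint_residual(weights, h_mid, h_final):
    """Distance between the predicted endpoint and the observed one."""
    return np.linalg.norm(with_bias(h_mid) @ weights - h_final, axis=1)

def fit_mahalanobis(h_clean, shrink=1e-3):
    """Mean and (shrunk) precision matrix of clean mid-layer states."""
    mean = h_clean.mean(axis=0)
    cov = np.cov(h_clean, rowvar=False) + shrink * np.eye(h_clean.shape[1])
    return mean, np.linalg.inv(cov)

def mahalanobis(h, mean, precision):
    """Mahalanobis distance of each row of h from the clean distribution."""
    centered = h - mean
    return np.sqrt(np.einsum("nd,de,ne->n", centered, precision, centered))

# Synthetic stand-ins: clean endpoints are a fixed linear function of the
# mid-layer state; "off-manifold" endpoints are shifted away from it.
rng = np.random.default_rng(0)
d = 8
true_map = rng.normal(size=(d, d))
mid_clean = rng.normal(size=(500, d))
final_clean = mid_clean @ true_map + 0.1 * rng.normal(size=(500, d))
mid_off = rng.normal(size=(100, d))            # looks normal at the mid layer
final_off = mid_off @ true_map + 3.0           # lands somewhere else

weights = fit_endpoint_reader(mid_clean, final_clean)
res_clean = endpoint_residual(weights, mid_clean, final_clean)
res_off = endpoint_residual(weights, mid_off, final_off)

mean, precision = fit_mahalanobis(mid_clean)
maha_off = mahalanobis(mid_off, mean, precision)
print(res_clean.mean(), res_off.mean(), maha_off.mean())
```

Both scores are fit on clean examples only, which is what "no labelled positives" means in practice; the difference is that Mahalanobis asks whether the current state looks unusual, while the endpoint residual asks whether the state is heading somewhere unusual.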

What we showed

Conclusion. Read a transformer in endpoint coordinates (where the computation is heading, not just where it is) and a decision is often legible earlier than the output reveals it. That single move powers both results here: the refusal gap and the early jailbreak alarm. Take the asterisks seriously (a supervised reader against an untrained lens, a layer-0 jailbreak signal that is partly lexical, and a hybrid model only partly described by the softmax-attention cartoon) and the effects are more modest than any one-liner. But the direction, recoverable before readable, is real and early enough to act on, whether to audit a decision or to stop it.

Notes & related work

  1. nostalgebraist, “Interpreting GPT: the Logit Lens,” 2020.
  2. Chao et al., “JailbreakBench,” 2024.
  3. Meng et al., “Locating and Editing Factual Associations in GPT” (ROME), NeurIPS 2022.
  4. Belrose et al., “Eliciting Latent Predictions from Transformers with the Tuned Lens,” 2023.