
Attention Residuals: the paper I mostly did not understand but completely believe

Before you read this, watch this video from the amazing Jia-Bin Huang. He is my 3blue1brown for LLM stuff.

The last two entries in this diary were about agents. SkillNet was about composing what agents do: skills as packages with dependency graphs. HyperAgents was about how agents improve: evolutionary self-modification of the meta-loop. Both of those papers sit at the application layer. They assume the transformer underneath is a solved problem and build on top of it.

This entry is about the thing underneath.

I need to say something at the start. I understood maybe 60% of this paper. Where I lost the thread was the theoretical analysis: the formal bounds on hidden-state norm growth, the gradient stability proofs, some of the connections to signal propagation theory. I read those sections multiple times. I followed the general argument but not the specific derivations. I am not going to pretend otherwise.

But the 60% that did land hit harder than anything I have read in a while. The paper is called Attention Residuals, from the Kimi team at Moonshot AI. Thirty-six authors. And what they are proposing is a change to something so fundamental that I had never questioned it: the residual connection. The y = x + F(x) that every transformer layer has used since ResNet. They looked at it and asked whether it has been quietly degrading deep networks this entire time.

I want to be clear about the register of this entry. The HyperAgents piece was me pushing back, asking whether the framing outran the mechanism. This one is different. I think this paper is strong. Not "interesting if it holds up" strong. Strong in the way where the methodology is clean, the theory predicts the empirics, and the whole thing ships in a production model. The honest thing to do is say that while also admitting I cannot fully verify the math. So if anyone is actually reading this, bear with these crazy thoughts.


The problem: hidden states grow and nobody talks about it

Quick primer for anyone who needs it (I needed it to actually understand what I was writing, sorry. And believe it or not, AI helps me write these pieces, but I do review every word. In the end this lab diary is more for me to keep a record of my own thoughts). A transformer model is basically a stack of layers. Each layer takes some representation of your input, processes it, and passes it to the next layer. The "hidden state" is just that intermediate representation as it moves through the stack. Think of it as a long list of numbers that encodes what the model "knows" about your input at that point in the pipeline.

Every transformer layer does the same thing at its core. It takes the hidden state, transforms it through attention and a feedforward network, and adds the result back to the original. That addition is the residual connection:

y = x + F(x)

For any math people out there this function might be the most obvious statement ever, but trust me, it helps us non-math people to actually understand what is going on. In plain English: take the input, process it, then add the processed version back to the original. The "add it back" part is the residual connection. It was introduced in ResNet in 2015 because without it, very deep networks could not learn. The gradient (the signal that tells the network how to update its weights during training) would vanish as it traveled backward through too many layers. The residual connection gives the gradient a shortcut to flow through. It works. Nobody questions it.

The Kimi team questions it.

Here is the issue. Each layer adds its contribution to a running sum. Layer 1 produces some output, and it gets added to the input. Layer 2 takes that sum, transforms it, and adds again. By layer 50, the hidden state is the sum of 50 individual layer contributions plus the original input. By layer 100, it is 100 contributions stacked on top of each other.

The problem is not that the sum exists. The problem is that every layer contributes with equal weight. Layer 1's contribution and layer 99's contribution are both just... added. No weighting. No selection. The hidden state "norm" (which is just a fancy word for the magnitude of that list of numbers, how big the values are overall) grows with depth because you keep adding without any mechanism to control it.

You can feel why this is bad even without the math. Imagine you are writing a document collaboratively. Fifty people each add one paragraph. The document is now fifty paragraphs long, and finding any individual person's contribution requires reading through everything else. Now imagine a hundred people. Two hundred. Each new paragraph matters less because it is a smaller fraction of the total. The signal from any individual layer gets progressively diluted as the network gets deeper.

The paper measures this directly. They track hidden-state norms across depth in standard transformers and show exponential growth. The norms just keep climbing. Layer after layer, the accumulated state gets larger, and each new layer's marginal contribution gets proportionally smaller. By the time you reach the deep layers of a 100+ layer model, the early layers' signals have been buried under the sheer magnitude of the accumulated sum.
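You can watch this happen in a few lines of code. This is my own toy, not the paper's setup: I pretend each layer's contribution is an independent random vector of fixed size, which is a big simplification, but the direction of the effect is the same.

import torch

torch.manual_seed(0)
d, depth = 512, 100

h = torch.randn(d)                 # stand-in for the input embedding
norms = []
for _ in range(depth):
    contribution = torch.randn(d)  # stand-in for F(x) at this layer
    h = h + contribution           # y = x + F(x): blind accumulation
    norms.append(h.norm().item())

print(norms[0], norms[24], norms[49], norms[99])
# the norm of the running sum keeps climbing with depth, and each new
# contribution becomes a smaller and smaller fraction of the total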

Pre-Norm, Post-Norm, DeepNorm. These are the standard responses to this problem. They all apply normalization (basically: rescaling the numbers so they do not get too big) at various points in the layer to keep things stable. And they work, to a degree. But they are treating symptoms. The root cause is the accumulation structure itself. You are still doing y = x + F(x) at every layer. You are still blindly adding. The normalization just rescales the result so it does not explode numerically. It does not solve the dilution problem. It does not give later layers a way to selectively attend to earlier layers' contributions.
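For concreteness, here is my shorthand for the two standard placements. Both still do an unweighted add; they just rescale at different points.

import torch.nn as nn

def pre_norm_step(h, sublayer, norm):
    return h + sublayer(norm(h))   # rescale what the sublayer sees, then add blindly

def post_norm_step(h, sublayer, norm):
    return norm(h + sublayer(h))   # add blindly, then rescale the sum

# e.g. norm = nn.LayerNorm(d), sublayer = an attention or MLP block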

When I read this section of the paper, I had one of those moments where you realize you knew the problem existed but had never seen someone name it clearly. Of course the hidden state grows with depth. Of course each layer's contribution gets diluted. It is arithmetically obvious once you think about it. But I had never thought about it, because the residual connection is one of those things that just... is. Like gravity. You do not question y = x + F(x). You question the attention mechanism, the positional encoding, the activation function. Not the wire between layers.

The Kimi team questioned the wire.


What AttnRes does

The fix is elegant in concept, even if the theory behind it is dense.

Instead of y = x + F(x), where every preceding layer's output is accumulated with equal weight, AttnRes computes the residual as an attention operation over all preceding layer outputs.

If you are not familiar with how attention works, here is the short version. Attention is a mechanism where one thing (the "query") gets to look at a bunch of other things (the "keys") and decide how much to care about each one. It produces a weighted mix of those things (the "values"), where the weights are based on relevance. The weights go through a softmax function, which just means they are all positive and they add up to one. So instead of "add everything equally," you get "take a weighted average where the weights reflect what actually matters."

In AttnRes, the current layer is the query. All previous layers are the keys and values. The output is a weighted combination where the weights are learned, input-dependent, and normalized through softmax.

In notation:

y_l = x_l + Attn(x_l, {h_0, h_1, ..., h_{l-1}})

Each layer gets to look back at everything that came before and decide, for this specific input, how much of each preceding layer's output to incorporate. Layer 50 does not just get the blind sum of layers 1 through 49. It gets to attend over them and pick what is relevant.

This is the part that made me stop and re-read. They replaced a fixed-weight accumulation (every layer contributes equally to a running sum) with a softmax attention mechanism (every layer can selectively weight what came before it). The layer itself decides what is relevant from its history.

If you think about it, this is what attention already does within a layer. Tokens attend to other tokens and decide what information is relevant. AttnRes applies the same principle across layers instead of across tokens. Within a layer: "which tokens should I attend to?" Across layers: "which preceding layers should I draw from?" Same mechanism, different axis.
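Spelled out (my own unpacking in the same notation, not the paper's exact formulation, and ignoring multi-head details), the residual term is a softmax-weighted sum over the layer history:

a_i = softmax_i( (x_l W_q) (h_i W_k)^T / sqrt(d_k) )   for i = 0, ..., l-1
y_l = x_l + sum_i a_i (h_i W_v)

Those a_i are the weights the next paragraph is about.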

The softmax normalization is doing real work here. Because the weights always sum to one, the output is always a blend of the inputs, never bigger than the biggest input. This is the key insight. With standard residuals, you keep adding and the numbers keep growing. With AttnRes, you take a weighted average, and averages do not grow. If all your preceding layers have hidden states of magnitude 10, the weighted average is also around 10, not 10 times the number of layers. The magnitude is controlled by construction, not by slapping normalization on top after the fact.

That is the intuition. The paper goes much deeper into why this keeps norm growth sublinear, with formal analysis I could not fully follow. But the conceptual argument is clean: replace blind accumulation with selective attention, and the growth problem disappears because attention normalizes by design.
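Same toy as before, but replacing the running sum with a softmax-weighted mix over the history. Again this is my sketch, with random weights standing in for learned ones, not the paper's construction.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, depth = 512, 100

history = [torch.randn(d)]
norms = []
for _ in range(depth):
    new = torch.randn(d)                                   # this layer's own output
    weights = F.softmax(torch.randn(len(history)), dim=0)  # stand-in for learned attention weights
    mixed = sum(w * h for w, h in zip(weights, history))   # weighted average of the history
    state = new + mixed
    history.append(state)
    norms.append(state.norm().item())

print(norms[0], norms[49], norms[99])
# the mixed term never outgrows the largest thing in the history,
# so the state stays on roughly the same scale no matter how deep you go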

Now, the obvious practical issue. If every layer has to look at every preceding layer, the cost grows quadratically with depth. A 100-layer model means layer 100 has to attend over 99 previous layers. That is a lot of extra computation on top of the attention the model is already doing over the input tokens.

This is where Block AttnRes comes in. The idea is simple: do not attend over every single preceding layer. Instead, group layers into blocks (say, 8 or 16 layers per block). Within each block, use the regular old y = x + F(x) residual. At the boundaries between blocks, do the expensive attention-based residual, looking back at the summary of each previous block. Think of it like reading a book. You do not re-read every sentence before writing the next one. You remember the gist of each chapter and refer back to the chapter summaries when you need context.

It is the same kind of trade-off you see everywhere in systems engineering: full precision where it matters, approximation where it does not. The key architectural decisions happen at block boundaries. Within a block, the standard residual is fine because you are only accumulating over a handful of layers. Across blocks, you need the selective attention to prevent the long-range dilution.

I do not fully understand the math behind why the block variant preserves most of the theoretical guarantees of full AttnRes. The paper derives specific conditions on block size and attention temperature. I followed the argument at a high level: the block-level representations capture enough of the inter-layer dynamics that attending over them approximates attending over individual layers. But the formal proof is beyond what I can verify.


The part I did not understand

I want to be specific about what "did not understand" means here, because it matters.

The paper has several sections of theoretical analysis. There is a formal treatment of hidden-state norm growth under standard residual connections, showing that the expected norm scales with depth under certain assumptions about layer output magnitudes. Then there is a parallel analysis for AttnRes, showing that the softmax-weighted combination bounds the norm to sublinear growth. The specific conditions involve the attention temperature, the dimensionality of the key/query projections, and assumptions about the spectral properties of the layer outputs.

I could follow the setup. I could follow the conclusion. I could not follow the derivation in between.

There is also a gradient stability analysis. Quick context: gradients are the signals that flow backward through the network during training, telling each layer how to adjust its weights. If the gradient gets too big as it flows backward ("gradient explosion"), training blows up. If it gets too small ("gradient vanishing"), the early layers stop learning because the signal is too faint by the time it reaches them. The paper shows that backpropagation through the attention-weighted residual path keeps these gradients stable across 100+ layers, where standard residual connections show either explosion or vanishing depending on the normalization scheme. This connects to signal propagation theory and, in some parts, to Neural Tangent Kernel (NTK) analysis, which is a mathematical framework for understanding how deep networks behave during training. I do not have enough background in NTK to evaluate whether the gradient stability claims are tight bounds or just sufficient conditions. The difference matters. Tight bounds mean "this is exactly what happens." Sufficient conditions mean "this is one scenario under which it works, there may be others."

The relationship between the attention temperature parameter and the norm bound is another place where I got lost. Temperature here is a number that controls how "sharp" or "smooth" the attention weights are. A very high temperature makes all the weights roughly equal (every preceding layer gets the same importance, which is basically the same as standard residuals). A very low temperature makes the weights concentrate on a single layer (the model just copies one previous layer and ignores the rest). The paper derives the conditions for finding the sweet spot between these extremes. I understood both endpoints but not the formal path between them.
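The two endpoints are easy to see numerically. These scores are made up, but the behaviour of softmax at different temperatures is the point:

import torch
import torch.nn.functional as F

scores = torch.tensor([2.0, 1.0, 0.5, -1.0])  # hypothetical relevance of 4 preceding layers

print(F.softmax(scores / 100.0, dim=0))  # high temperature: nearly uniform, back to "add everything equally"
print(F.softmax(scores / 0.1, dim=0))    # low temperature: winner-take-all, copy one layer and ignore the rest
print(F.softmax(scores / 1.0, dim=0))    # in between: a graded, selective mix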

Here is what I want to say about all of this: my inability to follow a proof is not evidence against the proof. This is an important distinction that I think gets lost sometimes in how people talk about papers. "I did not understand it" and "it is wrong" are completely different statements. I did not understand large parts of this paper. The parts I did understand were rigorous and consistent. The empirical measurements match what the theory predicts. The team that wrote this trained a production-scale model based on these ideas and deployed it. None of that guarantees the proofs are correct, but it makes "I personally could not follow the math" a statement about me, not about the paper.


What the results say

This is where I am back on solid ground, because empirical results are something I can evaluate.

The paper applies AttnRes to Kimi Linear, a Mixture-of-Experts (MoE) model with 48 billion total parameters and 3 billion activated parameters. MoE is an architecture where the model has many "expert" sub-networks, but for any given input it only activates a few of them. So you get the knowledge capacity of a 48B model with the inference cost of a 3B model. They trained it on 1.4 trillion tokens. This is not a toy experiment. This is not "we tested on a 125M parameter model and extrapolated." This is a production-scale model from a team that ships a consumer product.
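If MoE routing is new to you, a generic top-k router looks something like this. This is a textbook sketch, not anything specific to Kimi Linear, and the names are mine:

import torch
import torch.nn as nn
import torch.nn.functional as F

def moe_forward(x, experts, router, k=2):
    """Top-k MoE sketch: every expert exists, but only k of them run per input."""
    gate = F.softmax(router(x), dim=-1)           # [batch, num_experts]
    weights, idx = torch.topk(gate, k, dim=-1)    # keep only the k most relevant experts
    out = torch.zeros_like(x)
    for b in range(x.shape[0]):
        for w, e in zip(weights[b], idx[b]):
            out[b] += w * experts[e](x[b])        # only these k experts do any work
    return out

# toy usage: the capacity of 8 experts, the compute of 2
experts = nn.ModuleList([nn.Linear(64, 64) for _ in range(8)])
router = nn.Linear(64, 8)
y = moe_forward(torch.randn(4, 64), experts, router, k=2)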

The results show 2-5% absolute accuracy improvements on standard benchmarks: MMLU, GSM8K, HumanEval, TriviaQA, ARC, HellaSwag, CMMLU. The improvements are consistent across benchmarks. Not dramatic on any single one, but present on all of them.

Two to five percent might not sound like much if you are used to reading papers that claim 20% improvements on cherry-picked benchmarks. But at this scale, on these benchmarks, 2-5% is significant. These are benchmarks where the top models are separated by single-digit percentage points. Every point is hard-won. And the improvements come not from a new training recipe or a bigger dataset or more compute, but from changing the wiring between layers. Same data, same compute budget, same training procedure, just a different residual connection. That is a clean result.

The hidden-state norm measurements are the part that convinced me the most. The paper shows trajectories of hidden-state norms across layer depth. In the Pre-Norm and Post-Norm baselines, the norms grow exponentially with depth. With AttnRes, the growth is sublinear. The graph is not ambiguous. It is not "squint and you can see the difference." The curves are qualitatively different. The theory says the norms should be bounded sublinearly. The measurement shows they are. Theory and measurement agree.

The gradient stability measurements tell a similar story. Remember: gradients are the training signals that flow backward through the model. If they blow up or vanish, the model cannot learn properly. With AttnRes, gradient norms across depth stay in a stable range, roughly between 0.01 and 1.0. The baselines show the familiar pattern of either gradient explosion (norms spiking at certain depths) or gradient vanishing (norms decaying to near-zero in deep layers). AttnRes keeps them in the healthy zone. Again, the theory predicts this and the measurement confirms it.
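I have not reproduced these measurements, but collecting the same two curves on a toy stack is straightforward: record the hidden-state norm at each depth on the way forward and the gradient norm at each depth on the way back. A sketch, assuming the layers are ordinary PyTorch modules and loss_fn returns a scalar:

import torch

def depth_profile(layers, x, loss_fn):
    """Hidden-state norm per layer (forward) and gradient norm per layer (backward)."""
    states = []
    h = x
    for layer in layers:
        h = h + layer(h)   # standard residual stack; swap in the AttnRes version to compare
        h.retain_grad()    # keep the backward signal at this depth
        states.append(h)

    loss_fn(h).backward()

    hidden_norms = [s.norm().item() for s in states]
    grad_norms = [s.grad.norm().item() for s in states]
    return hidden_norms, grad_norms

Run it once with the standard stack and once with the AttnRes forward pass and you get the two pairs of curves the paper plots.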

Compare this to the HyperAgents results, where I spent a whole section asking whether the improvements were as dramatic as they looked. Wide confidence intervals, p-values that did not clear significance for the key comparison, deliberately low starting points that inflated the apparent gains. None of that is happening here. The Attention Residuals results are modest in magnitude (2-5%), clean in methodology (same compute, same data, only the architecture changes), and consistent with the theoretical predictions. I trust modest claims backed by clean evidence more than dramatic claims with confidence intervals wide enough to drive a truck through.

The paper also shows scaling experiments across different model sizes, confirming that the improvements are consistent as you scale up. This is important because a lot of architectural innovations that work at small scale disappear at large scale (or vice versa). AttnRes seems to hold.


The production angle

Something that weighs heavily in my assessment of this paper: the Kimi team ships a real product. Kimi is a consumer-facing AI assistant. The people who wrote this paper are not just publishing results. They are deploying them. Kimi Linear, the model that uses AttnRes, is part of their production stack.

There is a category of paper where the authors tried something clever, it worked on benchmarks, and they published. And there is a category where the authors changed something fundamental in a system that serves real traffic, measured the impact, and then published about it. This is the second kind. The difference matters because production deployment imposes constraints that benchmarks do not. You have to worry about inference latency, memory budgets, serving costs, failure modes, backwards compatibility. Block AttnRes exists precisely because full AttnRes is too expensive for production. The block variant is the engineering compromise that makes the theoretical insight deployable.

I care about this for a specific reason that connects to the rest of this diary. The last several entries have been about infrastructure I am building on top of transformer models. SkillNet is about composable skills for agents. The Context Platform is about structured retrieval. HyperAgents is about self-improvement loops. All of that sits on top of the transformer. If the residual connection, the most basic structural element of those models, has been limiting depth scaling this entire time, then fixing it is not an incremental improvement to one component. It is a foundation repair.

Think about it this way. If the hidden states in a 100-layer model are so diluted by layer 80 that the model is effectively ignoring early-layer representations, then everything that depends on those representations is degraded. The attention patterns are attending to diluted states. The MLP layers are transforming diluted states. The output predictions are based on diluted states. A fix at the residual level propagates upward through every layer that depends on it. It is not a 2-5% improvement to benchmark accuracy. It is a 2-5% improvement to the entire information flow of the model, which benchmark accuracy happens to measure.

That is why I called this "foundation work" in my head when I was reading it. Not because the benchmark numbers are huge. Because the mechanism is deep.


Sketching the mechanism

I wrote some code to make the difference between standard residuals and AttnRes concrete. This is not from the paper. It is me trying to understand the mechanism well enough to implement a toy version.

import torch
import torch.nn.functional as F


def standard_residual_forward(x, layers):
    """
    Standard transformer forward pass.
    Each layer adds to a running sum. No selection. No weighting.
    """
    h = x
    for layer_fn in layers:
        h = h + layer_fn(h)  # blind accumulation
    return h


def attn_res_forward(x, layers, W_q, W_k, W_v, d_k):
    """
    AttnRes forward pass.
    Each layer attends over all preceding layer outputs
    and selectively aggregates them.
    """
    layer_outputs = [x]  # h_0 is the input embedding

    for i, layer_fn in enumerate(layers):
        h_current = layer_fn(layer_outputs[-1])

        # Build query from current layer output
        Q = h_current @ W_q  # [batch, seq, d_k]

        # Build keys and values from ALL preceding layer outputs
        H = torch.stack(layer_outputs, dim=-2)  # [batch, seq, num_layers, d]
        K = H @ W_k  # [batch, seq, num_layers, d_k]
        V = H @ W_v  # [batch, seq, num_layers, d]

        # Attend: current layer decides what to keep from history
        scores = (Q.unsqueeze(-2) @ K.transpose(-2, -1)) / (d_k ** 0.5)
        weights = F.softmax(scores, dim=-1)

        # Selective aggregation instead of blind sum
        residual = (weights @ V).squeeze(-2)

        layer_outputs.append(h_current + residual)

    return layer_outputs[-1]

Staring at this, the difference is immediate. In standard_residual_forward, the hidden state h is a running sum that grows with every layer. There is no mechanism for layer 50 to say "I need the representation from layer 3 but not from layer 47." Everything is added. Everything accumulates. The hidden state is a democracy where every layer gets exactly one vote regardless of relevance.

In attn_res_forward, each layer computes attention over the full history and decides, based on the actual content of the hidden states, which preceding layers matter for this particular input. The softmax ensures the weights sum to one, so you get a weighted average instead of a growing sum. The magnitude is controlled. The selection is learned.
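To check that the sketch actually runs, here is how I exercised it on random data. The layers are just random linear maps, so this verifies the mechanics and the shapes, not anything about learning:

torch.manual_seed(0)
batch, seq, d, d_k, depth = 2, 16, 64, 32, 24

layers = [torch.nn.Linear(d, d) for _ in range(depth)]
W_q = torch.randn(d, d_k) / d ** 0.5
W_k = torch.randn(d, d_k) / d ** 0.5
W_v = torch.randn(d, d) / d ** 0.5

x = torch.randn(batch, seq, d)
h_standard = standard_residual_forward(x, layers)
h_attnres = attn_res_forward(x, layers, W_q, W_k, W_v, d_k)

print("input norm:   ", x.norm(dim=-1).mean().item())
print("standard res: ", h_standard.norm(dim=-1).mean().item())
print("attn res:     ", h_attnres.norm(dim=-1).mean().item())

The exact numbers depend on the random initialization. The thing to look for is that the blindly accumulated state drifts far from the input scale while the attention-weighted one stays in the same ballpark.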

The Block AttnRes variant would add one more level. Instead of attending over every single preceding layer (which gets expensive for deep networks), you group layers into blocks of, say, 8 or 16 layers. Within each block, standard residuals are fine. At block boundaries, you do the attention over block-level representations. Same principle, manageable cost:

def block_attn_res_forward(x, layers, block_size, W_q, W_k, W_v, d_k):
    """Block AttnRes: attend at block boundaries, standard residuals within."""
    block_outputs = [x]
    h = x

    for i, layer_fn in enumerate(layers):
        h = h + layer_fn(h)  # standard residual within block

        if (i + 1) % block_size == 0:
            # Block boundary: attend over all preceding block outputs
            Q = h @ W_q
            H = torch.stack(block_outputs, dim=-2)
            K, V = H @ W_k, H @ W_v
            scores = (Q.unsqueeze(-2) @ K.transpose(-2, -1)) / (d_k ** 0.5)
            weights = F.softmax(scores, dim=-1)
            h = h + (weights @ V).squeeze(-2)
            block_outputs.append(h)

    return h

The cross-layer attention cost drops from O(L^2) to roughly O(B^2), where B = L / block_size is the number of blocks. For a 128-layer model with blocks of 16, that is 8 boundaries attending over at most 8 block summaries each: a few dozen cross-block lookups instead of the thousands of layer-to-layer lookups full AttnRes would need. Orders of magnitude cheaper, and according to the paper, most of the theoretical benefit is preserved.
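Sanity-checking the counting for myself (pure arithmetic, nothing from the paper):

L, block_size = 128, 16
full_lookups = sum(range(1, L + 1))            # layer i attends over i earlier outputs
num_blocks = L // block_size
block_lookups = sum(range(1, num_blocks + 1))  # boundary j attends over j earlier block outputs
print(full_lookups, block_lookups)             # 8256 vs 36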


Where my head is at

This paper changed something in how I think about the transformer architecture. I have been treating the residual connection as settled infrastructure. Like TCP/IP, or the ASCII character set. Something that works well enough that you never revisit it. You question the attention mechanism, you question the positional encoding, you question the normalization scheme. You do not question y = x + F(x). It is the foundation, and foundations are not supposed to move.

This paper moved it. And the argument for moving it is not speculative. It is backed by theory (which I partially understand), by empirical measurement (which I fully understand), and by production deployment (which I can verify externally). That combination is rare. A lot of architecture papers have theory without production. Or production results without clean theory. Having both, and having them agree, is what makes me believe this even though I cannot independently verify the formal analysis.

The connection to the rest of this diary is something I keep thinking about. I have been building infrastructure for agents: skill systems, context platforms, self-improvement loops. All of that runs on transformer models. If the residual connection has been a bottleneck for depth scaling all along, and AttnRes removes it, then the models my agent infrastructure sits on top of get better. Not because of anything I did. Because someone fixed the plumbing.

There is something humbling about that. I spend my time at the application layer, worrying about prompt structure and tool selection and orchestration patterns. The Kimi team spent their time staring at y = x + F(x) and asking whether it was the right equation. Both kinds of work matter. But the foundation work has a multiplier effect that application work does not. A better residual connection improves every model that uses it, which improves every application built on those models, which improves every user interaction those applications serve. My prompt engineering improves one agent for one task.

I said at the start that I did not understand about 40% of this paper. I want to end by saying something about what that feels like. It is not frustrating. It is motivating. The gap between what I understood and what the paper contains is a map of what I need to learn. The theoretical analysis, the signal propagation theory, the formal norm bounds. These are not decorative. They are load-bearing. The empirical results convinced me the mechanism works. Understanding the theory would tell me why it works and where it might break. That is worth pursuing.

The paper is one of the most methodologically solid things I have read in the last few months. Not because of flash, not because of headline numbers. Because the theory is explicit, the experiments are clean, and the whole thing runs in production. That is the bar.

More reading. More math, probably. I will come back to this.


References