How Attention Works in Large Language Models

My notes on the attention mechanism in LLMs

April 3, 2026

Attention is the mechanism that made transformers practical for large-scale language modeling. It is also one of those ideas that looks simple after you have seen it enough times, but feels slippery the first time you come across it.

I recently revisited the self-attention mechanism while trying to build a minimal Transformer from scratch. Like many engineers, I had used libraries where attention is a one-liner, but while digging into its implementation I thought it might be useful to put my notes on my website instead of leaving them on paper or on my whiteboard.

The post here is deliberately expansive. I wanted it all in one place because I often find myself having to remind myself how something works when I'm trying an alternative approach, or reading a paper about something similar and needing to compare.

Why attention replaced strictly sequential sequence models
[Figure: a sequential RNN / LSTM path compared with the dense interaction graph of self-attention over tokens x₁, x₂, x₃. Long paths make distant interactions harder to preserve; with self-attention, any token can attend to any other token without passing through many steps.]
The left path has to pass information token by token. The right path lets every token form direct connections in a single layer.

The Problem

Before transformers, sequence modeling was dominated by recurrent networks like RNNs and LSTMs.

They worked, but they carried two structural limitations:

  1. They process tokens sequentially.
  2. Information has to travel through many intermediate steps.

That creates two familiar problems.

1. Long-range dependencies degrade

Suppose a model is reading:

"The research paper that I read last week, after several dense appendices, was surprisingly clear."

When the model reaches the word clear, it may need information from paper. In a recurrent architecture, that information has to survive many update steps. Even if gating helps, the path between those two words is still long.

2. Training is hard to parallelize

The hidden state at time step t depends on the hidden state at time step t - 1. That means the model cannot process the whole sequence in one batched matrix multiplication. Throughput suffers.

CNN-based sequence models improve the parallelism story, but they still depend on stacked local receptive fields. Global interaction does not appear immediately. It has to be built up layer by layer.

So the design target becomes:

  • every token should be able to interact with every other token
  • the model should decide which interactions matter
  • the whole computation should be expressible with matrix operations

That target is exactly where attention comes from.

Initial Intuition

My first instinct when I thought about this was embarrassingly naive:

Why not just average all token embeddings together and give every token the global average?

That fails immediately. If every token receives the same average, then the representation of each token becomes less specific, not more specific. The sentence collapses into a single blurry summary.

The next idea is closer:

  1. compare tokens to each other
  2. measure how relevant one token is to another
  3. use those relevance scores to form a weighted mixture

That is already the essence of attention.

What remains is to make that idea:

  • learnable
  • stable
  • efficient
  • compatible with batched matrix operations

The Core Question

At its heart, self-attention answers one question:

For each token, which other tokens should it draw information from, and by how much?

Everything else in the mechanism exists to answer that question in a differentiable, trainable way.

Setup and Notation

Assume an input sequence has n tokens, and each token embedding has dimension d_{model}.

We stack the token embeddings row-wise:

Input Matrix
X \in \mathbb{R}^{n \times d_{model}}, \qquad X = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}

Here:

  • n is the sequence length
  • x_i \in \mathbb{R}^{d_{model}} is the embedding for token i

The goal is to transform X into a new matrix of context-aware token representations.

Step 1: Project Tokens into Queries, Keys, and Values

The first important move is to stop using the raw embeddings directly for comparison.

Instead, each token is projected into three learned spaces:

  • Query: what this token is looking for
  • Key: what this token offers
  • Value: the information this token contributes if selected
From embeddings to queries, keys, and values
[Figure: input embeddings X (n × d_model) projected through W_Q, W_K (d_model × d_k) and W_V (d_model × d_v) into queries Q (n × d_k), keys K (n × d_k), and values V (n × d_v).]
Every token embedding is projected three times. Queries decide what to look for, keys expose what each token contains, and values carry the information that is mixed together.

We define three learned matrices:

Learned Projections
W_Q \in \mathbb{R}^{d_{model} \times d_k}, \qquad W_K \in \mathbb{R}^{d_{model} \times d_k}, \qquad W_V \in \mathbb{R}^{d_{model} \times d_v}

Then:

Q, K, and V
Q = XW_Q \in \mathbb{R}^{n \times d_k}
K = XW_K \in \mathbb{R}^{n \times d_k}
V = XW_V \in \mathbb{R}^{n \times d_v}

This separation is subtle, but it is the first big conceptual leap.

If we used the same embedding for matching and for information transfer, then the model would have to use one representation for two different jobs:

  1. deciding relevance
  2. carrying content

Splitting them lets the network learn those roles independently.
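As a quick sanity check on the shapes, the three projections can be sketched in a few lines of NumPy. The dimensions here are arbitrary toy sizes, and the random matrices stand in for trained weights:

```python
import numpy as np

rng = np.random.default_rng(0)

n, d_model, d_k, d_v = 4, 8, 6, 5     # toy sizes, chosen arbitrarily

X = rng.normal(size=(n, d_model))     # token embeddings, one row per token

# learned projection matrices (random stand-ins for trained weights)
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_v))

Q = X @ W_Q   # (n, d_k): what each token is looking for
K = X @ W_K   # (n, d_k): what each token offers
V = X @ W_V   # (n, d_v): what each token contributes if selected

print(Q.shape, K.shape, V.shape)      # (4, 6) (4, 6) (4, 5)
```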

Why Queries and Keys?

Suppose token i is asking: "who should I pay attention to?"

Its query vector q_i encodes that request.

Every other token j exposes a key vector k_j that says what kind of information it contains. If q_i aligns strongly with k_j, then token j is relevant to token i.

That gives us a natural scoring function:

Pairwise Compatibility
\text{score}(i, j) = q_i \cdot k_j

The dot product is not the only possible similarity measure, but it is simple, differentiable, efficient, and easy to batch.

Step 2: Build the Score Matrix

Now we compute all pairwise query-key interactions at once:

Attention scores form a learned relationship matrix
[Figure: Q (n × d_k) multiplied by Kᵀ (d_k × n) produces the score matrix S = QKᵀ (n × n); softmax over each row turns scores into attention weights A.]
Row i answers: if token i is asking the question, how strongly does it align with every token j?
Raw Attention Scores
S = QK^\top

Because:

  • Q has shape n \times d_k
  • K^\top has shape d_k \times n

their product has shape:

Shape of the Score Matrix
S \in \mathbb{R}^{n \times n}

Entry (i, j) tells us how much token i should attend to token j before normalization.

This matrix is the learned relationship graph of the sequence.

A Concrete Worked Example

Take a very short token sequence:

[\texttt{the}, \texttt{cat}, \texttt{sat}, \texttt{there}]

Imagine we are updating the representation of sat.

Its query might align strongly with:

  • cat, because the verb wants its subject
  • there, because location information may matter
  • less strongly with the, because determiners often contribute less semantic content

If the raw scores for sat are:

[1.2,\ 3.1,\ 2.7,\ 2.4]

then attention will not keep them as raw magnitudes. It will turn them into a normalized distribution.

That is the next step.

Step 3: Scale the Scores

If we stop at QK^\top, the mechanism works in principle, but training becomes unstable as dimensionality grows.

The standard transformer rescales the scores by \sqrt{d_k}:

Scaled Dot-Product Attention Scores
\tilde{S} = \frac{QK^\top}{\sqrt{d_k}}

Why this specific factor?

Because a dot product over d_k dimensions is a sum of many terms. If each term has roughly unit variance, then the variance of the sum grows with d_k.

So as the key/query dimension gets larger:

  • raw scores become larger in magnitude
  • softmax becomes sharper
  • gradients become less well-behaved

Dividing by \sqrt{d_k} counteracts that growth.

More formally, if:

q_i \cdot k_j = \sum_{\ell = 1}^{d_k} q_{i\ell} k_{j\ell}

and the summands are roughly independent with variance near 1, then the variance of the sum grows proportionally to d_k. Scaling by \sqrt{d_k} brings the magnitude back into a range where softmax behaves more smoothly.

This one line is not cosmetic. It is one of the reasons attention trains reliably at scale.
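The variance argument is easy to check empirically. This is a rough simulation I'm adding for illustration, not part of the original derivation: with unit-variance random queries and keys, the raw dot products spread out as d_k grows, while the scaled scores stay close to unit variance.

```python
import numpy as np

rng = np.random.default_rng(0)

for d_k in (16, 256):
    q = rng.normal(size=(10_000, d_k))   # unit-variance components
    k = rng.normal(size=(10_000, d_k))

    raw = (q * k).sum(axis=1)            # raw dot products over d_k terms
    scaled = raw / np.sqrt(d_k)          # scaled dot products

    # raw variance grows roughly like d_k; scaled variance stays near 1
    print(d_k, raw.var().round(1), scaled.var().round(2))
```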

Step 4: Normalize Row-Wise with Softmax

The model now has a matrix of scaled compatibility scores. But those scores are not yet usable as mixing weights.

We want each token to distribute one unit of attention mass across the sequence.

So we apply softmax across each row:

Attention Weights
A = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)

This means:

A_{ij} = \frac{\exp(\tilde{S}_{ij})}{\sum_{m=1}^{n} \exp(\tilde{S}_{im})}

Important consequences:

  • every entry is nonnegative
  • every row sums to 1
  • each row is a probability distribution over tokens
A toy attention pattern
A = \begin{bmatrix} 0.55 & 0.25 & 0.12 & 0.08 \\ 0.10 & 0.64 & 0.18 & 0.08 \\ 0.06 & 0.18 & 0.58 & 0.18 \\ 0.05 & 0.10 & 0.20 & 0.65 \end{bmatrix} \quad \text{(rows and columns indexed by } t_1, t_2, t_3, t_4\text{)}
A small example makes the row-wise normalization visible. Darker cells indicate larger attention mass.

Now row i answers the question:

When token i looks across the sequence, how much weight should it place on each token?
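The row-wise normalization and its consequences can be verified directly. A small NumPy sketch with arbitrary scores:

```python
import numpy as np

def softmax_rows(S):
    # subtract the row max for numerical stability, then normalize each row
    e = np.exp(S - S.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

S = np.array([[2.0, 1.0, 0.1],
              [0.5, 0.5, 0.5],
              [3.0, -1.0, 0.0]])

A = softmax_rows(S)

print(A.min() >= 0)                      # every entry is nonnegative: True
print(np.allclose(A.sum(axis=1), 1.0))   # every row sums to 1: True
```

Note the second row of S is constant, so its attention weights come out uniform: equal scores mean equal mixing.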

Step 5: Aggregate the Values

Once we have attention weights, we use them to mix the value vectors.

Weighted Value Aggregation
O = AV

Since:

  • A \in \mathbb{R}^{n \times n}
  • V \in \mathbb{R}^{n \times d_v}

the output has shape:

O \in \mathbb{R}^{n \times d_v}

Each output row is:

o_i = \sum_{j=1}^{n} A_{ij} v_j

That is the complete mechanism.
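The five steps above fit in a few lines. A toy NumPy version of the single-head formula, with arbitrary dimensions:

```python
import numpy as np

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n, n) scaled scores
    scores -= scores.max(axis=-1, keepdims=True)  # stability shift
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)            # row-wise softmax
    return A @ V, A                               # mix values with weights

rng = np.random.default_rng(0)
n, d_k, d_v = 5, 4, 3
Q, K = rng.normal(size=(2, n, d_k))
V = rng.normal(size=(n, d_v))

O, A = attention(Q, K, V)
print(O.shape, A.shape)   # (5, 3) (5, 5)
```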

Self-Attention in One Line
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V

And for self-attention, Q, K, and V all come from the same input sequence X.

Interpreting the Output

After attention, token i is no longer just its original embedding. It becomes a context-aware mixture of the whole sequence.

This is the decisive difference from older sequence models.

The representation of a token is no longer forced to evolve only through a local sequential state update. It can be rebuilt directly from whichever tokens the model deems relevant.

That makes attention both expressive and parallelizable.

A Shape-First Summary

For reference, the shape flow for a single attention head is:

Object  | Meaning            | Shape
X       | input embeddings   | n × d_model
W_Q     | query projection   | d_model × d_k
W_K     | key projection     | d_model × d_k
W_V     | value projection   | d_model × d_v
Q       | queries            | n × d_k
K       | keys               | n × d_k
V       | values             | n × d_v
QKᵀ     | raw scores         | n × n
A       | attention weights  | n × n
O       | output             | n × d_v

If you remember only one thing operationally, remember this: the n × n attention matrix mixes information across tokens; every other matrix in the pipeline only changes feature dimensions.

Why This Solves the Original Problem

Return to the earlier design target.

Every token can look at every other token

That is built directly into QK^\top. The mechanism computes all pairwise interactions in one shot.

The model decides what matters

The learned projections W_Q, W_K, and W_V are trained end to end. There are no hand-written matching rules.

The computation parallelizes well

Everything is a matrix multiplication or elementwise transformation. GPUs are very good at this pattern.

Causal Masking for Decoder-Only LLMs

So far, the formulation allows every token to attend to every other token.

That is fine for bidirectional encoders. It is not fine for autoregressive language modeling.

In a decoder-only LLM, token t_i must not look at future tokens t_{i+1}, t_{i+2}, \dots. Otherwise the model would cheat during training.

The fix is a causal mask.

Causal masking prevents peeking into the future
[Figure: a 5 × 5 grid over t₁…t₅ with allowed positions in the lower triangle and masked positions above it. For token t₄, only t₁ through t₄ are visible. Scores to later tokens are set to −∞ before softmax, which makes their probability exactly 0 and keeps training and inference aligned with left-to-right next-token prediction.]
Decoder-only language models can only use tokens at or before the current position, so the upper triangle is masked before softmax.

Before softmax, we add a mask matrix M:

Masked Attention
A = \text{softmax}\left(\frac{QK^\top + M}{\sqrt{d_k}}\right)

where:

M_{ij} = \begin{cases} 0 & \text{if } j \le i \\ -\infty & \text{if } j > i \end{cases}

Because \exp(-\infty) = 0, masked future positions receive exactly zero attention mass after softmax.

This is the decoder trick that turns generic self-attention into a valid next-token predictor.
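The mask is easy to build and check with a lower-triangular matrix. A NumPy sketch:

```python
import numpy as np

n = 5
# lower-triangular boolean matrix: position i may see positions j <= i
allowed = np.tril(np.ones((n, n), dtype=bool))

scores = np.random.default_rng(0).normal(size=(n, n))
scores = np.where(allowed, scores, -np.inf)   # mask future positions

# row-wise softmax; exp(-inf) = 0, so future tokens get zero mass
e = np.exp(scores - scores.max(axis=-1, keepdims=True))
A = e / e.sum(axis=-1, keepdims=True)

print(np.allclose(A[~allowed], 0.0))   # True: no attention to the future
print(np.allclose(A.sum(axis=1), 1.0)) # True: rows still normalize
```

The first row is the degenerate case: token t₁ can only see itself, so all of its attention mass lands on position 1.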

Multi-Head Attention

A single attention head gives one learned pattern of interaction. In practice, that is too restrictive.

Different relationships may matter simultaneously:

  • subject-verb agreement
  • long-range references
  • punctuation structure
  • local phrase grouping
  • induction-like copying behavior

So transformers run several attention heads in parallel.

Multi-head attention learns several relationship types at once
[Figure: input X (n × d_model) flows into several heads, each capturing a different pattern; their outputs are concatenated to n × (h · d_v) and projected by W_O back to d_model.]
Each head has its own learned Q, K, and V projections. Their outputs are concatenated and projected back into the model space.

For head h:

Q^{(h)} = XW_Q^{(h)}, \qquad K^{(h)} = XW_K^{(h)}, \qquad V^{(h)} = XW_V^{(h)}

Each head computes:

O^{(h)} = \text{softmax}\left(\frac{Q^{(h)} K^{(h)\top}}{\sqrt{d_k}}\right)V^{(h)}

Then the heads are concatenated:

\text{Concat}(O^{(1)}, O^{(2)}, \dots, O^{(H)})

and projected back:

\text{MHA}(X) = \text{Concat}(O^{(1)}, \dots, O^{(H)})W_O

with W_O mapping the concatenated representation back into the model dimension.

The reason this helps is not mystical. It is representational.

Each head gets its own learned subspace for matching and aggregation. That lets the model discover several distinct relational patterns at once instead of forcing all interaction types through one shared score matrix.
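In common implementations the per-head projections are fused into one big matrix and split with a reshape. A minimal NumPy sketch under that assumption (head count, sizes, and the fused W_qkv layout are all my own illustrative choices):

```python
import numpy as np

def multi_head_attention(X, W_qkv, W_o, n_heads):
    n, d_model = X.shape
    d_head = d_model // n_heads

    # one fused projection, then split into Q, K, V
    Q, K, V = np.split(X @ W_qkv, 3, axis=-1)

    # reshape to (n_heads, n, d_head) so each head attends independently
    def split_heads(M):
        return M.reshape(n, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = map(split_heads, (Q, K, V))

    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (h, n, n)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A = e / e.sum(axis=-1, keepdims=True)

    O = A @ V                                             # (h, n, d_head)
    O = O.transpose(1, 0, 2).reshape(n, d_model)          # concat heads
    return O @ W_o                                        # back to d_model

rng = np.random.default_rng(0)
n, d_model, n_heads = 6, 8, 2
X = rng.normal(size=(n, d_model))
W_qkv = rng.normal(size=(d_model, 3 * d_model))
W_o = rng.normal(size=(d_model, d_model))

print(multi_head_attention(X, W_qkv, W_o, n_heads).shape)  # (6, 8)
```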

Positional Information: Attention Alone Does Not Know Order

Pure attention is permutation-invariant with respect to the set of token embeddings. If you shuffled the tokens but kept the same embeddings, the mechanism itself has no built-in notion of first, second, or last.

That means attention alone does not know sequence order.

Attention alone has no built-in sense of order
[Figure: without position, the token set [cat, sat, there] is permutation-invariant and matching cannot distinguish order by itself; with positions added (xᵢ + pᵢ, e.g. [cat + p₁, sat + p₂, there + p₃]), content and position travel together, so the model can reason about sequence order.]
Without positional information, the same set of token embeddings can be permuted with no way for the model to know which token came first.

So transformers inject positional information. Conceptually, the model does not see just x_i. It sees something like:

z_i = x_i + p_i

where p_i is a positional representation for token position i.

This can be done in several ways:

  • fixed sinusoidal encodings
  • learned positional embeddings
  • rotary or relative position mechanisms in newer models
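The fixed sinusoidal variant from the original transformer is the easiest one to sketch. A NumPy version (constants follow the standard 10000-base formulation):

```python
import numpy as np

def sinusoidal_encoding(n_positions, d_model):
    # even dims get sine, odd dims get cosine, at geometrically spaced frequencies
    pos = np.arange(n_positions)[:, None]          # (n_positions, 1)
    i = np.arange(d_model // 2)[None, :]           # (1, d_model / 2)
    freq = 1.0 / (10000.0 ** (2 * i / d_model))

    P = np.zeros((n_positions, d_model))
    P[:, 0::2] = np.sin(pos * freq)
    P[:, 1::2] = np.cos(pos * freq)
    return P

P = sinusoidal_encoding(50, 16)
print(P.shape)                          # (50, 16)
print(P.min() >= -1 and P.max() <= 1)   # values stay in [-1, 1]: True
```

Each position gets a distinct pattern of sines and cosines, so adding P to the embeddings makes order recoverable from the representation itself.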

The implementation details vary, but the conceptual role is the same: give each token a representation that carries not just what it is, but where it sits in the sequence.

Cross-Attention

Self-attention uses one sequence to produce Q, K, and V.

Cross-attention mixes two sequences:

  • queries come from one sequence
  • keys and values come from another
Cross-attention mixes two sequences instead of one
[Figure: decoder states produce queries Q; encoder outputs produce keys K and values V; the result is a cross-attended decoder update.]
Queries usually come from the current decoder state, while keys and values come from another sequence such as encoder outputs.

For example:

Q = X_{\text{decoder}} W_Q, \qquad K = X_{\text{encoder}} W_K, \qquad V = X_{\text{encoder}} W_V

Then the same attention formula applies:

\text{CrossAttention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V

The difference is not in the equation. It is in where the inputs come from.
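The asymmetry shows up in the shapes: with m decoder queries and n encoder keys/values, the weight matrix is m × n rather than square. A NumPy sketch with made-up sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, d_model, d_k, d_v = 3, 7, 8, 4, 5   # decoder len, encoder len, toy dims

X_dec = rng.normal(size=(m, d_model))
X_enc = rng.normal(size=(n, d_model))

W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_v))

Q = X_dec @ W_Q          # queries from the decoder sequence
K = X_enc @ W_K          # keys from the encoder sequence
V = X_enc @ W_V          # values from the encoder sequence

scores = Q @ K.T / np.sqrt(d_k)                      # (m, n), not square
e = np.exp(scores - scores.max(axis=-1, keepdims=True))
A = e / e.sum(axis=-1, keepdims=True)

print((A @ V).shape)   # (3, 5): one context vector per decoder position
```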

Why Dot Products Work So Well

A natural question is: why use a dot product at all?

Because the dot product between q_i and k_j acts as a learned compatibility score:

  • if they point in similar directions, the score is large
  • if they are orthogonal or opposed, the score is small or negative

That means the model learns a geometry where relevant matches become aligned.

This is one of the cleanest aspects of attention: the mechanism is not just symbolic matching. It is geometric matching in a learned vector space.

Complexity and Trade-Offs

Attention is powerful, but it is not free.

The score matrix has shape n \times n, so both compute and memory grow quadratically with sequence length.

Asymptotic Cost
\text{time complexity} \sim O(n^2 d_k), \qquad \text{memory for attention map} \sim O(n^2)

This becomes expensive for long contexts. That is why there is so much research into:

  • sparse attention
  • windowed attention
  • linear attention approximations
  • FlashAttention-style memory-efficient kernels

But the standard scaled dot-product mechanism remains the reference point because it is conceptually simple and empirically strong.
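The quadratic growth is easy to make concrete. Counting attention-map entries alone, at 4 bytes per float32, per head, per layer:

```python
# entries in the n x n attention map, and their float32 footprint
for n in (1_024, 8_192, 65_536):
    entries = n * n
    mib = entries * 4 / 2**20   # float32 bytes -> MiB
    print(f"n={n:>6}: {entries:>13,} entries = {mib:,.0f} MiB")
```

At n = 1,024 the map costs 4 MiB; at n = 65,536 it is 16 GiB, which is exactly the kind of blow-up memory-efficient kernels are designed to avoid materializing.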

A Minimal PyTorch Implementation

Below is the cleanest implementation that still preserves the core mechanics.

import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfAttention(nn.Module):
    def __init__(self, d_model, d_k=None, d_v=None):
        super().__init__()
        d_k = d_model if d_k is None else d_k
        d_v = d_model if d_v is None else d_v

        self.d_k = d_k
        self.W_q = nn.Linear(d_model, d_k, bias=False)
        self.W_k = nn.Linear(d_model, d_k, bias=False)
        self.W_v = nn.Linear(d_model, d_v, bias=False)

    def forward(self, x, mask=None):
        # x: (batch, seq_len, d_model)
        Q = self.W_q(x)                       # (batch, seq_len, d_k)
        K = self.W_k(x)                       # (batch, seq_len, d_k)
        V = self.W_v(x)                       # (batch, seq_len, d_v)

        scores = Q @ K.transpose(-2, -1)      # (batch, seq_len, seq_len)
        scores = scores / (self.d_k ** 0.5)

        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))

        attn = F.softmax(scores, dim=-1)      # row-wise over keys
        out = attn @ V                        # (batch, seq_len, d_v)

        return out, attn

The essential points are:

  • K.transpose(-2, -1) aligns the last two dimensions for matrix multiplication
  • softmax is applied across the sequence axis, not across the feature axis
  • the output is a weighted sum of value vectors, not key vectors

Tensor Shapes in Batch Form

If x has shape:

(\text{batch}, n, d_{model})

then:

Q, K \in \mathbb{R}^{\text{batch} \times n \times d_k}
V \in \mathbb{R}^{\text{batch} \times n \times d_v}
QK^\top \in \mathbb{R}^{\text{batch} \times n \times n}
AV \in \mathbb{R}^{\text{batch} \times n \times d_v}

That last line is the easiest place to make a mental mistake. The attention matrix mixes across tokens, not across feature dimensions.

A Small Numerical Example

Take one query vector and three key vectors:

q = [1, 0]
k_1 = [1, 0], \qquad k_2 = [0.5, 0.5], \qquad k_3 = [0, 1]

Then the raw scores are:

q \cdot k_1 = 1, \qquad q \cdot k_2 = 0.5, \qquad q \cdot k_3 = 0

If we apply softmax to [1, 0.5, 0], the first token gets the largest weight, the second gets some attention, and the third gets the least.

If the value vectors are:

v_1 = [3, 0], \qquad v_2 = [1, 1], \qquad v_3 = [0, 2]

then the output is the weighted mixture:

o = a_1 v_1 + a_2 v_2 + a_3 v_3

This tiny example captures the entire mechanism. Larger models just do it in high-dimensional spaces, with learned projections, across all tokens at once.
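The example above can be checked mechanically. A NumPy version (no \sqrt{d_k} scaling here, since the dimensions are tiny and the point is the mixture):

```python
import numpy as np

q  = np.array([1.0, 0.0])
Ks = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
Vs = np.array([[3.0, 0.0], [1.0, 1.0], [0.0, 2.0]])

scores = Ks @ q                              # [1.0, 0.5, 0.0]
a = np.exp(scores) / np.exp(scores).sum()    # softmax attention weights

o = a @ Vs                                   # weighted mixture of values

print(a.round(3))   # weights decrease: a1 > a2 > a3
print(o.round(3))   # the output leans toward v1, the best match
```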

What Attention Is Not

There are a few common misunderstandings worth clearing up.

Attention is not just "importance"

An attention weight is not a universal measure of semantic importance. It is a context-dependent routing weight inside a specific head and layer.

Attention is not explanation by default

People often visualize attention maps as if they were direct explanations of model reasoning. Sometimes they are useful hints. They are not the entire story of how the model computes.

Attention is not the whole transformer

A transformer block also includes:

  • residual connections
  • normalization
  • a position-wise feed-forward network

Attention is the routing mechanism, but not the entire architecture.

Intuition That I Keep Coming Back To

The cleanest way I know to remember attention is this:

Each token asks a question (its query), every token advertises what it knows (its key), the match between the two sets the routing weights, and the values carry the content that actually gets mixed.

That framing preserves all the math without reducing it to magic.

Final Summary

The attention mechanism in large language models emerges from a straightforward sequence of ideas:

  1. compare every token with every other token
  2. make the comparison learnable using query and key projections
  3. stabilize the scores with \sqrt{d_k}
  4. normalize row-wise with softmax
  5. use the resulting weights to mix value vectors

That gives us:

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V

and from that one equation we get:

  • global context
  • learned routing
  • parallel computation
  • a practical foundation for modern transformers

This is the core mechanism. Multi-head attention, masking, positional information, and cross-attention are all extensions or constraints around that same core.

For me, the important shift is this: attention is not a mysterious black box layer. It is a structured differentiable lookup over the sequence, where similarity determines routing and routing determines representation.

That is what makes it such a powerful building block for language models.