How Attention Works in Large Language Models
My notes on the attention mechanism in LLMs
April 3, 2026
Attention is the mechanism that made transformers practical for large-scale language modeling. It is also one of those ideas that looks simple after you have seen it enough times, but feels slippery the first time you come across it.
I recently revisited the self-attention mechanism while trying to build a minimal Transformer from scratch. Like many engineers, I had used libraries where attention is a one-liner, but while digging into its implementation I figured my notes would be more useful on my website than on paper or on my whiteboard.
The post here is deliberately expansive. I wanted it all in one place, because I often find myself needing to remember how something worked when I'm trying an alternative approach, or reading a paper about something similar and need to compare.
The Problem
Before transformers, sequence modeling was dominated by recurrent networks like RNNs and LSTMs.
They worked, but they carried two structural limitations:
- They process tokens sequentially.
- Information has to travel through many intermediate steps.
That creates two familiar problems.
1. Long-range dependencies degrade
Suppose a model is reading:
"The research paper that I read last week, after several dense appendices, was surprisingly clear."
When the model reaches the word clear, it may need information from paper. In a recurrent architecture, that information has to survive many update steps. Even if gating helps, the path between those two words is still long.
2. Training is hard to parallelize
The hidden state at time step t depends on the hidden state at time step t-1. That means the model cannot process the whole sequence in one batched matrix multiplication. Throughput suffers.
CNN-based sequence models improve the parallelism story, but they still depend on stacked local receptive fields. Global interaction does not appear immediately. It has to be built up layer by layer.
So the design target becomes:
- every token should be able to interact with every other token
- the model should decide which interactions matter
- the whole computation should be expressible with matrix operations
That target is exactly where attention comes from.
Initial Intuition
My first instinct when I originally thought about this was embarrassingly naive:
Why not just average all token embeddings together and give every token the global average?
That fails immediately. If every token receives the same average, then the representation of each token becomes less specific, not more specific. The sentence collapses into a single blurry summary.
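To see the collapse concretely, here is a minimal sketch with illustrative sizes (5 tokens, 4-dimensional embeddings — the numbers are my choice, not from any real model):

```python
import torch

torch.manual_seed(0)

# Hypothetical toy setup: 5 tokens with 4-dimensional embeddings.
x = torch.randn(5, 4)

# Naive idea: give every token the same global average.
pooled = x.mean(dim=0, keepdim=True).expand_as(x)

# Every row is now identical -- token identity is gone.
print(torch.allclose(pooled[0], pooled[3]))  # True
```

All five tokens end up with the exact same representation, which is the opposite of what we want.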
The next idea is closer:
- compare tokens to each other
- measure how relevant one token is to another
- use those relevance scores to form a weighted mixture
That is already the essence of attention.
What remains is to make that idea:
- learnable
- stable
- efficient
- compatible with batched matrix operations
The Core Question
At its heart, self-attention answers one question:

For each token, which other tokens in the sequence are relevant right now, and how should their information be combined?

Everything else in the mechanism exists to answer that question in a differentiable, trainable way.
Setup and Notation
Assume an input sequence has n tokens, and each token embedding has dimension d_model.
We stack the token embeddings row-wise into a matrix:

X = [x_1; x_2; ...; x_n], with X of shape (n, d_model)

Here:
- n is the sequence length
- x_i is the embedding for token i
The goal is to transform X into a new matrix of context-aware token representations.
Step 1: Project Tokens into Queries, Keys, and Values
The first important move is to stop using the raw embeddings directly for comparison.
Instead, each token is projected into three learned spaces:
- Query: what this token is looking for
- Key: what this token offers
- Value: the information this token contributes if selected
We define three learned matrices:

W_Q of shape (d_model, d_k), W_K of shape (d_model, d_k), W_V of shape (d_model, d_v)

Then:

Q = X W_Q,  K = X W_K,  V = X W_V
This separation is subtle, but it is the first big conceptual leap.
If we used the same embedding for matching and for information transfer, then the model would have to use one representation for two different jobs:
- deciding relevance
- carrying content
Splitting them lets the network learn those roles independently.
Why Queries and Keys?
Suppose token i is asking: "who should I pay attention to?"
Its query vector q_i encodes that request.
Every other token j exposes a key vector k_j that says what kind of information it contains. If k_j aligns strongly with q_i, then token j is relevant to token i.
That gives us a natural scoring function:

score(i, j) = q_i · k_j
The dot product is not the only possible similarity measure, but it is simple, differentiable, efficient, and easy to batch.
Step 2: Build the Score Matrix
Now we compute all pairwise query-key interactions at once:

S = Q K^T

Because:
- Q has shape (n, d_k)
- K^T has shape (d_k, n)
their product has shape:

(n, n)

Entry S_ij tells us how much token i should attend to token j before normalization.
This matrix is the learned relationship graph of the sequence.
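The shape bookkeeping above can be checked in a few lines. This is a sketch with illustrative sizes (n=6, d_model=8, d_k=4 are arbitrary choices for the demo):

```python
import torch

torch.manual_seed(0)
n, d_model, d_k = 6, 8, 4  # illustrative sizes, not from the post

X = torch.randn(n, d_model)
W_q = torch.randn(d_model, d_k)
W_k = torch.randn(d_model, d_k)

Q = X @ W_q          # (n, d_k)
K = X @ W_k          # (n, d_k)
S = Q @ K.T          # (n, n): S[i, j] scores token i attending to token j

print(S.shape)  # torch.Size([6, 6])
```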
A Concrete Worked Example
Take a very short token sequence: "the cat sat there".
Imagine we are updating the representation of sat.
Its query might align strongly with:
- cat, because the verb wants its subject
- there, because location information may matter
and less strongly with:
- the, because determiners often contribute less semantic content
Whatever the raw scores for sat turn out to be, attention will not keep them as raw magnitudes. It will turn them into a normalized distribution.
That is the next step.
Step 3: Scale the Scores
If we stop at S = Q K^T, the mechanism works in principle, but training becomes unstable as dimensionality grows.
The standard transformer rescales the scores by 1/√d_k:

S_scaled = Q K^T / √d_k
Why this specific factor?
Because a dot product over d_k dimensions is a sum of many terms. If each term has roughly unit variance, then the variance of the sum grows with d_k.
So as the key/query dimension gets larger:
- raw scores become larger in magnitude
- softmax becomes sharper
- gradients become less well-behaved
Dividing by √d_k counteracts that growth.
More formally, if:

q · k = Σ_t q_t k_t,  summing over t = 1, ..., d_k

and the summands are roughly independent with variance near 1, then the variance of the sum grows proportionally to d_k. Scaling by 1/√d_k brings the magnitude back into a range where softmax behaves more smoothly.
This one line is not cosmetic. It is one of the reasons attention trains reliably at scale.
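The variance argument is easy to verify empirically. A quick sketch (sample sizes and dimensions are arbitrary illustrative choices):

```python
import torch

torch.manual_seed(0)

# Empirical check: for unit-variance q and k, Var(q . k) grows with d_k,
# and dividing by sqrt(d_k) brings it back to roughly 1.
for d_k in (16, 256):
    q = torch.randn(100_000, d_k)
    k = torch.randn(100_000, d_k)
    dots = (q * k).sum(dim=-1)
    raw_var = dots.var().item()            # roughly d_k
    scaled_var = (dots / d_k**0.5).var().item()  # roughly 1
    print(d_k, round(raw_var, 1), round(scaled_var, 2))
```

The raw variance tracks d_k almost exactly, while the scaled variance stays near 1 regardless of dimension.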
Step 4: Normalize Row-Wise with Softmax
The model now has a matrix of scaled compatibility scores. But those scores are not yet usable as mixing weights.
We want each token to distribute one unit of attention mass across the sequence.
So we apply softmax across each row:

A = softmax(S / √d_k)

This means:

A_ij = exp(s_ij) / Σ_k exp(s_ik),  where s_ij is the scaled score
Important consequences:
- every entry is nonnegative
- every row sums to 1
- each row is a probability distribution over tokens
Now row i answers the question:
When token i looks across the sequence, how much weight should it place on each token?
Step 5: Aggregate the Values
Once we have attention weights, we use them to mix the value vectors.
Since:
- A has shape (n, n)
- V has shape (n, d_v)
the output has shape:

output = A V, of shape (n, d_v)

Each output row is:

output_i = Σ_j A_ij v_j

That is the complete mechanism.
And for self-attention, Q, K, and V all come from the same input sequence X.
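The weighted-sum interpretation of each output row can be verified by hand. A sketch with illustrative sizes (n=5, d_k=4, d_v=3, all arbitrary):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, d_k, d_v = 5, 4, 3  # illustrative sizes

Q, K = torch.randn(n, d_k), torch.randn(n, d_k)
V = torch.randn(n, d_v)

A = F.softmax(Q @ K.T / d_k**0.5, dim=-1)  # (n, n) attention weights
out = A @ V                                 # (n, d_v)

# Row i of the output is the mixture sum_j A[i, j] * V[j].
manual_row0 = (A[0].unsqueeze(-1) * V).sum(dim=0)
print(torch.allclose(out[0], manual_row0, atol=1e-5))  # True
```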
Interpreting the Output
After attention, token i is no longer just its original embedding. It becomes a context-aware mixture of the whole sequence.
This is the decisive difference from older sequence models.
The representation of a token is no longer forced to evolve only through a local sequential state update. It can be rebuilt directly from whichever tokens the model deems relevant.
That makes attention both expressive and parallelizable.
A Shape-First Summary
For reference, the shape flow for a single attention head is:
| Object | Meaning | Shape |
|---|---|---|
| X | input embeddings | (n, d_model) |
| W_Q | query projection | (d_model, d_k) |
| W_K | key projection | (d_model, d_k) |
| W_V | value projection | (d_model, d_v) |
| Q = X W_Q | queries | (n, d_k) |
| K = X W_K | keys | (n, d_k) |
| V = X W_V | values | (n, d_v) |
| S = Q K^T | raw scores | (n, n) |
| A = softmax(S / √d_k) | attention weights | (n, n) |
| A V | output | (n, d_v) |
If you remember only one thing operationally, remember this:

Attention(Q, K, V) = softmax(Q K^T / √d_k) V
Why This Solves the Original Problem
Return to the earlier design target.
Every token can look at every other token
That is built directly into S = Q K^T. The mechanism computes all pairwise interactions in one shot.
The model decides what matters
The learned projections W_Q, W_K, and W_V are trained end to end. There are no hand-written matching rules.
The computation parallelizes well
Everything is a matrix multiplication or elementwise transformation. GPUs are very good at this pattern.
Causal Masking for Decoder-Only LLMs
So far, the formulation allows every token to attend to every other token.
That is fine for bidirectional encoders. It is not fine for autoregressive language modeling.
In a decoder-only LLM, token i must not look at future tokens j > i. Otherwise the model would cheat during training.
The fix is a causal mask.
Before softmax, we add a mask matrix M:

A = softmax(Q K^T / √d_k + M)

where:

M_ij = 0 if j ≤ i, and M_ij = −∞ if j > i

Because exp(−∞) = 0, masked future positions receive exactly zero attention mass after softmax.
This is the decoder trick that turns generic self-attention into a valid next-token predictor.
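A sketch of the causal mask in code, using `masked_fill` with −∞ (equivalent to adding the mask matrix M; the 4×4 score matrix is an arbitrary illustrative example):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n = 4
S = torch.randn(n, n)  # illustrative scaled scores

# Lower-triangular mask: position i may attend only to positions j <= i.
mask = torch.tril(torch.ones(n, n, dtype=torch.bool))
S_masked = S.masked_fill(~mask, float("-inf"))

A = F.softmax(S_masked, dim=-1)
# Row 0 can only see itself, so all its mass lands on position 0.
print(A[0])  # tensor([1., 0., 0., 0.])
# Everything strictly above the diagonal is exactly zero.
print(torch.allclose(A.triu(1), torch.zeros(n, n)))  # True
```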
Multi-Head Attention
A single attention head gives one learned pattern of interaction. In practice, that is too restrictive.
Different relationships may matter simultaneously:
- subject-verb agreement
- long-range references
- punctuation structure
- local phrase grouping
- induction-like copying behavior
So transformers run several attention heads in parallel.
For head i:

Q_i = X W_Q^i,  K_i = X W_K^i,  V_i = X W_V^i

Each head computes:

head_i = softmax(Q_i K_i^T / √d_k) V_i

Then the heads are concatenated:

Concat(head_1, ..., head_h)

and projected back:

MultiHead(X) = Concat(head_1, ..., head_h) W_O

with W_O mapping the concatenated representation back into the model dimension.
The reason this helps is not mystical. It is representational.
Each head gets its own learned subspace for matching and aggregation. That lets the model discover several distinct relational patterns at once instead of forcing all interaction types through one shared score matrix.
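The split-compute-concatenate flow can be sketched with the common implementation trick of one big projection per role, reshaped into heads (all sizes here are illustrative choices, not from the post):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, d_model, h = 6, 16, 4      # illustrative sizes; h must divide d_model
d_head = d_model // h

X = torch.randn(n, d_model)
# One projection per role, then split into h heads of width d_head.
W_q, W_k, W_v, W_o = (torch.randn(d_model, d_model) for _ in range(4))

def split(t):  # (n, d_model) -> (h, n, d_head)
    return t.view(n, h, d_head).transpose(0, 1)

Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
A = F.softmax(Q @ K.transpose(-2, -1) / d_head**0.5, dim=-1)  # (h, n, n)
heads = A @ V                                                 # (h, n, d_head)

concat = heads.transpose(0, 1).reshape(n, d_model)  # concatenate heads
out = concat @ W_o                                  # project back to d_model
print(out.shape)  # torch.Size([6, 16])
```

Each head sees only its own d_head-dimensional slice, so the h score matrices can encode different relational patterns.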
Positional Information: Attention Alone Does Not Know Order
Pure attention is permutation-invariant with respect to the set of token embeddings. If you shuffled the tokens but kept the same embeddings, the mechanism itself has no built-in notion of first, second, or last.
That means attention alone does not know sequence order.
So transformers inject positional information. Conceptually, the model does not see just X. It sees something like:

X + P

where row p_t of P is a positional representation for token position t.
This can be done in several ways:
- fixed sinusoidal encodings
- learned positional embeddings
- rotary or relative position mechanisms in newer models
The implementation details vary, but the conceptual role is the same: give the model a way to tell token positions apart.
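As one concrete instance, here is a sketch of the fixed sinusoidal encoding from the original Transformer paper (the sequence length and dimension are arbitrary illustrative choices):

```python
import torch

# P[t, 2i]   = sin(t / 10000^(2i / d_model))
# P[t, 2i+1] = cos(t / 10000^(2i / d_model))
def sinusoidal_positions(n, d_model):
    t = torch.arange(n, dtype=torch.float32).unsqueeze(1)  # (n, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)   # (d_model/2,)
    angles = t / (10000 ** (i / d_model))                  # (n, d_model/2)
    P = torch.zeros(n, d_model)
    P[:, 0::2] = torch.sin(angles)
    P[:, 1::2] = torch.cos(angles)
    return P

P = sinusoidal_positions(8, 16)
X = torch.randn(8, 16)
X_with_pos = X + P  # the model sees embeddings plus position information
print(P.shape)  # torch.Size([8, 16])
```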
Cross-Attention
Self-attention uses one sequence X to produce Q, K, and V.
Cross-attention mixes two sequences:
- queries come from one sequence
- keys and values come from another
For example, in an encoder-decoder model:

Q = X_dec W_Q,  K = X_enc W_K,  V = X_enc W_V

Then the same attention formula applies:

softmax(Q K^T / √d_k) V
The difference is not in the equation. It is in where the inputs come from.
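A sketch of that input routing, with hypothetical `X_dec` and `X_enc` standing in for decoder states and encoder outputs (all names and sizes are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model, d_k, d_v = 8, 4, 4          # illustrative sizes
X_dec = torch.randn(3, d_model)      # e.g. decoder states (queries)
X_enc = torch.randn(7, d_model)      # e.g. encoder outputs (keys/values)

W_q, W_k, W_v = (torch.randn(d_model, d) for d in (d_k, d_k, d_v))

Q = X_dec @ W_q                      # (3, d_k): queries from one sequence
K = X_enc @ W_k                      # (7, d_k): keys from the other
V = X_enc @ W_v                      # (7, d_v)

A = F.softmax(Q @ K.T / d_k**0.5, dim=-1)  # (3, 7): one row per query token
out = A @ V                                # (3, d_v)
print(out.shape)  # torch.Size([3, 4])
```

The attention matrix is now rectangular: each of the 3 query tokens distributes its attention mass over the 7 source tokens.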
Why Dot Products Work So Well
A natural question is: why use a dot product at all?
Because the dot product between q_i and k_j acts as a learned compatibility score:
- if they point in similar directions, the score is large
- if they are orthogonal or opposed, the score is small or negative
That means the model learns a geometry where relevant matches become aligned.
This is one of the cleanest aspects of attention: the mechanism is not just symbolic matching. It is geometric matching in a learned vector space.
Complexity and Trade-Offs
Attention is powerful, but it is not free.
The score matrix has shape (n, n), so both compute and memory grow quadratically with sequence length n.
This becomes expensive for long contexts. That is why there is so much research into:
- sparse attention
- windowed attention
- linear attention approximations
- FlashAttention-style memory-efficient kernels
But the standard scaled dot-product mechanism remains the reference point because it is conceptually simple and empirically strong.
A Minimal PyTorch Implementation
Below is the cleanest implementation that still preserves the core mechanics.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    def __init__(self, d_model, d_k=None, d_v=None):
        super().__init__()
        d_k = d_model if d_k is None else d_k
        d_v = d_model if d_v is None else d_v
        self.d_k = d_k
        self.W_q = nn.Linear(d_model, d_k, bias=False)
        self.W_k = nn.Linear(d_model, d_k, bias=False)
        self.W_v = nn.Linear(d_model, d_v, bias=False)

    def forward(self, x, mask=None):
        # x: (batch, seq_len, d_model)
        Q = self.W_q(x)  # (batch, seq_len, d_k)
        K = self.W_k(x)  # (batch, seq_len, d_k)
        V = self.W_v(x)  # (batch, seq_len, d_v)

        scores = Q @ K.transpose(-2, -1)  # (batch, seq_len, seq_len)
        scores = scores / (self.d_k ** 0.5)

        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))

        attn = F.softmax(scores, dim=-1)  # row-wise over keys
        out = attn @ V                    # (batch, seq_len, d_v)
        return out, attn
```
The essential points are:
- `K.transpose(-2, -1)` aligns the last two dimensions for matrix multiplication
- softmax is applied across the sequence axis, not across the feature axis
- the output is a weighted sum of value vectors, not key vectors
Tensor Shapes in Batch Form
If X has shape:

(batch, seq_len, d_model)

then:

- Q and K have shape (batch, seq_len, d_k)
- V has shape (batch, seq_len, d_v)
- scores and attn have shape (batch, seq_len, seq_len)
- out has shape (batch, seq_len, d_v)
That last line is the easiest place to make a mental mistake. The attention matrix mixes across tokens, not across feature dimensions.
A Small Numerical Example
Take one query vector q and three key vectors k_1, k_2, k_3.
Then the raw scores are the dot products:

s_1 = q · k_1,  s_2 = q · k_2,  s_3 = q · k_3

Suppose s_1 > s_2 > s_3. If we apply softmax to (s_1, s_2, s_3), the first token gets the largest weight, the second gets some attention, and the third gets the least.
If the value vectors are v_1, v_2, v_3, then the output is the weighted mixture:

out = a_1 v_1 + a_2 v_2 + a_3 v_3
This tiny example captures the entire mechanism. Larger models just do it in high-dimensional spaces, with learned projections, across all tokens at once.
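The same tiny example in code, with hypothetical 2-dimensional vectors chosen so the scores come out ordered (these specific numbers are my own illustration):

```python
import torch
import torch.nn.functional as F

q = torch.tensor([1.0, 0.0])
K = torch.tensor([[2.0, 0.0],   # k1: strongly aligned with q
                  [1.0, 0.0],   # k2: moderately aligned
                  [0.0, 1.0]])  # k3: orthogonal to q
V = torch.tensor([[1.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 1.0]])

scores = K @ q                      # tensor([2., 1., 0.])
weights = F.softmax(scores / 2**0.5, dim=-1)  # d_k = 2, so scale by sqrt(2)
out = weights @ V                   # weighted mixture of the value rows

print(scores)                       # tensor([2., 1., 0.])
print(weights.argmax().item())      # 0: the aligned key wins
```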
What Attention Is Not
There are a few common misunderstandings worth clearing up.
Attention is not just "importance"
An attention weight is not a universal measure of semantic importance. It is a context-dependent routing weight inside a specific head and layer.
Attention is not explanation by default
People often visualize attention maps as if they were direct explanations of model reasoning. Sometimes they are useful hints. They are not the entire story of how the model computes.
Attention is not the whole transformer
A transformer block also includes:
- residual connections
- normalization
- a position-wise feed-forward network
Attention is the routing mechanism, but not the entire architecture.
Intuition That I Keep Coming Back To
The cleanest way I know to remember attention is this: every token asks a question (its query), every token advertises an answer (its key), and the similarity between question and answer decides how much of each token's content (its value) flows into the result.
That framing preserves all the math without reducing it to magic.
Final Summary
The attention mechanism in large language models emerges from a straightforward sequence of ideas:
- compare every token with every other token
- make the comparison learnable using query and key projections
- stabilize the scores with the 1/√d_k scaling
- normalize row-wise with softmax
- use the resulting weights to mix value vectors
That gives us:

Attention(Q, K, V) = softmax(Q K^T / √d_k) V

and from that one equation we get:
- global context
- learned routing
- parallel computation
- a practical foundation for modern transformers
This is the core mechanism. Multi-head attention, masking, positional information, and cross-attention are all extensions or constraints around that same core.
For me, the important shift is this: attention is not a mysterious black box layer. It is a structured differentiable lookup over the sequence, where similarity determines routing and routing determines representation.
That is what makes it such a powerful building block for language models.