How Attention Works in Large Language Models

My notes on the attention mechanism in LLMs

April 3, 2026

Attention is the mechanism that made transformers practical for large-scale language modeling. It is also one of those ideas that looks simple after you have seen it enough times, but feels slippery the first time you come across it.

I recently revisited the self-attention mechanism while trying to build a minimal Transformer from scratch. Like many engineers, I had used libraries where attention is a one-liner, but while digging into its implementation I thought it might be useful to put my notes on my website instead of leaving them on paper or on my whiteboard.

The post here is deliberately expansive. I wanted it all in one place because I often find myself having to remind myself how something works when I'm trying an alternative approach, or reading a paper about something similar and needing to compare.

Why attention replaced strictly sequential sequence models
[Figure: a sequential RNN / LSTM path compared with the dense interaction graph of self-attention over tokens x₁, x₂, x₃. Long paths make distant interactions harder to preserve; with self-attention, any token can attend to any other token without passing through many steps.]
The left path has to pass information token by token. The right path lets every token form direct connections in a single layer.

The Problem

Before transformers, sequence modeling was dominated by recurrent networks like RNNs and LSTMs.

They worked, but they carried two structural limitations:

  1. They process tokens sequentially.
  2. Information has to travel through many intermediate steps.

That creates two familiar problems.

1. Long-range dependencies degrade

Suppose a model is reading:

"The research paper that I read last week, after several dense appendices, was surprisingly clear."

When the model reaches the word clear, it may need information from paper. In a recurrent architecture, that information has to survive many update steps. Even if gating helps, the path between those two words is still long.

2. Training is hard to parallelize

The hidden state at time step t depends on the hidden state at time step t - 1. That means the model cannot process the whole sequence in one batched matrix multiplication. Throughput suffers.

CNN-based sequence models improve the parallelism story, but they still depend on stacked local receptive fields. Global interaction does not appear immediately. It has to be built up layer by layer.

So the design target becomes:

  • every token should be able to interact with every other token
  • the model should decide which interactions matter
  • the whole computation should be expressible with matrix operations

That target is exactly where attention comes from.

Initial Intuition

My first instinct when I thought about this was embarrassingly naive:

Why not just average all token embeddings together and give every token the global average?

That fails immediately. If every token receives the same average, then the representation of each token becomes less specific, not more specific. The sentence collapses into a single blurry summary.

The next idea is closer:

  1. compare tokens to each other
  2. measure how relevant one token is to another
  3. use those relevance scores to form a weighted mixture

That is already the essence of attention.

What remains is to make that idea:

  • learnable
  • stable
  • efficient
  • compatible with batched matrix operations

The Core Question

At its heart, self-attention answers one question:

For each token, which other tokens should it draw information from, and by how much?

Everything else in the mechanism exists to answer that question in a differentiable, trainable way.

Setup and Notation

Assume an input sequence has n tokens, and each token embedding has dimension d_{model}.

We stack the token embeddings row-wise:

Input Matrix
X \in \mathbb{R}^{n \times d_{model}}, \qquad X = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}

Here:

  • n is the sequence length
  • x_i \in \mathbb{R}^{d_{model}} is the embedding for token i

The goal is to transform X into a new matrix of context-aware token representations.

Step 1: Project Tokens into Queries, Keys, and Values

The first important move is to stop using the raw embeddings directly for comparison.

Instead, each token is projected into three learned spaces:

  • Query: what this token is looking for
  • Key: what this token offers
  • Value: the information this token contributes if selected
From embeddings to queries, keys, and values
[Figure: input embeddings X (n × d_model) projected through W_Q, W_K (d_model × d_k) and W_V (d_model × d_v) into queries Q (n × d_k), keys K (n × d_k), and values V (n × d_v).]
Every token embedding is projected three times. Queries decide what to look for, keys expose what each token contains, and values carry the information that is mixed together.

We define three learned matrices:

Learned Projections
W_Q \in \mathbb{R}^{d_{model} \times d_k}, \qquad W_K \in \mathbb{R}^{d_{model} \times d_k}, \qquad W_V \in \mathbb{R}^{d_{model} \times d_v}

Then:

Q, K, and V
Q = XW_Q \in \mathbb{R}^{n \times d_k}
K = XW_K \in \mathbb{R}^{n \times d_k}
V = XW_V \in \mathbb{R}^{n \times d_v}

This separation is subtle, but it is the first big conceptual leap.

If we used the same embedding for matching and for information transfer, then the model would have to use one representation for two different jobs:

  1. deciding relevance
  2. carrying content

Splitting them lets the network learn those roles independently.
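As a quick sanity check on the shapes, the three projections can be sketched in a few lines of NumPy. The dimensions here are arbitrary toy sizes, and the random matrices stand in for trained weights:

```python
import numpy as np

rng = np.random.default_rng(0)

n, d_model, d_k, d_v = 4, 8, 6, 5     # toy sizes, chosen arbitrarily

X = rng.normal(size=(n, d_model))     # token embeddings, one row per token

# learned projection matrices (random stand-ins for trained weights)
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_v))

Q = X @ W_Q   # (n, d_k): what each token is looking for
K = X @ W_K   # (n, d_k): what each token offers
V = X @ W_V   # (n, d_v): what each token contributes if selected

print(Q.shape, K.shape, V.shape)      # (4, 6) (4, 6) (4, 5)
```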

Why Queries and Keys?

Suppose token i is asking: "who should I pay attention to?"

Its query vector q_i encodes that request.

Every other token j exposes a key vector k_j that says what kind of information it contains. If q_i aligns strongly with k_j, then token j is relevant to token i.

That gives us a natural scoring function:

Pairwise Compatibility
\text{score}(i, j) = q_i \cdot k_j

The dot product is not the only possible similarity measure, but it is simple, differentiable, efficient, and easy to batch.

Step 2: Build the Score Matrix

Now we compute all pairwise query-key interactions at once:

Attention scores form a learned relationship matrix
[Figure: Q (n × d_k) multiplied by Kᵀ (d_k × n) produces the score matrix S = QKᵀ (n × n); softmax over each row turns scores into attention weights A.]
Row i answers: if token i is asking the question, how strongly does it align with every token j?
Raw Attention Scores
S = QK^\top

Because:

  • Q has shape n \times d_k
  • K^\top has shape d_k \times n

their product has shape:

Shape of the Score Matrix
S \in \mathbb{R}^{n \times n}

Entry (i, j) tells us how much token i should attend to token j before normalization.

This matrix is the learned relationship graph of the sequence.

A Concrete Worked Example

Take a very short token sequence:

[\texttt{the}, \texttt{cat}, \texttt{sat}, \texttt{there}]

Imagine we are updating the representation of sat.

Its query might align strongly with:

  • cat, because the verb wants its subject
  • there, because location information may matter
  • less strongly with the, because determiners often contribute less semantic content

If the raw scores for sat are:

[1.2,\ 3.1,\ 2.7,\ 2.4]

then attention will not keep them as raw magnitudes. It will turn them into a normalized distribution.

That is the next step.

Step 3: Scale the Scores

If we stop at QK^\top, the mechanism works in principle, but training becomes unstable as dimensionality grows.

The standard transformer rescales the scores by \sqrt{d_k}:

Scaled Dot-Product Attention Scores
\tilde{S} = \frac{QK^\top}{\sqrt{d_k}}

Why this specific factor?

Because a dot product over d_k dimensions is a sum of many terms. If each term has roughly unit variance, then the variance of the sum grows with d_k.

So as the key/query dimension gets larger:

  • raw scores become larger in magnitude
  • softmax becomes sharper
  • gradients become less well-behaved

Dividing by \sqrt{d_k} counteracts that growth.

More formally, if:

q_i \cdot k_j = \sum_{\ell = 1}^{d_k} q_{i\ell} k_{j\ell}

and the summands are roughly independent with variance near 1, then the variance of the sum grows proportionally to d_k. Scaling by \sqrt{d_k} brings the magnitude back into a range where softmax behaves more smoothly.

This one line is not cosmetic. It is one of the reasons attention trains reliably at scale.
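The variance argument is easy to check empirically. This is a rough simulation I'm adding for illustration, not part of the original derivation: with unit-variance random queries and keys, the raw dot products spread out as d_k grows, while the scaled scores stay close to unit variance.

```python
import numpy as np

rng = np.random.default_rng(0)

for d_k in (16, 256):
    q = rng.normal(size=(10_000, d_k))   # unit-variance components
    k = rng.normal(size=(10_000, d_k))

    raw = (q * k).sum(axis=1)            # raw dot products over d_k terms
    scaled = raw / np.sqrt(d_k)          # scaled dot products

    # raw variance grows roughly like d_k; scaled variance stays near 1
    print(d_k, raw.var().round(1), scaled.var().round(2))
```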

Step 4: Normalize Row-Wise with Softmax

The model now has a matrix of scaled compatibility scores. But those scores are not yet usable as mixing weights.

We want each token to distribute one unit of attention mass across the sequence.

So we apply softmax across each row:

Attention Weights
A = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)

This means:

A_{ij} = \frac{\exp(\tilde{S}_{ij})}{\sum_{m=1}^{n} \exp(\tilde{S}_{im})}

Important consequences:

  • every entry is nonnegative
  • every row sums to 1
  • each row is a probability distribution over tokens
A toy attention pattern
A = \begin{bmatrix} 0.55 & 0.25 & 0.12 & 0.08 \\ 0.10 & 0.64 & 0.18 & 0.08 \\ 0.06 & 0.18 & 0.58 & 0.18 \\ 0.05 & 0.10 & 0.20 & 0.65 \end{bmatrix} \quad \text{(rows and columns indexed by } t_1, t_2, t_3, t_4\text{)}
A small example makes the row-wise normalization visible. Darker cells indicate larger attention mass.

Now row i answers the question:

When token i looks across the sequence, how much weight should it place on each token?
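The row-wise normalization and its consequences can be verified directly. A small NumPy sketch with arbitrary scores:

```python
import numpy as np

def softmax_rows(S):
    # subtract the row max for numerical stability, then normalize each row
    e = np.exp(S - S.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

S = np.array([[2.0, 1.0, 0.1],
              [0.5, 0.5, 0.5],
              [3.0, -1.0, 0.0]])

A = softmax_rows(S)

print(A.min() >= 0)                      # every entry is nonnegative: True
print(np.allclose(A.sum(axis=1), 1.0))   # every row sums to 1: True
```

Note the second row of S is constant, so its attention weights come out uniform: equal scores mean equal mixing.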

Step 5: Aggregate the Values

Once we have attention weights, we use them to mix the value vectors.

Weighted Value Aggregation
O = AV

Since:

  • A \in \mathbb{R}^{n \times n}
  • V \in \mathbb{R}^{n \times d_v}

the output has shape:

O \in \mathbb{R}^{n \times d_v}

Each output row is:

o_i = \sum_{j=1}^{n} A_{ij} v_j

That is the complete mechanism.
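The five steps above fit in a few lines. A toy NumPy version of the single-head formula, with arbitrary dimensions:

```python
import numpy as np

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n, n) scaled scores
    scores -= scores.max(axis=-1, keepdims=True)  # stability shift
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)            # row-wise softmax
    return A @ V, A                               # mix values with weights

rng = np.random.default_rng(0)
n, d_k, d_v = 5, 4, 3
Q, K = rng.normal(size=(2, n, d_k))
V = rng.normal(size=(n, d_v))

O, A = attention(Q, K, V)
print(O.shape, A.shape)   # (5, 3) (5, 5)
```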

Self-Attention in One Line
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V

And for self-attention, Q, K, and V all come from the same input sequence X.

Interpreting the Output

After attention, token i is no longer just its original embedding. It becomes a context-aware mixture of the whole sequence.

This is the decisive difference from older sequence models.

The representation of a token is no longer forced to evolve only through a local sequential state update. It can be rebuilt directly from whichever tokens the model deems relevant.

That makes attention both expressive and parallelizable.

A Shape-First Summary

For reference, the shape flow for a single attention head is:

Object  | Meaning            | Shape
X       | input embeddings   | n × d_model
W_Q     | query projection   | d_model × d_k
W_K     | key projection     | d_model × d_k
W_V     | value projection   | d_model × d_v
Q       | queries            | n × d_k
K       | keys               | n × d_k
V       | values             | n × d_v
QKᵀ     | raw scores         | n × n
A       | attention weights  | n × n
O       | output             | n × d_v

If you remember only one thing operationally, remember this: the n × n attention matrix mixes information across tokens; every other matrix in the pipeline only changes feature dimensions.

Why This Solves the Original Problem

Return to the earlier design target.

Every token can look at every other token

That is built directly into QK^\top. The mechanism computes all pairwise interactions in one shot.

The model decides what matters

The learned projections W_Q, W_K, and W_V are trained end to end. There are no hand-written matching rules.

The computation parallelizes well

Everything is a matrix multiplication or elementwise transformation. GPUs are very good at this pattern.

Causal Masking for Decoder-Only LLMs

So far, the formulation allows every token to attend to every other token.

That is fine for bidirectional encoders. It is not fine for autoregressive language modeling.

In a decoder-only LLM, token t_i must not look at future tokens t_{i+1}, t_{i+2}, \dots. Otherwise the model would cheat during training.

The fix is a causal mask.

Causal masking prevents peeking into the future
[Figure: a 5 × 5 grid over t₁…t₅ with allowed positions in the lower triangle and masked positions above it. For token t₄, only t₁ through t₄ are visible. Scores to later tokens are set to −∞ before softmax, which makes their probability exactly 0 and keeps training and inference aligned with left-to-right next-token prediction.]
Decoder-only language models can only use tokens at or before the current position, so the upper triangle is masked before softmax.

Before softmax, we add a mask matrix M:

Masked Attention
A = \text{softmax}\left(\frac{QK^\top + M}{\sqrt{d_k}}\right)

where:

M_{ij} = \begin{cases} 0 & \text{if } j \le i \\ -\infty & \text{if } j > i \end{cases}

Because \exp(-\infty) = 0, masked future positions receive exactly zero attention mass after softmax.

This is the decoder trick that turns generic self-attention into a valid next-token predictor.
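The mask is easy to build and check with a lower-triangular matrix. A NumPy sketch:

```python
import numpy as np

n = 5
# lower-triangular boolean matrix: position i may see positions j <= i
allowed = np.tril(np.ones((n, n), dtype=bool))

scores = np.random.default_rng(0).normal(size=(n, n))
scores = np.where(allowed, scores, -np.inf)   # mask future positions

# row-wise softmax; exp(-inf) = 0, so future tokens get zero mass
e = np.exp(scores - scores.max(axis=-1, keepdims=True))
A = e / e.sum(axis=-1, keepdims=True)

print(np.allclose(A[~allowed], 0.0))   # True: no attention to the future
print(np.allclose(A.sum(axis=1), 1.0)) # True: rows still normalize
```

The first row is the degenerate case: token t₁ can only see itself, so all of its attention mass lands on position 1.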

Multi-Head Attention

A single attention head gives one learned pattern of interaction. In practice, that is too restrictive.

Different relationships may matter simultaneously:

  • subject-verb agreement
  • long-range references
  • punctuation structure
  • local phrase grouping
  • induction-like copying behavior

So transformers run several attention heads in parallel.

Multi-head attention learns several relationship types at once
[Figure: input X (n × d_model) flows into several heads, each capturing a different pattern; their outputs are concatenated to n × (h · d_v) and projected by W_O back to d_model.]
Each head has its own learned Q, K, and V projections. Their outputs are concatenated and projected back into the model space.

For head h:

Q^{(h)} = XW_Q^{(h)}, \qquad K^{(h)} = XW_K^{(h)}, \qquad V^{(h)} = XW_V^{(h)}

Each head computes:

O^{(h)} = \text{softmax}\left(\frac{Q^{(h)} K^{(h)\top}}{\sqrt{d_k}}\right)V^{(h)}

Then the heads are concatenated:

\text{Concat}(O^{(1)}, O^{(2)}, \dots, O^{(H)})

and projected back:

\text{MHA}(X) = \text{Concat}(O^{(1)}, \dots, O^{(H)})W_O

with W_O mapping the concatenated representation back into the model dimension.

The reason this helps is not mystical. It is representational.

Each head gets its own learned subspace for matching and aggregation. That lets the model discover several distinct relational patterns at once instead of forcing all interaction types through one shared score matrix.
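In common implementations the per-head projections are fused into one big matrix and split with a reshape. A minimal NumPy sketch under that assumption (head count, sizes, and the fused W_qkv layout are all my own illustrative choices):

```python
import numpy as np

def multi_head_attention(X, W_qkv, W_o, n_heads):
    n, d_model = X.shape
    d_head = d_model // n_heads

    # one fused projection, then split into Q, K, V
    Q, K, V = np.split(X @ W_qkv, 3, axis=-1)

    # reshape to (n_heads, n, d_head) so each head attends independently
    def split_heads(M):
        return M.reshape(n, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = map(split_heads, (Q, K, V))

    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (h, n, n)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A = e / e.sum(axis=-1, keepdims=True)

    O = A @ V                                             # (h, n, d_head)
    O = O.transpose(1, 0, 2).reshape(n, d_model)          # concat heads
    return O @ W_o                                        # back to d_model

rng = np.random.default_rng(0)
n, d_model, n_heads = 6, 8, 2
X = rng.normal(size=(n, d_model))
W_qkv = rng.normal(size=(d_model, 3 * d_model))
W_o = rng.normal(size=(d_model, d_model))

print(multi_head_attention(X, W_qkv, W_o, n_heads).shape)  # (6, 8)
```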

Positional Information: Attention Alone Does Not Know Order

Pure attention is permutation-invariant with respect to the set of token embeddings. If you shuffled the tokens but kept the same embeddings, the mechanism itself has no built-in notion of first, second, or last.

That means attention alone does not know sequence order.

Attention alone has no built-in sense of order
[Figure: without position, the token set [cat, sat, there] is permutation-invariant and matching cannot distinguish order by itself; with positions added (xᵢ + pᵢ, e.g. [cat + p₁, sat + p₂, there + p₃]), content and position travel together, so the model can reason about sequence order.]
Without positional information, the same set of token embeddings can be permuted with no way for the model to know which token came first.

So transformers inject positional information. Conceptually, the model does not see just x_i. It sees something like:

z_i = x_i + p_i

where p_i is a positional representation for token position i.

This can be done in several ways:

  • fixed sinusoidal encodings
  • learned positional embeddings
  • rotary or relative position mechanisms in newer models
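The fixed sinusoidal variant from the original transformer is the easiest one to sketch. A NumPy version (constants follow the standard 10000-base formulation):

```python
import numpy as np

def sinusoidal_encoding(n_positions, d_model):
    # even dims get sine, odd dims get cosine, at geometrically spaced frequencies
    pos = np.arange(n_positions)[:, None]          # (n_positions, 1)
    i = np.arange(d_model // 2)[None, :]           # (1, d_model / 2)
    freq = 1.0 / (10000.0 ** (2 * i / d_model))

    P = np.zeros((n_positions, d_model))
    P[:, 0::2] = np.sin(pos * freq)
    P[:, 1::2] = np.cos(pos * freq)
    return P

P = sinusoidal_encoding(50, 16)
print(P.shape)                          # (50, 16)
print(P.min() >= -1 and P.max() <= 1)   # values stay in [-1, 1]: True
```

Each position gets a distinct pattern of sines and cosines, so adding P to the embeddings makes order recoverable from the representation itself.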

The implementation details vary, but the conceptual role is the same: give each token a representation that carries not just what it is, but where it sits in the sequence.

Cross-Attention

Self-attention uses one sequence to produce Q, K, and V.

Cross-attention mixes two sequences:

  • queries come from one sequence
  • keys and values come from another
Cross-attention mixes two sequences instead of one
[Figure: decoder states produce queries Q; encoder outputs produce keys K and values V; the result is a cross-attended decoder update.]
Queries usually come from the current decoder state, while keys and values come from another sequence such as encoder outputs.

For example:

Q = X_{\text{decoder}} W_Q, \qquad K = X_{\text{encoder}} W_K, \qquad V = X_{\text{encoder}} W_V

Then the same attention formula applies:

\text{CrossAttention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V

The difference is not in the equation. It is in where the inputs come from.
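The asymmetry shows up in the shapes: with m decoder queries and n encoder keys/values, the weight matrix is m × n rather than square. A NumPy sketch with made-up sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, d_model, d_k, d_v = 3, 7, 8, 4, 5   # decoder len, encoder len, toy dims

X_dec = rng.normal(size=(m, d_model))
X_enc = rng.normal(size=(n, d_model))

W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_v))

Q = X_dec @ W_Q          # queries from the decoder sequence
K = X_enc @ W_K          # keys from the encoder sequence
V = X_enc @ W_V          # values from the encoder sequence

scores = Q @ K.T / np.sqrt(d_k)                      # (m, n), not square
e = np.exp(scores - scores.max(axis=-1, keepdims=True))
A = e / e.sum(axis=-1, keepdims=True)

print((A @ V).shape)   # (3, 5): one context vector per decoder position
```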

Why Dot Products Work So Well

A natural question is: why use a dot product at all?

Because the dot product between q_i and k_j acts as a learned compatibility score:

  • if they point in similar directions, the score is large
  • if they are orthogonal or opposed, the score is small or negative

That means the model learns a geometry where relevant matches become aligned.

This is one of the cleanest aspects of attention: the mechanism is not just symbolic matching. It is geometric matching in a learned vector space.

Complexity and Trade-Offs

Attention is powerful, but it is not free.

The score matrix has shape n \times n, so both compute and memory grow quadratically with sequence length.

Asymptotic Cost
\text{time complexity} \sim O(n^2 d_k), \qquad \text{memory for attention map} \sim O(n^2)

This becomes expensive for long contexts. That is why there is so much research into:

  • sparse attention
  • windowed attention
  • linear attention approximations
  • FlashAttention-style memory-efficient kernels

But the standard scaled dot-product mechanism remains the reference point because it is conceptually simple and empirically strong.
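The quadratic growth is easy to make concrete. Counting attention-map entries alone, at 4 bytes per float32, per head, per layer:

```python
# entries in the n x n attention map, and their float32 footprint
for n in (1_024, 8_192, 65_536):
    entries = n * n
    mib = entries * 4 / 2**20   # float32 bytes -> MiB
    print(f"n={n:>6}: {entries:>13,} entries = {mib:,.0f} MiB")
```

At n = 1,024 the map costs 4 MiB; at n = 65,536 it is 16 GiB, which is exactly the kind of blow-up memory-efficient kernels are designed to avoid materializing.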

A Minimal PyTorch Implementation

Below is the cleanest implementation that still preserves the core mechanics.

import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfAttention(nn.Module):
    def __init__(self, d_model, d_k=None, d_v=None):
        super().__init__()
        d_k = d_model if d_k is None else d_k
        d_v = d_model if d_v is None else d_v

        self.d_k = d_k
        self.W_q = nn.Linear(d_model, d_k, bias=False)
        self.W_k = nn.Linear(d_model, d_k, bias=False)
        self.W_v = nn.Linear(d_model, d_v, bias=False)

    def forward(self, x, mask=None):
        # x: (batch, seq_len, d_model)
        Q = self.W_q(x)                       # (batch, seq_len, d_k)
        K = self.W_k(x)                       # (batch, seq_len, d_k)
        V = self.W_v(x)                       # (batch, seq_len, d_v)

        scores = Q @ K.transpose(-2, -1)      # (batch, seq_len, seq_len)
        scores = scores / (self.d_k ** 0.5)

        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))

        attn = F.softmax(scores, dim=-1)      # row-wise over keys
        out = attn @ V                        # (batch, seq_len, d_v)

        return out, attn

The essential points are:

  • K.transpose(-2, -1) aligns the last two dimensions for matrix multiplication
  • softmax is applied across the sequence axis, not across the feature axis
  • the output is a weighted sum of value vectors, not key vectors

Tensor Shapes in Batch Form

If x has shape:

(\text{batch}, n, d_{model})

then:

Q, K \in \mathbb{R}^{\text{batch} \times n \times d_k}
V \in \mathbb{R}^{\text{batch} \times n \times d_v}
QK^\top \in \mathbb{R}^{\text{batch} \times n \times n}
AV \in \mathbb{R}^{\text{batch} \times n \times d_v}

That last line is the easiest place to make a mental mistake. The attention matrix mixes across tokens, not across feature dimensions.

A Small Numerical Example

Take one query vector and three key vectors:

q = [1, 0]
k_1 = [1, 0], \qquad k_2 = [0.5, 0.5], \qquad k_3 = [0, 1]

Then the raw scores are:

q \cdot k_1 = 1, \qquad q \cdot k_2 = 0.5, \qquad q \cdot k_3 = 0

If we apply softmax to [1, 0.5, 0], the first token gets the largest weight, the second gets some attention, and the third gets the least.

If the value vectors are:

v_1 = [3, 0], \qquad v_2 = [1, 1], \qquad v_3 = [0, 2]

then the output is the weighted mixture:

o = a_1 v_1 + a_2 v_2 + a_3 v_3

This tiny example captures the entire mechanism. Larger models just do it in high-dimensional spaces, with learned projections, across all tokens at once.
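The example above can be checked mechanically. A NumPy version (no \sqrt{d_k} scaling here, since the dimensions are tiny and the point is the mixture):

```python
import numpy as np

q  = np.array([1.0, 0.0])
Ks = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
Vs = np.array([[3.0, 0.0], [1.0, 1.0], [0.0, 2.0]])

scores = Ks @ q                              # [1.0, 0.5, 0.0]
a = np.exp(scores) / np.exp(scores).sum()    # softmax attention weights

o = a @ Vs                                   # weighted mixture of values

print(a.round(3))   # weights decrease: a1 > a2 > a3
print(o.round(3))   # the output leans toward v1, the best match
```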

What Attention Is Not

There are a few common misunderstandings worth clearing up.

Attention is not just "importance"

An attention weight is not a universal measure of semantic importance. It is a context-dependent routing weight inside a specific head and layer.

Attention is not explanation by default

People often visualize attention maps as if they were direct explanations of model reasoning. Sometimes they are useful hints. They are not the entire story of how the model computes.

Attention is not the whole transformer

A transformer block also includes:

  • residual connections
  • normalization
  • a position-wise feed-forward network

Attention is the routing mechanism, but not the entire architecture.

Intuition That I Keep Coming Back To

The cleanest way I know to remember attention is this:

Each token asks a question (its query), every token advertises what it knows (its key), the match between the two sets the routing weights, and the values carry the content that actually gets mixed.

That framing preserves all the math without reducing it to magic.

Final Summary

The attention mechanism in large language models emerges from a straightforward sequence of ideas:

  1. compare every token with every other token
  2. make the comparison learnable using query and key projections
  3. stabilize the scores with \sqrt{d_k}
  4. normalize row-wise with softmax
  5. use the resulting weights to mix value vectors

That gives us:

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V

and from that one equation we get:

  • global context
  • learned routing
  • parallel computation
  • a practical foundation for modern transformers

This is the core mechanism. Multi-head attention, masking, positional information, and cross-attention are all extensions or constraints around that same core.

For me, the important shift is this: attention is not a mysterious black box layer. It is a structured differentiable lookup over the sequence, where similarity determines routing and routing determines representation.

That is what makes it such a powerful building block for language models.