Every token in a 512-token sequence had to wait for the one before it. That single architectural choice -- sequential processing in RNNs -- was the ceiling on how large and how capable language models could become. The Transformer, introduced in "Attention Is All You Need" (Vaswani et al., 2017), removed that ceiling by letting every token communicate directly with every other token, simultaneously.
What followed was not incremental progress. Within two years, the architecture had displaced RNNs, LSTMs, and CNNs across virtually every domain in deep learning. Understanding why requires tracing the exact failure modes it fixed.
Part I: The Problem
Why sequences are hard
Language is sequential. "The cat sat on the mat" has meaning because of its order. Shuffle the words and you get nonsense. Any model that processes language needs to understand both the individual tokens and their relationships across positions.
Before transformers, the dominant approach was the Recurrent Neural Network (RNN). The idea was natural: read one token at a time, maintain a hidden state that summarizes everything seen so far, and update that state with each new token.
The sequential bottleneck
The fundamental problem with RNNs is that token N cannot be processed until tokens 1 through N-1 are finished. This creates two issues:
1. Vanishing gradients. As the hidden state passes through dozens or hundreds of steps, gradients shrink exponentially during backpropagation. By the time the gradient signal reaches early tokens, it's effectively zero. The model forgets.
2. No parallelism. Each step depends on the previous step's output. You can't split the work across GPU cores. For a 512-token sequence, you wait 512 sequential steps. This made training painfully slow.
LSTMs and GRUs mitigated the gradient problem with gating mechanisms -- dedicated "forget" and "update" gates that control information flow. But they couldn't fix the parallelism problem. Training remained sequential.
The parallelism gap matters more than it might seem. On modern hardware, the difference between sequential and parallel processing isn't just speed -- it determines whether you can train on enough data to make the model capable. Parallelism is what made scale possible. And scale, as it turned out, was what made intelligence emerge.
Part II: The Attention Mechanism
The core insight
The RNN's problem is information compression: it tries to squeeze an entire sequence into a single fixed-size vector. By the time the model processes token 512, most of what happened at token 1 is gone.
What if, instead of compressing an entire sequence into a single hidden state vector, we let each token look directly at every other token and decide what's relevant?
This is attention. For each token in a sequence, we compute a set of weights over all other tokens, then take a weighted sum. The word "cat" might attend strongly to "sat" (its verb) and weakly to "the" (a generic determiner). These weights are learned, not hardcoded.
Query, Key, Value
The mechanism works through three learned projections. Every input token embedding gets transformed into three vectors:
- Query (Q) -- "What am I looking for?" This represents the current token's question to the rest of the sequence.
- Key (K) -- "What do I contain?" This represents what each token offers as a match.
- Value (V) -- "What information do I carry?" This is the actual content that gets aggregated.
The attention weight between token i and token j is the dot product of Q_i and K_j, scaled by the square root of the key dimension, then passed through softmax. The output for each token is the weighted sum of all Value vectors.
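Written out, this is the scaled dot-product attention formula from the paper:

\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
\]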
Why three separate projections? Because what a token is searching for (Query) is different from what it advertises (Key), which is different from the information it actually carries (Value). A word like "it" might have a Query that searches for nouns, a Key that says "pronoun", and a Value that carries contextual features.
In code, the full computation for a single attention head looks like this:
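A minimal NumPy sketch (variable names are illustrative, not the paper's reference code; the softmax is written out for numerical stability):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract row max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention for one head.
    Q, K: (seq_len, d_k); V: (seq_len, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len) match scores
    weights = softmax(scores, axis=-1)   # each row is a distribution over tokens
    return weights @ V, weights          # weighted sum of Value vectors
```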
The division by sqrt(d_k) is not cosmetic. Without it, large dot products push the softmax into regions with near-zero gradients, making training unstable. The scaling keeps the variance of the scores roughly constant regardless of the key dimension.
Step-by-step computation
Let's trace through the full attention computation for a short sequence: "I love ML !"
Step 1 -- Input Embeddings. Each token is converted into a dense vector. These are the raw representations the model starts with.
Step 2 -- Q, K, V Projection. Each embedding is multiplied by three learned weight matrices (W_Q, W_K, W_V) to produce Query, Key, and Value vectors. These live in separate subspaces optimized for matching (Q, K) and carrying information (V).
Step 3 -- Attention Scores. We compute the dot product between every Query and every Key, then scale by the square root of the key dimension. High scores mean a strong match between what the Query is looking for and what the Key offers.
Step 4 -- Softmax. The raw scores are passed through softmax row-wise, converting them to probabilities. Each row now sums to 1.0 -- these are the attention weights.
Step 5 -- Weighted Sum. Finally, we multiply the attention weights by the Value vectors. Each token's output is a weighted combination of all Value vectors, with weights determined by how relevant each token is.
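The five steps can be traced end-to-end with toy numbers -- random stand-ins for the learned embeddings and weight matrices:

```python
import numpy as np

rng = np.random.default_rng(42)
tokens = ["I", "love", "ML", "!"]
d_model, d_k = 8, 4

# Step 1 -- toy input embeddings (random stand-ins for learned ones)
X = rng.normal(size=(len(tokens), d_model))

# Step 2 -- project into Q, K, V with (random stand-in) weight matrices
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

# Step 3 -- scaled attention scores
scores = Q @ K.T / np.sqrt(d_k)          # (4, 4)

# Step 4 -- row-wise softmax: each row becomes a distribution over tokens
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Step 5 -- weighted sum of Value vectors
output = weights @ V                     # (4, 4): one context vector per token
```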
Seeing attention weights
Once computed, the attention weights form a matrix. Each row corresponds to a Query token, each column to a Key token. The matrix reveals which tokens the model considers important for understanding each word.
In a trained model, clear patterns appear in this matrix for "The cat sat on the mat": "cat" attends heavily to "sat" (a syntactic subject-verb dependency), "mat" attends to "on" and "the" (local context), and the diagonal shows self-attention -- each token maintains some focus on itself. These patterns emerge purely from training, not from any explicit linguistic rules.
Part III: Multi-Head Attention
A single attention computation captures one type of relationship. But language encodes many relationships simultaneously:
- Syntactic -- subject-verb agreement, modifier attachment
- Semantic -- meaning similarity, coreference (what does "it" refer to?)
- Positional -- nearby words are often more relevant
- Long-range -- a pronoun on page 3 might refer to a character introduced on page 1
One head cannot optimize for all of these at once. The Query-Key dot product is a single scalar -- it can only rank relevance along one axis at a time.
Multi-head attention addresses this by running multiple attention computations in parallel, each with its own learned Q, K, V projection matrices. If we have h heads, each head gets its own W_Q, W_K, W_V of reduced dimension (d_model / h). The heads operate independently, then their outputs are concatenated and projected through a final weight matrix.
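The split-attend-concatenate-project pattern can be sketched in NumPy (shapes and names are illustrative):

```python
import numpy as np

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    """Split d_model into h heads, attend per head, concatenate, project.
    X: (seq, d_model); W_Q, W_K, W_V, W_O: (d_model, d_model)."""
    seq, d_model = X.shape
    d_head = d_model // h
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    # reshape each projection to (h, seq, d_head): one subspace per head
    split = lambda M: M.reshape(seq, h, d_head).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)   # (h, seq, seq)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                           # softmax per head
    heads = w @ Vh                                          # (h, seq, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq, d_model) # concatenate heads
    return concat @ W_O                                     # final projection
```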
In practice, different heads learn to specialize. One head might track syntactic dependencies. Another might focus on adjacent tokens. A third might capture long-range references. The concatenated output integrates all these perspectives.
The original Transformer uses 8 attention heads with d_model = 512, giving each head a 64-dimensional subspace. Modern large language models use 32, 64, or even 128 heads.
Part IV: Positional Encoding
The order problem
Attention treats its input as a set, not a sequence. Nothing in the Q-K-V computation depends on position: permute the input tokens and the outputs permute the same way, nothing more -- "cat sat" and "sat cat" would produce the same attention weights, just reordered. But word order matters enormously in language.
The solution: add positional information to the input embeddings before feeding them to the transformer.
Sinusoidal encodings
The original Transformer uses sinusoidal functions at different frequencies. For each position and each dimension of the embedding, a unique value is computed using sine (even dimensions) and cosine (odd dimensions) at geometrically increasing wavelengths.
The intuition: each dimension encodes position at a different resolution. Low-numbered dimensions oscillate rapidly (like the "seconds hand" of a clock), while high-numbered dimensions change slowly (like the "hours hand"). Together, they give each position a unique fingerprint.
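The encoding itself is a few lines. This sketch follows the paper's formulas, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encodings, one row per position (d_model even)."""
    pos = np.arange(max_len)[:, None]          # (max_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]      # even dimension indices
    angle = pos / (10000 ** (i / d_model))     # geometrically increasing wavelengths
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                # sine on even dimensions
    pe[:, 1::2] = np.cos(angle)                # cosine on odd dimensions
    return pe
```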
Plotted as a heatmap with positions along one axis and embedding dimensions along the other, the low dimensions alternate rapidly across positions while the high dimensions change gradually. This multi-frequency representation lets the model learn both local and global position relationships.
A key property: the encoding for position p+k can be expressed as a linear function of the encoding at position p, for any fixed offset k. This means the model can learn relative positions, not just absolute ones.
Part V: The Encoder
The Transformer encoder processes the entire input sequence in parallel and produces a set of contextual representations -- one vector per token, enriched with information from every other token.
Encoder block structure
Each encoder block contains two sub-layers:
- Multi-Head Self-Attention -- every token attends to every other token in the input
- Position-wise Feed-Forward Network -- two linear transformations with a ReLU activation, applied independently to each position
The feed-forward network is often overlooked but carries most of the model's parameters. Where attention routes information between tokens, the FFN transforms each token's representation in isolation. Recent interpretability work suggests the FFN layers store factual associations -- the "knowledge" of the model -- while attention handles routing and retrieval.
Each sub-layer is wrapped with:
- A residual connection -- the input to the sub-layer is added to its output, so gradients flow directly backward without degradation
- Layer normalization -- stabilizes training by normalizing activations
The residual connection is critical. Without it, stacking 6 encoder blocks would cause gradients to vanish or explode before reaching early layers. With it, the gradient can bypass any sub-layer entirely, and the model learns to use each block only when it helps.
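The sub-layer wiring can be sketched schematically -- post-norm ordering as in the original paper, with the attention and feed-forward computations passed in as callables (a sketch, not a full implementation):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each position's vector to zero mean, unit variance."""
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def feed_forward(x, W1, b1, W2, b2):
    # two linear layers with a ReLU in between, applied independently per position
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def encoder_block(x, self_attn, ffn):
    # each sub-layer: compute -> add residual -> layer norm
    x = layer_norm(x + self_attn(x))   # residual around self-attention
    x = layer_norm(x + ffn(x))         # residual around feed-forward
    return x
```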
The encoder stacks N of these blocks (N=6 in the original paper). Each block refines the representations further. By the final layer, each token's vector captures rich contextual information from the entire sequence.
Part VI: The Decoder
The decoder generates output tokens one at a time, autoregressively -- each new token is predicted based on the encoder output and all previously generated tokens.
Masked self-attention
The decoder has a critical constraint: when generating token N, it must not attend to tokens N+1, N+2, and so on, because those tokens don't exist yet at generation time. This is enforced by masking -- setting the attention scores for all future positions to negative infinity before the softmax, which zeros out their weights after the softmax.
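A minimal sketch of the mask: set the upper triangle of the score matrix (the future positions) to negative infinity, then apply softmax as usual:

```python
import numpy as np

def causal_mask(scores):
    """Replace scores at future positions (upper triangle) with -inf."""
    seq = scores.shape[-1]
    future = np.triu(np.ones((seq, seq), dtype=bool), k=1)  # True above diagonal
    return np.where(future, -np.inf, scores)

scores = np.zeros((4, 4))            # uniform scores, for illustration
masked = causal_mask(scores)
w = np.exp(masked - masked.max(-1, keepdims=True))
w = w / w.sum(-1, keepdims=True)
# row i now spreads its weight only over positions 0..i:
# row 0 is [1, 0, 0, 0], row 1 is [0.5, 0.5, 0, 0], and so on
```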
The contrast with the encoder is worth pausing on: encoder self-attention has full visibility (every token sees every other token), while decoder self-attention is causal -- as each token is generated, it can see only the tokens before it, so the visible region of the attention matrix grows by one row per generation step.
Cross-attention
The decoder's second attention sub-layer is cross-attention -- it attends to the encoder's output. Here, the Queries come from the decoder's current state, but the Keys and Values come from the encoder. This is how the decoder "reads" the source sequence.
For machine translation, this means: the decoder generating French tokens sends Queries asking "what part of the English input is relevant for the next French word?" and gets back weighted combinations of the English representations.
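Structurally, cross-attention is the same computation with different inputs -- and the attention matrix is now rectangular, target length by source length (a sketch; names are illustrative):

```python
import numpy as np

def cross_attention(dec_state, enc_out, W_Q, W_K, W_V):
    """Queries from the decoder; Keys and Values from the encoder output."""
    Q = dec_state @ W_Q                        # (tgt_len, d_k)
    K = enc_out @ W_K                          # (src_len, d_k)
    V = enc_out @ W_V                          # (src_len, d_v)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])    # (tgt_len, src_len)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)              # each target row sums to 1
    return w @ V                               # (tgt_len, d_v)
```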
Decoder block structure
Each decoder block has three sub-layers:
- Masked Multi-Head Self-Attention -- attends to previously generated tokens only
- Multi-Head Cross-Attention -- attends to encoder output (K, V from encoder)
- Position-wise Feed-Forward Network -- same as in the encoder
Each with residual connections and layer normalization. The decoder also stacks N blocks.
After the final decoder block, a linear projection maps to vocabulary size, followed by softmax to get a probability distribution over the next token.
Part VII: The Full Architecture
Now we can see the complete picture -- encoder and decoder working together. The encoder processes the full input sequence in parallel. Its output feeds into every decoder block via cross-attention. The decoder generates output tokens autoregressively, attending to both its own previous outputs and the encoder's representations.
The cross-attention connection is the key structural feature -- it is the bridge between understanding the input and generating the output.
Part VIII: Training
Teacher forcing
During training, the decoder doesn't generate tokens autoregressively. Instead, it receives the correct target sequence (shifted right by one position) as input -- a technique called teacher forcing. This means every decoder position is trained in parallel, just like the encoder.
The mask ensures that position i only sees target tokens 0 through i-1, maintaining the autoregressive constraint even during parallel training.
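Concretely, with hypothetical token ids for a target sequence:

```python
# Hypothetical token ids for a target sequence "<bos> le chat dort <eos>"
target = [1, 7, 12, 9, 2]

decoder_input = target[:-1]  # [1, 7, 12, 9] -- what the decoder sees (shifted right)
labels        = target[1:]   # [7, 12, 9, 2] -- what each position must predict
```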
The loss function
The model is trained with cross-entropy loss between the predicted probability distribution (softmax output at each position) and the true next token. The loss is summed across all positions in the target sequence.
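As a sketch, with the loss summed over positions as described:

```python
import numpy as np

def cross_entropy(logits, labels):
    """Negative log-likelihood of the true next token, summed over positions.
    logits: (seq_len, vocab_size); labels: (seq_len,) integer token ids."""
    logits = logits - logits.max(-1, keepdims=True)              # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].sum()
```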
This is the payoff of the Transformer's design. Training a 6-layer encoder-decoder on a million sentence pairs means computing this loss for all target positions at once. An RNN would need a sequential loop over each output token.
Learning rate schedule
The original Transformer uses a custom learning rate schedule: linear warmup for the first few thousand steps, then decay proportional to the inverse square root of the step number. This was critical for stable training.
The warmup exists because Adam's variance estimates are unreliable at the start of training when it has seen only a few batches. A high learning rate during this period sends the model far from a good initialization. The warmup buys time for those estimates to stabilize.
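The schedule itself is one line (d_model = 512 and warmup = 4000 are the paper's base settings):

```python
def noam_lr(step, d_model=512, warmup=4000):
    """lr = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5):
    linear warmup to the peak at `warmup`, then inverse-sqrt decay."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```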
Part IX: How the Architecture Fractured Into Variants
The original Transformer was designed for sequence-to-sequence tasks like machine translation, where you have both a source and a target sequence. But once researchers understood what each component was doing, they started removing pieces.
The encoder is optimized for understanding. It attends bidirectionally -- every token sees every other token. This produces rich representations but cannot generate text, because generation requires a causal mask. BERT (2018) took the encoder alone and pre-trained it by randomly masking input tokens and predicting them from context. The result was a model that excelled at classification, named entity recognition, and question answering -- tasks where you process a full input and produce a label.
The decoder is optimized for generation. GPT (2018) took the decoder alone and pre-trained it with next-token prediction on massive text corpora. No cross-attention needed -- the decoder attends only to its own previous outputs. This is the architecture behind every modern large language model, including GPT-4 and Claude. The encoder-decoder design that was central to the original paper turns out to be unnecessary for general-purpose language models.
The architecture also left the text domain entirely. Vision Transformers (ViT, 2020) replaced text tokens with 16x16 image patches, added a positional encoding over patch positions, and fed them through a standard encoder. The model matched or beat convolutional networks on image classification, despite having no spatial inductive bias baked in. Attention learned to model spatial relationships from data alone.
What made this branching possible is the Transformer's modularity. Self-attention, cross-attention, feed-forward networks, and residual connections are independent building blocks. The original paper combined them in one way. The field then spent several years discovering which combinations work for which problems -- and found that the individual pieces were more general than the paper's particular assembly suggested.
The constraint that turned out not to matter was the encoder-decoder split. The constraint that turned out to matter enormously was scale. More layers, more heads, more data, more compute -- and the decoder-only architecture absorbed it all. That scaling behavior, more than any individual architectural choice, is what makes these models do what they do.
