Transformers and Attention: The Architecture That Changed Everything

By Hiva Mohammadzadeh | Stanford MSCS · UC Berkeley EECS

**TL;DR:** The Transformer, introduced in the 2017 paper "Attention is All You Need," replaced recurrent and convolutional architectures with a purely attention-based model. The core idea: instead of processing tokens sequentially, use self-attention to let every token attend to every other token in parallel. This eliminated the sequential bottleneck of RNNs, made long-range dependencies trivial to capture, and enabled massive parallelism during training. The result was faster training, higher accuracy, and a single architecture that spawned BERT, GPT, T5, LLaMA, and essentially every major language model since. This post breaks down the architecture, the three attention patterns, and the three model families that emerged from it.

The Problem with RNNs

Before the Transformer, the dominant paradigm for sequence modeling was the Recurrent Neural Network. RNNs process sequences one token at a time. The encoder takes the input sequence and a hidden state, produces the next hidden state, and passes it forward. The decoder takes that hidden state and generates output tokens one by one. This worked. RNNs could do machine translation, text summarization, and language modeling.

But they had three fundamental problems that I kept running into during my work:

  1. Sequential processing is slow. Each token depends on the hidden state from the previous token. You cannot parallelize this. On modern GPUs designed for massive parallelism, this is a brutal bottleneck. Training a large RNN on a long sequence means waiting for each step to finish before starting the next.

  2. Fixed processing order. The input sequence must be processed left to right (or right to left). You cannot skip around or attend selectively. The model sees "The" before it sees "cat" before it sees "sat." This rigid ordering limits how the model can reason about the input.

  3. Long-range dependencies are extremely hard. The hidden state is a fixed-size vector that must carry all information from previous tokens. By the time you reach token 500, information about token 1 has been compressed and re-compressed through hundreds of nonlinear transformations. In practice, RNNs struggle to connect information that is more than a few dozen tokens apart. The vanishing gradient problem makes this worse -- gradients shrink exponentially as they backpropagate through time, so early tokens receive almost no learning signal.

The attention mechanism introduced in 2015 by Bahdanau et al. ("Neural machine translation by jointly learning to align and translate") was the first major fix. It allowed the decoder to look back at different parts of the source sentence at each decoding step, rather than relying on a single compressed vector. That was a significant improvement, but the underlying architecture was still recurrent. The sequential bottleneck remained.

**The fundamental limitation:** RNNs compress the entire input into a fixed-length hidden state, then process it sequentially. No amount of attention bolted onto an RNN fully solves the parallelism problem.
[Figure: RNN sequential processing (Token 1 → h1 → wait → Token 2 → h2 → ... , O(n) sequential steps) vs. Transformer parallel processing (every token attends to all others at once, O(1) sequential steps).]

RNNs must process tokens sequentially -- each step waits for the previous one. Transformers process all tokens in parallel via self-attention, reducing sequential steps from O(n) to O(1).


Enter the Transformer

In 2017, Vaswani et al. published "Attention is All You Need," and the title was not an exaggeration. They threw out recurrence entirely. No RNN cells. No convolutions. The entire model is built on attention mechanisms, feed-forward layers, and residual connections. That is it.

The architecture consists of an encoder and a decoder, each made up of a stack of identical layers. The encoder layers have two sub-layers: a multi-head self-attention mechanism and a position-wise feed-forward network. The decoder layers have three sub-layers: masked multi-head self-attention, multi-head cross-attention over the encoder output, and a position-wise feed-forward network. Every sub-layer is wrapped in a residual connection followed by layer normalization.
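The sub-layer arrangement described above can be sketched in a few lines of NumPy. This is a minimal post-norm encoder layer, a simplification rather than a faithful implementation: single-head attention with tied Q/K/V (no learned projections), no dropout, and no learned layer-norm scale/shift parameters:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token's features to zero mean, unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X):
    """Simplified: Q = K = V = X (real layers use learned projections)."""
    d = X.shape[-1]
    return softmax(X @ X.T / np.sqrt(d)) @ X

def feed_forward(X, W1, W2):
    """Position-wise two-layer MLP with ReLU."""
    return np.maximum(0, X @ W1) @ W2

def encoder_layer(X, W1, W2):
    # sub-layer 1: self-attention, wrapped in residual + layer norm
    X = layer_norm(X + self_attention(X))
    # sub-layer 2: position-wise FFN, wrapped in residual + layer norm
    return layer_norm(X + feed_forward(X, W1, W2))

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))    # 4 tokens, d_model = 8
W1 = rng.standard_normal((8, 16))  # expand
W2 = rng.standard_normal((16, 8))  # contract back to d_model
Y = encoder_layer(X, W1, W2)
print(Y.shape)  # (4, 8): same shape in, same shape out, so layers stack
```

Note the residual-then-normalize ("post-norm") ordering, which is what the 2017 paper used; many later models moved the normalization before each sub-layer.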

The results were immediate and decisive. On the WMT 2014 English-to-German translation task, the Transformer outperformed all previous models by more than 2.0 BLEU, and it did this in just 3.5 days of training. On English-to-French, it reduced training cost to one-quarter of the previous best while also improving the BLEU score. It achieved state-of-the-art perplexity on the One Billion Word Benchmark. It even outperformed existing models on English constituency parsing -- a task it was not specifically designed for and received no task-specific tuning.

The Transformer was not an incremental improvement. It was a paradigm shift.


Self-Attention: The Core Innovation

Self-attention is the mechanism that makes the Transformer work. The idea is straightforward: for each token in the sequence, compute how much it should attend to every other token in the sequence, then use those attention weights to create a weighted combination of all token representations.

Here is how it works. Each token is projected into three vectors: a query (Q), a key (K), and a value (V). The attention score between two tokens is the dot product of the query of the first token and the key of the second token. These scores are scaled (divided by the square root of the key dimension) and passed through a softmax to get attention weights. The output for each token is the weighted sum of all value vectors, where the weights are the attention scores.

[Figure: the Query/Key/Value mechanism, from input tokens through linear projections, scaled dot products, and softmax to the output. Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V]

The self-attention mechanism in detail. Each input token is projected into Query, Key, and Value vectors. The dot product of Q and K (scaled) produces attention weights via softmax. These weights determine how much each token's Value contributes to the output.

**The key intuition:** Queries ask "what am I looking for?" Keys answer "what do I contain?" Values carry "what information do I provide?" The dot product between query and key measures relevance. The weighted sum of values produces a context-aware representation.
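The computation described above fits in a few lines of NumPy. This is a minimal single-head sketch; in a real layer, Q, K, and V come from learned linear projections of the input, while here we simply pass the same matrix for all three:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # (n, n) relevance scores
    scores -= scores.max(axis=-1, keepdims=True)   # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # softmax: each row sums to 1
    return weights @ V, weights                    # weighted sum of values

# toy example: 3 tokens with d_k = 4 (no learned projections, Q = K = V)
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4))
out, w = scaled_dot_product_attention(X, X, X)
print(out.shape)          # (3, 4): one context-aware vector per token
print(w.sum(axis=-1))     # each row of attention weights sums to 1
```

Every row of `w` is computed independently, which is exactly why the whole thing parallelizes: there is no loop over positions.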

This solves all three RNN problems in one shot. First, there is no sequential dependency -- all attention scores can be computed in parallel across all positions. Second, there is no fixed processing order -- every token directly attends to every other token regardless of distance. Third, long-range dependencies are trivial -- a token at position 500 can attend directly to a token at position 1 with the same computational cost as attending to its neighbor. The information does not need to survive hundreds of compression steps.

The model is trained using supervised learning with cross-entropy loss, same as you would train any sequence-to-sequence model. The difference is that the attention-based architecture makes training dramatically faster because the computation is parallelizable.


Multi-Head Attention

A single attention function computes one set of attention weights. But different parts of a sentence carry different types of relationships. The word "it" might need to attend to its antecedent for coreference, while simultaneously attending to a nearby verb for syntactic structure.

Multi-head attention addresses this by running multiple attention functions in parallel. Each "head" has its own learned Q, K, and V projection matrices, so each head can learn to attend to different aspects of the input. One head might capture syntactic relationships. Another might capture semantic similarity. A third might focus on positional proximity.

The outputs of all heads are concatenated and projected through a final linear layer. This gives the model the ability to attend to information at different levels of granularity simultaneously.

In practice, multi-head attention is what gives Transformers their representational power. A single head would force the model to compress all types of relationships into one attention pattern. Multiple heads let it maintain separate, specialized attention patterns that are only combined at the output.
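A minimal sketch of the split-heads / attend / concatenate / project pipeline, assuming the head dimension is d_model / num_heads and omitting bias terms:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads):
    n, d_model = X.shape
    d_head = d_model // num_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv   # shared projections, then split per head

    def split(M):
        # (n, d_model) -> (num_heads, n, d_head): each head gets its own slice
        return M.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)  # (h, n, n)
    heads = softmax(scores) @ Vh                           # (h, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)  # rejoin the heads
    return concat @ Wo                                     # final output projection

rng = np.random.default_rng(1)
d_model, n, h = 8, 5, 2
X = rng.standard_normal((n, d_model))
Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) for _ in range(4))
out = multi_head_attention(X, Wq, Wk, Wv, Wo, h)
print(out.shape)  # (5, 8)
```

Each head sees only its own d_head-dimensional slice of the projected input, so the total cost is roughly that of one full-width head, not h times more.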


The Encoder-Decoder Architecture

The original Transformer uses an encoder-decoder structure, which builds on the seq2seq framework introduced by Sutskever et al. in 2014 ("Sequence to sequence learning with neural networks"). That original seq2seq model used two RNNs -- one to encode the input into a fixed-length vector, and another to decode that vector into the output. The major limitation was the fixed-length representation bottleneck.

The Transformer encoder-decoder keeps the same high-level structure but replaces everything with attention. The encoder processes the full input sequence using bidirectional self-attention -- every token can attend to every other token. It produces a sequence of representations, one per input token.

The decoder generates output tokens one at a time, using masked self-attention to ensure it can only attend to previously generated tokens (not future ones -- that would be cheating during training). It also uses cross-attention to attend to the encoder's output, allowing each generated token to look at the full input sequence when deciding what to output next.
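Cross-attention is the same scaled dot-product computation with one change of wiring: queries come from the decoder's states, while keys and values come from the encoder's output. A minimal sketch with projections omitted:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(dec, enc):
    """Queries from the decoder; keys and values from the encoder output."""
    d = enc.shape[-1]
    return softmax(dec @ enc.T / np.sqrt(d)) @ enc

rng = np.random.default_rng(2)
enc_out = rng.standard_normal((7, 8))  # 7 source tokens, d_model = 8
dec_in  = rng.standard_normal((3, 8))  # 3 target tokens generated so far
out = cross_attention(dec_in, enc_out)
print(out.shape)  # (3, 8): one source-informed vector per target token
```

Note that the output has one row per decoder token, but each of those rows is a weighted mix over all seven encoder tokens -- this is how each generated token "looks at" the full input.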

This architecture is natural for tasks where the input and output are different sequences -- machine translation, summarization, question answering. The encoder understands the input; the decoder produces the output conditioned on that understanding.


The Three Architecture Families

The original Transformer used both an encoder and a decoder. But researchers quickly discovered that you do not always need both. Three distinct architecture families emerged, each optimized for different tasks.

[Figure: the three Transformer architecture families -- Encoder-Only (BERT, RoBERTa: bidirectional self-attention, for classification/NER/QA), Decoder-Only (GPT-2/3/4: causal self-attention, for generation/code/chat), and Encoder-Decoder (T5, BART: cross-attention bridge, for translation/summarization).]

The three architecture families that emerged from the Transformer. Encoder-only (BERT) for understanding, Decoder-only (GPT) for generation, Encoder-Decoder (T5) for sequence-to-sequence tasks. Each uses attention differently.

Encoder-Only (BERT)

The encoder-only architecture takes an input sequence, passes it through the encoder stack, and produces a contextual representation for every input token (often pooled into a single fixed-size vector for classification). There is no decoder. The model does not generate text -- it produces representations.

The attention is bidirectional: every token sees every other token, both before and after it. This gives the model full context in both directions, which is ideal for understanding tasks.

BERT (2018) is the canonical encoder-only model. It was pre-trained using masked language modeling (predicting randomly masked tokens) and next sentence prediction. The result was a model that produces rich, context-aware representations that can be fine-tuned for classification, named entity recognition, sentiment analysis, and other understanding tasks.

Encoder-only models are particularly useful when labeled data is limited. You pre-train on a large unsupervised corpus, then fine-tune on a small labeled dataset. RoBERTa, ELECTRA, and ALBERT are all encoder-only models that followed BERT.

Decoder-Only (GPT)

The decoder-only architecture takes an input and generates an output sequence token by token. There is no separate encoder. The attention is unidirectional: each token can only attend to tokens before it (and itself). This is called autoregressive generation -- the model predicts the next token given all previous tokens.

GPT is the canonical decoder-only model. It generates variable-length output, one token at a time, making it natural for text generation, dialogue, code completion, and open-ended tasks.

The key insight behind decoder-only models is that you do not need a separate encoder if the task is generation. The model learns to both understand the input and generate the output using the same stack of layers.

Encoder-Decoder (T5, BART)

The encoder-decoder architecture combines both components. The encoder processes the input into a sequence of representations, and the decoder generates the output conditioned on those representations via cross-attention. This is the original Transformer architecture.

It is the natural choice for tasks where the input and output are structurally different: machine translation (input in one language, output in another), image captioning (input is an image, output is text), and summarization (input is long text, output is short text).

**How to choose:** Need to understand and classify? Encoder-only. Need to generate? Decoder-only. Need to transform one sequence into a different sequence? Encoder-decoder. In practice, decoder-only models have become dominant for general-purpose use because they can be scaled effectively and handle a wide range of tasks through prompting.

Attention Patterns: Forward, Causal, and Triangle

Not all attention is created equal. The way you constrain which tokens can attend to which other tokens fundamentally changes what the model can do. Three patterns matter.

Forward Attention (Self-Attention)

In forward attention, each query attends to all keys and values up to and including its current position. It can access the past and present, but not the future. This is the attention pattern used in language modeling, where the goal is to predict the next token given the preceding context.

Forward attention gives the model access to all preceding context while preventing it from "seeing the answer" at future positions. It is the foundation of autoregressive generation.

Causal Attention

Causal attention is similar to forward attention but explicitly includes a mask that prevents each query from attending to keys and values after its current position. In practice, this is implemented as an upper-triangular mask applied to the attention scores before softmax -- setting future positions to negative infinity so they get zero attention weight after softmax.

Causal attention is the standard pattern in decoder-only models like GPT. It ensures that generating token t depends only on tokens 1 through t-1, which is exactly the autoregressive property we need for text generation. The model generates one token at a time, and each token is produced without any knowledge of what comes after it.
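The mask-before-softmax trick can be demonstrated directly. With all raw scores set to zero, each row attends uniformly over its own and earlier positions, and every future position receives exactly zero weight:

```python
import numpy as np

n = 4
scores = np.zeros((n, n))                          # stand-in for raw QK^T scores
mask = np.triu(np.ones((n, n), dtype=bool), k=1)   # True above the diagonal = future
scores[mask] = -np.inf                             # -inf -> exp() -> 0 after softmax
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(weights)
# row t spreads its attention over tokens 0..t only:
# row 0 -> [1, 0, 0, 0], row 1 -> [0.5, 0.5, 0, 0], and so on
```

Because the mask is applied to the scores rather than the weights, the softmax still normalizes correctly over the positions that remain visible.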

Triangle Attention

Triangle attention is a variant where each query only attends to a subset of keys and values, forming a triangular pattern defined by a maximum distance between query and key positions. Instead of attending to all previous tokens, a token at position t might only attend to tokens within some window around it.

The purpose is efficiency. Full self-attention has O(n^2) complexity in the sequence length, which becomes prohibitive for very long sequences. Triangle attention reduces this cost to roughly O(n·w) for a window of size w. Long-range dependencies can still be captured, because stacking layers widens the effective receptive field: information hops from window to window, layer by layer, so distant tokens can influence each other indirectly.
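A windowed causal mask of the kind described here can be built with two broadcast comparisons. The function name `windowed_causal_mask` and the exact window convention are illustrative choices, not from the original paper:

```python
import numpy as np

def windowed_causal_mask(n, window):
    """True where attention is BLOCKED: future tokens, or past tokens
    farther than `window` positions behind the query."""
    i = np.arange(n)[:, None]   # query positions (rows)
    j = np.arange(n)[None, :]   # key positions (columns)
    return (j > i) | (i - j > window)

m = windowed_causal_mask(6, window=2)
# each query attends to at most window + 1 positions (itself plus 2 back),
# so allowed pairs grow linearly in n instead of quadratically
print((~m).sum())  # 15 allowed (query, key) pairs for n = 6
```

Applying this mask the same way as the causal mask (set blocked scores to negative infinity before the softmax) yields the windowed attention pattern.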


The Impact: What Came Next

The Transformer did not just improve machine translation. It created an entirely new paradigm.

In 2018, BERT showed that bidirectional pre-training on large text corpora using the Transformer encoder produced representations that dominated every NLP benchmark. BERT was followed by RoBERTa (better pre-training recipe), ELECTRA (replaced masking with a discriminator), and ALBERT (parameter-efficient factorization).

GPT (2018) and GPT-2 (2019) showed that decoder-only Transformers trained autoregressively on massive text corpora could generate remarkably coherent text. GPT-3 (2020) showed that scaling this approach to 175 billion parameters produced a model capable of few-shot learning across tasks it was never explicitly trained on.

T5 (2020) unified all NLP tasks into a text-to-text framework using the full encoder-decoder architecture. BART combined denoising pre-training with the encoder-decoder structure for generation tasks.

Then came the instruction-tuning era: InstructGPT, ChatGPT, LLaMA, and the models that brought large language models into mainstream use. Every one of them is a Transformer at its core. The architecture from the 2017 paper -- self-attention, multi-head attention, feed-forward layers, residual connections, layer normalization -- remains fundamentally unchanged. What has changed is scale, training data, and training methodology.

**The 2017 paper's lasting contribution** is not any single technique. It is the demonstration that attention alone, without recurrence or convolution, is sufficient for state-of-the-art sequence modeling. That insight unlocked everything that followed.

Key Takeaways

  1. RNNs had three fatal flaws: sequential processing (slow), fixed order (rigid), and poor long-range dependencies (vanishing gradients). The Transformer fixes all three.

  2. Self-attention is the core mechanism. Each token computes attention scores with every other token using queries, keys, and values. This enables parallel processing and direct long-range connections.

  3. Multi-head attention lets the model attend to different types of relationships simultaneously -- syntactic, semantic, positional -- by running multiple attention functions in parallel.

  4. Three architecture families emerged from the Transformer: encoder-only (BERT, for understanding), decoder-only (GPT, for generation), and encoder-decoder (T5, for sequence-to-sequence tasks).

  5. Three attention patterns control what each token can see: forward attention (past and present), causal attention (masked future, used in autoregressive generation), and triangle attention (windowed subset, for efficiency).

  6. The results were decisive. The original Transformer outperformed all existing models on translation by more than 2.0 BLEU, reduced training cost by 75%, and generalized to tasks it was not designed for.

  7. Every major language model since 2017 -- BERT, GPT, T5, LLaMA, and their descendants -- is a Transformer. The architecture has proven to be the foundation of modern NLP.


Based on my CS199 Supervised Independent Study at UC Berkeley and the original presentations I created in 2023.