Details of the BERT Model

By Hiva Mohammadzadeh | Stanford MSCS · UC Berkeley EECS

**TL;DR:** BERT is a pre-trained, encoder-only transformer that reads text bidirectionally -- it sees the full context around every token at once. The base model stacks 12 transformer blocks, each with multi-head self-attention and a position-wise feed-forward network, operating on 128 tokens with a 768-dimensional embedding. Training happens in two phases: masked language modeling (MLM) and next sentence prediction (NSP) for pre-training, then task-specific fine-tuning. A single transformer block runs about 1.86 billion FLOPs, dominated by the feed-forward network. Understanding these dimensions and computations is the key to reasoning about BERT's cost and behavior.

What Makes BERT Different

BERT stands for Bidirectional Encoder Representations from Transformers, and the word that matters most in that name is bidirectional. Before BERT, most language models read text left-to-right (or right-to-left). BERT reads the entire sequence at once. That means when the model processes a token, it has access to every token to its left and every token to its right. The context is complete.

BERT is an encoder-only model. It does not have the decoder stack you see in GPT-style architectures. There is no autoregressive generation here -- the entire input is processed in parallel. This design choice is what makes BERT so effective for understanding tasks: classification, named entity recognition, question answering, and anything else where you need to comprehend the full input before producing an output.

The base configuration I work with uses 128 tokens as the sequence length, an embedding dimension of 768, 12 attention heads, and a feed-forward filter size of 3072. The model consists of 12 stacked transformer encoder blocks, each containing two sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network.
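These dimensions can be collected in a small configuration object. This is just a sketch for reference -- the field names are my own, not from any particular library:

```python
from dataclasses import dataclass

@dataclass
class BertBaseConfig:
    seq_len: int = 128        # tokens per sequence
    hidden_size: int = 768    # embedding dimension (Ed)
    num_layers: int = 12      # stacked encoder blocks
    num_heads: int = 12       # attention heads per block
    ffn_size: int = 3072      # feed-forward filter size (4 x hidden)

cfg = BertBaseConfig()
assert cfg.hidden_size % cfg.num_heads == 0
head_dim = cfg.hidden_size // cfg.num_heads  # 64 dimensions per head
```

Every other number in this post falls out of these five values.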

[Diagram] BERT's encoder-only architecture: token input (128 tokens) → token + position embeddings (128, 768) → 12 stacked blocks, each with multi-head self-attention (12 heads, d = 64), add & layer norm, a feed-forward network (768 → 3072 → 768) with GELU, and a second add & layer norm → hidden states (128, 768). The [CLS] token feeds the classification head.

BERT stacks 12 identical encoder blocks. Each block: self-attention → add & norm → FFN → add & norm. Bidirectional: every token attends to every other token.


Training: Masked Language Modeling and Next Sentence Prediction

BERT's training happens in two distinct phases.

Pre-Training

The model is pre-trained on large, diverse text corpora using a self-supervised learning approach. Two objectives drive this phase:

  - Masked language modeling (MLM): a random subset of input tokens (15% in the original recipe) is masked, and the model learns to predict the original tokens from the surrounding bidirectional context.
  - Next sentence prediction (NSP): given two sentences A and B, the model predicts whether B actually follows A in the corpus or is a randomly sampled sentence.
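The MLM masking step can be sketched in a few lines of NumPy. This is a simplification of the standard recipe (which additionally leaves 10% of selected tokens unchanged and replaces 10% with random tokens); the token ids here are fake, and 103 is the [MASK] id in the standard BERT vocabulary:

```python
import numpy as np

rng = np.random.default_rng(0)
MASK_ID = 103                                        # [MASK] token id
token_ids = rng.integers(1000, 30000, size=128)      # fake 128-token input

# Select 15% of positions to corrupt, as in the original BERT recipe.
n_mask = int(0.15 * len(token_ids))
mask_positions = rng.choice(len(token_ids), size=n_mask, replace=False)

# Record the original tokens as labels, then overwrite them with [MASK].
labels = np.full_like(token_ids, -100)               # -100 = ignore in loss
labels[mask_positions] = token_ids[mask_positions]
token_ids[mask_positions] = MASK_ID
```

The loss is computed only at the masked positions, which is why the unmasked labels are set to an ignore value.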

Fine-Tuning

After pre-training, you fine-tune BERT on your specific downstream task by adding a task-specific output layer. The pre-trained weights give the model a strong starting point, and the fine-tuning phase adapts those representations to your particular problem. This is what makes BERT so practical -- the expensive pre-training happens once, and fine-tuning is comparatively cheap.
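For a classification task, the added output layer can be as small as one weight matrix applied to the final [CLS] representation. A minimal sketch with random stand-in weights (no real pre-trained model involved):

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes = 2
hidden = rng.standard_normal((128, 768))        # final-layer hidden states

# Fine-tuning adds one new weight matrix on top of the pre-trained stack.
W_cls = rng.standard_normal((768, num_classes)) * 0.02

cls_vector = hidden[0]                          # [CLS] sits at position 0
logits = cls_vector @ W_cls                     # shape (num_classes,)
probs = np.exp(logits) / np.exp(logits).sum()   # softmax over classes
```

During fine-tuning, both `W_cls` and the pre-trained weights are updated, but `W_cls` is the only part trained from scratch.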


The Architecture

BERT's architecture has two key ingredients: the transformer encoder stack and the task-specific output layer on top.

Encoder-Only Architecture with stacked Encoder Blocks, each containing Feed Forward Neural Network and Self-Attention, with Token Input at bottom and Token Output at top


Each encoder block follows the same pattern: self-attention, followed by a residual connection and layer normalization, followed by a feed-forward network, followed by another residual connection and layer normalization. This pattern repeats 12 times. The output of the final block feeds into whatever task-specific head you have added for fine-tuning.


Inside the Transformer Encoder

Each transformer encoder block has two main components: the attention mechanism and the feed-forward network. But the supporting cast -- residual connections, layer normalization, and the choice of activation function -- is just as important for making the model actually trainable.


Dimensions of the Weight Matrices

Getting the dimensions right is where a lot of confusion lives, so I want to lay these out precisely.

Input Embedding

The input embedding matrix has shape:

(batch_size, sequence_length, embedding_size) = (batch_size, 128, 768)

For parallel attention heads, this gets reshaped to:

(batch_size, 128, 64, 12), where 64 = 768 / 12 is the per-head dimension
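The reshape itself is a one-liner. Note that this follows the (seq, head_dim, heads) layout used in this post; most framework implementations instead reshape to (heads, seq, head_dim) before the attention matmuls:

```python
import numpy as np

x = np.zeros((1, 128, 768))                  # (batch, seq_len, embedding)

# Split the 768-dim embedding into 12 heads of 64 dims each, then move
# the head axis last to match the post's (seq, head_dim, heads) layout.
x_heads = x.reshape(1, 128, 12, 64).transpose(0, 1, 3, 2)
```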

Self-Attention Weight Matrices (Wq, Wk, Wv, Wo)

Each projection weight matrix is:

| Matrix | Shape |
| --- | --- |
| Wq (Query) | (768, 768) |
| Wk (Key) | (768, 768) |
| Wv (Value) | (768, 768) |
| Wo (Output) | (768, 768) |

Note that the weight matrices have no batch dimension -- the same weights are shared across every example in the batch.

Projected Q, K, V Matrices

After multiplying the input embeddings by the weight matrices:

| Matrix | Full Shape | Per-Head Shape |
| --- | --- | --- |
| Q (Query) | (batch_size, 128, 768) | (batch_size, 128, 64, 12) |
| K (Key) | (batch_size, 128, 768) | (batch_size, 128, 64, 12) |
| V (Value) | (batch_size, 128, 768) | (batch_size, 128, 64, 12) |

Feed-Forward Network Weights

| Layer | Shape |
| --- | --- |
| First dense layer | (768, 3072) |
| Second dense layer | (3072, 768) |

The FFN expands the representation from 768 to 3072 (a 4x expansion), applies GELU, then projects it back down to 768. This expand-then-compress pattern is standard in transformer architectures.


Computation Diagram: The Attention Block

X (128, 768) → Wq, Wk, Wv (768, 768) → Q, K, V (128, 64, 12) → Q·Kᵀ/√d (128, 128, 12) → softmax(scores) → · V (128, 64, 12) → concat + Wo (128, 768) → + residual → LayerNorm

Data flow through BERT's multi-head attention block, with tensor shapes at each step

Here is the full computation flow through the multi-head self-attention mechanism, using the notation Ed = 768, L = 128, h = 12, d = Ed/h = 64:

Multi-head Self-Attention computation flow showing X(L, Ed) through Wq, Wk, Wv projections, scaled dot-product attention, concatenation, and output projection with residual connection and layer norm


The step-by-step data flow:

  1. Input: X with shape (L, Ed) = (128, 768)
  2. Project: Multiply by Wq(Ed, d, h), Wk(Ed, d, h), Wv(Ed, d, h) to get Q(L, d, h), K(L, d, h), V(L, d, h)
  3. Transpose K: K becomes KT(h, d, L) for the dot product
  4. Scaled dot-product: Q · KT produces attention scores with shape (L, L, h)
  5. Scale: Divide by √d = √64 = 8
  6. Softmax: Apply softmax to get normalized attention weights, shape (L, L, h)
  7. Apply to values: Multiply scores by V to get zh(L, d, h)
  8. Concatenate heads: Merge the h dimension back to get z(L, Ed)
  9. Output projection: Multiply by Wo(Ed, Ed) to get out(L, Ed)
  10. Residual connection: Add input X to get y(L, Ed)
  11. Layer normalization: Normalize to get the final output(L, Ed)
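The eleven steps above can be run end-to-end in NumPy. This is a sketch with random weights, no training, and layer normalization simplified to plain standardization (no learned scale/shift). Heads are handled here as the leading axis, which is equivalent to the per-head shapes above:

```python
import numpy as np

L, Ed, h = 128, 768, 12
d = Ed // h                                        # 64 dims per head
rng = np.random.default_rng(0)

X = rng.standard_normal((L, Ed)) * 0.02            # step 1: input
Wq, Wk, Wv = (rng.standard_normal((Ed, Ed)) * 0.02 for _ in range(3))
Wo = rng.standard_normal((Ed, Ed)) * 0.02

def split(M):                                      # (L, Ed) -> (h, L, d)
    return M.reshape(L, h, d).transpose(1, 0, 2)

Q, K, V = split(X @ Wq), split(X @ Wk), split(X @ Wv)     # step 2
scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d)     # steps 3-5: (h, L, L)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)          # step 6: softmax rows
z = weights @ V                                    # step 7: (h, L, d)
z = z.transpose(1, 0, 2).reshape(L, Ed)            # step 8: concat heads
out = z @ Wo                                       # step 9: output projection
y = out + X                                        # step 10: residual
y = (y - y.mean(-1, keepdims=True)) / y.std(-1, keepdims=True)  # step 11
```

Every attention row sums to 1 after the softmax, and the output shape matches the input, which is what lets the blocks stack.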

The Feed-Forward Network

The FFN is a position-wise operation -- it processes each token independently through the same two-layer network.

Position-wise Feed Forward Network computation flow showing y(L, Ed) through linear projection, GELU, second linear projection, residual connection, and layer norm to output (128, 768)


The data flow, where Ed = 768, L = 128, and Fl = 3072:

  1. Input: y with shape (L, Ed) = (128, 768) -- this is the attention block output
  2. First linear transformation: W(Ed, Fl) projects to shape (L, Fl) = (128, 3072)
  3. GELU activation: Applied element-wise, shape unchanged at (L, Fl)
  4. Second linear transformation: W(Fl, Ed) projects back to shape (L, Ed) = (128, 768)
  5. Residual connection: Add the original input y
  6. Layer normalization: Final output shape (L, Ed) = (128, 768)

The abstraction is clean: linear projection, nonlinearity, linear projection. The 4x expansion to 3072 gives the network enough capacity to learn useful intermediate representations before compressing back down.
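The six FFN steps translate directly to NumPy. A sketch with random weights; the GELU here uses the tanh approximation, and layer norm is again simplified to plain standardization:

```python
import numpy as np

L, Ed, Fl = 128, 768, 3072
rng = np.random.default_rng(0)
y = rng.standard_normal((L, Ed)) * 0.02            # attention block output
W1 = rng.standard_normal((Ed, Fl)) * 0.02          # expand:   (768, 3072)
W2 = rng.standard_normal((Fl, Ed)) * 0.02          # compress: (3072, 768)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

hidden = gelu(y @ W1)                              # (128, 3072)
out = hidden @ W2 + y                              # project back + residual
out = (out - out.mean(-1, keepdims=True)) / out.std(-1, keepdims=True)
```

Because every row of `y` goes through the same `W1` and `W2`, the operation is position-wise: each of the 128 tokens is transformed independently and identically.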


Putting It All Together

The complete BERT encoder block chains attention and FFN with their respective residual connections and normalizations:

Complete BERT architecture computation flow from X(L, Ed) through attention, residual, layer norm, FFN, residual, layer norm to output (128, 768)


Input X(128, 768) flows through self-attention, gets added back (residual), normalized, then flows through the FFN, gets added back again (residual), and normalized one more time. The output is (128, 768) -- same shape as the input. Stack this 12 times and you have BERT Base.
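The shape invariant makes stacking trivial: each block maps (128, 768) to (128, 768), so the full model is just repeated application. A compact end-to-end sketch with random weights (no training, simplified layer norm):

```python
import numpy as np

L, Ed, h, Fl, n_blocks = 128, 768, 12, 3072, 12
rng = np.random.default_rng(0)

def layer_norm(x):
    return (x - x.mean(-1, keepdims=True)) / x.std(-1, keepdims=True)

def attention(x, Wq, Wk, Wv, Wo):
    d = Ed // h
    def split(M): return M.reshape(L, h, d).transpose(1, 0, 2)
    Q, K, V = split(x @ Wq), split(x @ Wk), split(x @ Wv)
    s = Q @ K.transpose(0, 2, 1) / np.sqrt(d)
    w = np.exp(s - s.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    z = (w @ V).transpose(1, 0, 2).reshape(L, Ed)
    return z @ Wo

def gelu(t):
    return 0.5 * t * (1 + np.tanh(np.sqrt(2 / np.pi) * (t + 0.044715 * t**3)))

def block(x):
    p = lambda *shape: rng.standard_normal(shape) * 0.02  # fresh weights
    x = layer_norm(x + attention(x, p(Ed, Ed), p(Ed, Ed), p(Ed, Ed), p(Ed, Ed)))
    return layer_norm(x + gelu(x @ p(Ed, Fl)) @ p(Fl, Ed))

x = rng.standard_normal((L, Ed))
for _ in range(n_blocks):        # 12 identically structured blocks
    x = block(x)                 # (128, 768) in, (128, 768) out
```

In a real model each block holds its own trained weights; the point of the sketch is only that the shapes compose.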


Counting FLOPs

Understanding the computational cost of BERT matters for deployment, hardware planning, and optimization. I count 2 FLOPs per multiply-accumulate: each product inside a dot product costs 1 multiplication and 1 addition.

Attention Block: Weight Matrix Multiplications

Multiplying the input embedding matrix by the weight matrices for Q, K, and V:

3 × 128 × 768 × 768 × 2 = 452,984,832 FLOPs

Attention Block: Multi-Head Self-Attention

The dot products within the attention mechanism across 12 heads:

Q · KT: 2 × 12 × 128 × 64 × 128 = 25,165,824 FLOPs
Scores · V: 2 × 12 × 128 × 64 × 128 = 25,165,824 FLOPs
Wo projection: 2 × 128 × 768 × 768 = 150,994,944 FLOPs
Total MHA: 201,326,592 FLOPs

Feed-Forward Network

Two dense layers with the 768 to 3072 expansion and back:

2 layers × 128 × 768 × 3072 × 2 = 1,207,959,552 FLOPs

Total Per Transformer Block

FFN: 1,207,959,552
+ MHA: 201,326,592
+ Weight matrices: 452,984,832
= 1,862,270,976 FLOPs per block

For the full 12-block BERT Base, multiply by 12: roughly 22.3 billion FLOPs per forward pass. The feed-forward network dominates the compute at about 65% of the total, which is a useful fact to know when you are thinking about where to optimize.
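The whole FLOP accounting fits in a few lines, which makes it easy to re-run for other sequence lengths or model sizes:

```python
L, Ed, h, Fl = 128, 768, 12, 3072
d = Ed // h                               # 64 dims per head
mac = 2                                   # 1 multiply + 1 add per MAC

qkv = 3 * L * Ed * Ed * mac               # Q, K, V projections
qk  = h * L * d * L * mac                 # Q @ K^T across 12 heads
sv  = h * L * L * d * mac                 # scores @ V across 12 heads
wo  = L * Ed * Ed * mac                   # output projection Wo
ffn = 2 * L * Ed * Fl * mac               # two dense layers

per_block  = qkv + qk + sv + wo + ffn     # 1,862,270,976 FLOPs
total_base = 12 * per_block               # ~22.3 billion FLOPs
ffn_share  = ffn / per_block              # ~0.65
```

Doubling the sequence length roughly doubles every term except the attention dot products (`qk`, `sv`), which quadruple -- at L = 128 they are still a small slice of the total.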


Historical Context: From RNNs to Transformers to BERT

Before BERT, the NLP world relied on Recurrent Neural Networks (RNNs). RNNs had an encoder that took the input sequence and the previous hidden state to output the next hidden state, and a decoder that generated words one at a time. They worked, but they had serious problems: sequential processing made them slow to train and impossible to parallelize, they required fixed-order input processing, and they struggled with long-range dependencies.

The 2017 "Attention is All You Need" paper changed everything. The Transformer replaced RNNs with self-attention, allowing the model to selectively attend to different parts of the input sequence regardless of position. This made it easier to model long-term dependencies and -- critically -- enabled parallel processing during training.


The original Transformer architecture (Vaswani et al., 2017). BERT uses only the encoder side (left).

BERT took this encoder architecture and showed that bidirectional pre-training -- reading text in both directions simultaneously -- produced dramatically better representations than left-to-right or right-to-left approaches alone.


BERT's architecture: input tokens (some masked) flow through the Transformer encoder to produce contextualized representations. From my CS199 independent study at UC Berkeley.

Beyond BERT: RoBERTa

It is worth mentioning RoBERTa, which proposed several key improvements to BERT's pre-training:

  - Training the model longer, with bigger batches, over more data
  - Removing the next sentence prediction objective
  - Training on longer sequences
  - Dynamically changing the masking pattern applied to the training data

RoBERTa outperformed BERT on most benchmark NLP tasks, showing that BERT was significantly undertrained and that the training recipe matters as much as the architecture.

This section draws from my CS199 Supervised Independent Study at UC Berkeley.


Key Takeaways