Details of the LLaMA Model (Large Language Model Meta AI)

By Hiva Mohammadzadeh | Stanford MSCS · UC Berkeley EECS

**TL;DR:** LLaMA proves that you do not need the biggest model to get the best results -- you need a smaller model trained on more and better data for longer. It is a family of decoder-only transformers (7B to 65B parameters) trained entirely on publicly available datasets, designed to be cheap at inference. The key architectural changes over the original transformer: RMSNorm for pre-normalization, SwiGLU activation in the feed-forward network (which introduces a third weight matrix), and Rotary Positional Embeddings (RoPE) instead of absolute position encodings. These choices make LLaMA faster, more stable to train, and fully open-source compatible.

The LLaMA Philosophy: Open Data, Efficient Inference

The central insight behind LLaMA is counterintuitive: the best-performing language models are not the largest ones. They are smaller models trained on more data, better data, and for longer. This flips the scaling mindset that dominated the field for years. Instead of throwing more parameters at the problem, LLaMA asks: what if we throw more compute at training a reasonably sized model?

LLaMA is a collection of foundation language models ranging from 7B to 65B parameters. It works the way all autoregressive models work -- it takes a sequence of words as input and predicts the next word, recursively generating text one token at a time. What sets it apart is the training philosophy and the architectural refinements that make it efficient.

Two commitments define LLaMA. First, the model is trained exclusively on publicly available data: CommonCrawl, C4, GitHub, Wikipedia, and others. No proprietary datasets. This makes LLaMA fully compatible with open-sourcing, which is why it became the backbone of the open-source LLM ecosystem. Second, the goal is to make models that are cheaper at inference. A 13B model that matches GPT-3's performance at a fraction of the serving cost is more valuable in practice than a 175B model that requires a cluster to run.

LLaMA model family overview showing parameter counts and training data sources



Training: Pre-Training and Fine-Tuning

LLaMA's training follows the standard two-phase approach, but the details of the pre-training phase matter.

Pre-Training is unsupervised. The model is trained to predict the next token in a sequence given the tokens that precede it. The training data spans the 20 languages with the most speakers among languages written in the Latin and Cyrillic alphabets. The scale and diversity of the pre-training corpus is what allows smaller models to punch above their weight.

Fine-Tuning adapts the pre-trained model to specific downstream tasks by adding a task-specific output layer. The expensive pre-training happens once; fine-tuning is comparatively cheap and fast.
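As a sketch of the pre-training objective, here is a minimal next-token cross-entropy loss in NumPy. The function name and toy sizes are illustrative, not from the paper:

```python
import numpy as np

def next_token_loss(logits, token_ids):
    """Average cross-entropy of predicting token t+1 from the logits at step t.

    logits: (seq_len, vocab) unnormalized scores; token_ids: (seq_len,) ints.
    The logits at position t are scored against the token at position t+1.
    """
    shifted = logits[:-1]                # positions 0..L-2 predict tokens 1..L-1
    targets = token_ids[1:]
    z = shifted - shifted.max(axis=-1, keepdims=True)   # stable log-softmax
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(0)
loss = next_token_loss(rng.normal(size=(8, 100)), rng.integers(0, 100, size=8))
```

With random logits over a 100-token vocabulary the loss sits near ln(100) ≈ 4.6; pre-training is simply driving this number down over the whole corpus.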


Key Innovations

LLaMA does not invent a new architecture from scratch. It takes the original transformer and makes surgical modifications that improve training stability, convergence speed, and computational efficiency. Three changes stand out.


LLaMA's three key changes: Pre-normalization with RMSNorm (faster than LayerNorm), SwiGLU activation with 3 weight matrices in the FFN, and RoPE applied at every layer instead of absolute embeddings at input only.

RMSNorm: Pre-Normalization for Stability

The original transformer applies Layer Normalization after each sub-layer (post-norm). LLaMA switches to pre-normalization: it normalizes the input of each sub-layer before the computation, not the output. And instead of standard Layer Norm, it uses RMSNorm.

RMSNorm is a simplification of Layer Norm that drops the re-centering step entirely. Instead of computing both a mean and a variance, it computes only the root mean square of the activations across the feature dimensions, producing a single scalar per activation vector (one per token position). That scalar normalizes the activations, and a learnable gain is applied afterward:

RMS(a) = √( (1/n) Σᵢ aᵢ² )
āᵢ = (aᵢ / RMS(a)) · gᵢ

Why this matters in practice: RMSNorm is more effective than Layer Norm when your data has high variance, which is common in large-scale language modeling. It gives the model re-scaling invariance and implicit learning rate adaptation. And because it only needs one pass through the activations (no separate mean computation), it is computationally simpler and faster. At the scale LLaMA operates, that efficiency adds up.
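A minimal NumPy sketch of the formula above; the names `rms_norm` and `g` follow the equation, and the toy input is illustrative:

```python
import numpy as np

def rms_norm(a, g, eps=1e-6):
    """Normalize by the root mean square only -- no mean subtraction.

    a: (..., n) activations, g: (n,) learnable gain. One statistic (the RMS)
    is computed per activation vector, versus two (mean, variance) for LayerNorm.
    """
    rms = np.sqrt(np.mean(a * a, axis=-1, keepdims=True) + eps)
    return (a / rms) * g

x = np.array([[1.0, -2.0, 3.0, -4.0]])
y = rms_norm(x, np.ones(4))   # output has unit RMS, signs preserved
```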

RMSNorm normalization formula and computation flow


SwiGLU: A Better Activation Function

LLaMA replaces the standard ReLU (or GeLU) activation in the feed-forward network with SwiGLU -- a combination of the Swish activation function and the Gated Linear Unit (GLU).

The building blocks:

FFN_Swish(x, W1, W2) = Swish_1(xW1) W2
GLU(x, W, V, b, c) = σ(xW + b) ⊗ (xV + c)
FFN_SwiGLU(x, W, V, W2) = (Swish_1(xW) ⊗ xV) W2

SwiGLU is smoother than ReLU, which translates to better performance and faster convergence. The gating mechanism allows it to capture complex nonlinear relationships that a simple point-wise activation cannot. The practical consequence is that the feed-forward network now has three weight matrices instead of two -- more on this below, because it changes the dimension calculations significantly.
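These building blocks can be written out in a few lines of NumPy; the function names are mine, and the identity-matrix demo is only for illustration:

```python
import numpy as np

def swish(x, beta=1.0):
    """Swish_beta(x) = x * sigmoid(beta * x): smooth everywhere, and slightly
    negative for small negative inputs, unlike ReLU which is exactly zero."""
    return x / (1.0 + np.exp(-beta * x))

def glu(x, W, V, b, c):
    """GLU(x) = sigma(xW + b) ⊗ (xV + c): a learned sigmoid gate applied
    element-wise to a parallel linear path."""
    gate = 1.0 / (1.0 + np.exp(-(x @ W + b)))
    return gate * (x @ V + c)

# SwiGLU replaces the sigmoid gate of GLU with Swish_1 and drops the biases.
x = np.ones((1, 3))
g = glu(x, np.eye(3), np.eye(3), np.zeros(3), np.zeros(3))
```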

SwiGLU activation function formulation and comparison to ReLU


Rotary Positional Embeddings (RoPE)

LLaMA removes absolute positional embeddings entirely and replaces them with Rotary Positional Embeddings (RoPE) applied at every layer of the network.

RoPE encodes absolute positional information using a rotation matrix while naturally incorporating explicit relative position dependency in the self-attention formulation. The idea is elegant: instead of adding a position vector to the embeddings once at the input, you rotate the query and key vectors by an angle proportional to their position in the sequence:

f_{q,k}(x_m, m) = R(mθ) · W_{q,k} · x_m
where R(mθ) is the 2D rotation matrix [cos mθ, −sin mθ; sin mθ, cos mθ]

The rotation means that the dot product between any two position-encoded vectors depends only on their relative distance, not their absolute positions. This gives the model a natural sense of distance between tokens without the rigid absolute position encodings that limit generalization to longer sequences.
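A toy NumPy check of that property. For brevity every feature pair here shares one angle, whereas real RoPE uses a different frequency θ_i = 10000^(−2i/d) per pair:

```python
import numpy as np

def rope_rotate(v, pos, theta=0.1):
    """Rotate each (even, odd) feature pair of v by the angle pos * theta."""
    c, s = np.cos(pos * theta), np.sin(pos * theta)
    out = np.empty_like(v)
    out[0::2] = c * v[0::2] - s * v[1::2]
    out[1::2] = s * v[0::2] + c * v[1::2]
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=4), rng.normal(size=4)
# Attention score for positions (5, 2) equals the score for (13, 10):
# only the relative distance m - n = 3 matters.
s1 = rope_rotate(q, 5) @ rope_rotate(k, 2)
s2 = rope_rotate(q, 13) @ rope_rotate(k, 10)
```

Rotations also preserve vector norms, so RoPE never changes the magnitude of a query or key, only its direction.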

RoPE rotation matrix applied to Query and Key vectors at position m with angle mθ



The Architecture: Modified Transformer

LLaMA builds on the transformer from "Attention Is All You Need" with the three modifications above baked in; the released models use only a decoder-style stack (masked self-attention, no separate encoder), though the diagrams below apply the same modifications to both encoder and decoder blocks. The optimizer is AdamW. Beyond the architectural changes, the authors also apply several efficiency techniques: an efficient implementation of causal multi-head attention to reduce memory usage and runtime, activation checkpointing (saving the expensive activations from the forward pass so they do not need to be recomputed during the backward pass), and model and sequence parallelism to reduce memory consumption across devices.

LLaMA Encoder-Decoder architecture showing RMS Norm before Self-Attention, RMS Norm before Cross Attention, RMS Norm before Feed Forward, with RoPE embeddings at input


The pattern in each encoder or decoder block is: RMS Norm before every sub-layer (self-attention, cross-attention, feed-forward), with the SwiGLU activation inside the feed-forward network and RoPE applied to the attention queries and keys. Residual connections wrap each sub-layer as usual.


The SwiGLU Feed-Forward Network in Detail

SwiGLU Feed-Forward Network (3 weight matrices):

  y (128, 4096) → RMS Norm
  → W1: Linear(4096 → 11008) [gate path] and W3: Linear(4096 → 11008) [input path], in parallel
  → SwiGLU: Swish(gate) × input
  → W2: Linear(11008 → 4096)
  → + residual → output (128, 4096)

Hidden dim = 2/3 × 4 × 4096, rounded up to a multiple of 256 = 11008. The gate path (W1) controls information flow through the Swish activation.

This is where things get interesting from an implementation perspective. The feed-forward network in LLaMA is not your standard two-matrix FFN. Because SwiGLU requires a gating path, there are three weight matrices instead of two.

Here are the dimensions for the base configuration (sequence length L = 128, embedding dimension Ed = 4096):

The hidden dimension calculation is a detail worth memorizing:

hidden_dim = 2/3 × (4 × 4096) = 10922.67
→ rounded up to the next multiple of 256 = 11008

That factor of 2/3 is there specifically because SwiGLU adds a third weight matrix. Standard transformers use a hidden dimension of 4d (where d is the model dimension). SwiGLU's gating mechanism adds roughly 50% more parameters in the FFN, so the hidden dimension is scaled down by 2/3 to keep the total parameter count comparable. The rounding to a multiple of 256 is a hardware optimization -- it aligns matrix dimensions for efficient GPU computation.
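The rounding rule is easy to reproduce; this small helper (name assumed) mirrors the calculation above:

```python
def ffn_hidden_dim(dim, multiple_of=256):
    """2/3 of 4*dim, rounded up to the next multiple of `multiple_of`."""
    hidden = int(2 * (4 * dim) / 3)                        # 10922 for dim=4096
    return multiple_of * ((hidden + multiple_of - 1) // multiple_of)
```

For dim = 4096 this gives 11008; the same rule with dim = 5120 reproduces LLaMA-13B's hidden dimension of 13824.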

The feed-forward block works like this:

  1. The input y (shape L × Ed = 128 × 4096) goes through two parallel linear projections, both mapping from Ed to the hidden dimension Fl (4096 → 11008), using column parallelism.
  2. One path (the gate) applies the Swish activation. The other path passes through unchanged.
  3. The two paths are element-wise multiplied together -- this gating is what makes the combination SwiGLU.
  4. The result passes through a final linear projection from Fl back to Ed (11008 → 4096), producing the output (128 × 4096).
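The four steps can be traced with scaled-down dimensions that keep the same 2/3 · 4d ratio; the real LLaMA-7B shapes are noted in comments:

```python
import numpy as np

L, Ed, Fl = 8, 64, 176        # stand-ins for L=128, Ed=4096, Fl=11008
rng = np.random.default_rng(0)
y  = rng.normal(size=(L, Ed))
W1 = rng.normal(size=(Ed, Fl)) * 0.1   # gate path    (4096 -> 11008)
W3 = rng.normal(size=(Ed, Fl)) * 0.1   # input path   (4096 -> 11008)
W2 = rng.normal(size=(Fl, Ed)) * 0.1   # output proj. (11008 -> 4096)

gate = y @ W1                          # step 1: first parallel projection
gate = gate / (1.0 + np.exp(-gate))    # step 2: Swish on the gate path only
up   = y @ W3                          # step 1: second parallel projection
out  = (gate * up) @ W2                # steps 3-4: gate, project back to Ed
```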
Block diagram: Input → RMS Norm → two parallel Linear(4096, 11008) paths → SwiGLU on one path → element-wise multiply → Linear(11008, 4096) → output



Weight Matrix Dimensions

For the LLaMA-7B configuration with sequence length L = 128, embedding dimension Ed = 4096, 32 attention heads (h = 32), head dimension d = 128, and FFN hidden dimension Fl = 11008:

Input Embedding: - Shape: (batch_size, 128, 4096) - Split across heads: 4096/32 = 128 dimensions per head, i.e. (batch_size, 128, 32, 128)

Self-Attention Weight Matrices (Wq, Wk, Wv, Wo): - All four: (4096, 4096), shared across the batch

Query, Key, Value Matrices: - Full: (batch_size, 128, 4096) - Per head: (batch_size, 128, 128), for each of the 32 heads

Feed-Forward Network (three weight matrices):

| Matrix | Shape | Parallelism |
| --- | --- | --- |
| First dense (gate path) | (4096, 11008) | Column parallel |
| Third dense (input path) | (4096, 11008) | Column parallel |
| Second dense (output) | (11008, 4096) | Row parallel |

The fact that there are three weight matrices in the FFN instead of the standard two is the direct consequence of the SwiGLU gating mechanism. The first and third matrices create the two parallel paths; the second matrix projects back to the model dimension.
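The per-head shapes come from a plain reshape-and-transpose; a quick NumPy check (batch size 2 chosen for illustration):

```python
import numpy as np

batch, L, Ed, h = 2, 128, 4096, 32
d_head = Ed // h                                   # 4096 / 32 = 128
q = np.zeros((batch, L, Ed), dtype=np.float32)     # full query projection
q_heads = q.reshape(batch, L, h, d_head).transpose(0, 2, 1, 3)
# q_heads: (batch, heads, seq_len, head_dim) = (2, 32, 128, 128)
```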


Full Computation Flow

Encoder

The encoder processes the full input sequence with the following flow (L = 128, Ed = 4096, h = 32, d = 128, Fl = 11008):

  1. Input X (L, Ed) passes through RMS Norm.
  2. Multi-head self-attention with RoPE on queries and keys. The attention is computed with optimized memory usage (column-parallel projections for Q, K, V).
  3. Residual connection adds the attention output back to the input.
  4. The result passes through another RMS Norm.
  5. The SwiGLU feed-forward network: two column-parallel linear projections (4096 → 11008), SwiGLU gating, then a row-parallel projection (11008 → 4096).
  6. Residual connection adds the FFN output back.
Computation diagram: Full LLaMA encoder flow from X(L,Ed) through RMS Norm, self-attention with RoPE, residual, RMS Norm, SwiGLU FFN with parallel linear paths, residual, to output


Decoder

The decoder adds masked self-attention and cross-attention, all with pre-normalization:

  1. Input passes through RMS Norm → masked self-attention (a causal mask prevents attending to future tokens) → residual connection.
  2. The result passes through RMS Norm → cross-attention (queries come from the decoder output; keys and values come from the encoder hidden states) → residual connection.
  3. The result passes through RMS Norm → SwiGLU feed-forward network (same structure as the encoder FFN) → residual connection.
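Putting the pieces together, here is a toy single-head sketch of one pre-norm block with a causal mask. RoPE, multi-head splitting, and cross-attention are omitted (the released LLaMA models are decoder-only, so the masked path is the essential one), and all dimensions are illustrative:

```python
import numpy as np

def rms_norm(a, eps=1e-6):
    return a / np.sqrt(np.mean(a * a, axis=-1, keepdims=True) + eps)

def softmax(x):
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def llama_block(x, Wq, Wk, Wv, Wo, W1, W3, W2):
    """RMSNorm -> masked self-attention -> residual,
    then RMSNorm -> SwiGLU FFN -> residual (single head, no RoPE)."""
    L, d = x.shape
    h = rms_norm(x)
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    scores = (q @ k.T) / np.sqrt(d)
    scores[np.triu_indices(L, k=1)] = -np.inf    # causal mask: no future tokens
    x = x + (softmax(scores) @ v) @ Wo           # residual connection 1
    h = rms_norm(x)
    gate = h @ W1
    gate = gate / (1.0 + np.exp(-gate))          # Swish on the gate path
    return x + (gate * (h @ W3)) @ W2            # SwiGLU FFN + residual 2

rng = np.random.default_rng(0)
d, Fl, L = 16, 44, 5
weights = [rng.normal(size=s) * 0.1
           for s in [(d, d)] * 4 + [(d, Fl), (d, Fl), (Fl, d)]]
out = llama_block(rng.normal(size=(L, d)), *weights)
```

A useful sanity check: because of the causal mask, perturbing the last token leaves every earlier output row unchanged.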
Computation diagram: Full LLaMA decoder flow with Masked Self-Attention, Cross Attention (Q from decoder, K/V from encoder), and SwiGLU FFN, all with RMS Norm pre-normalization



Key Takeaways

Here is what I take away from the LLaMA architecture as a practitioner:

  - Scale the data, not just the model: a smaller model trained longer on more, better data can match a much larger one, and it is far cheaper to serve.
  - The architectural changes are small but deliberate: pre-norm RMSNorm for stability and speed, SwiGLU for smoother optimization (at the cost of a third FFN weight matrix), and RoPE for relative position information at every layer.
  - Training exclusively on publicly available data is what made LLaMA the backbone of the open-source LLM ecosystem.

This post is based on my presentation "Details of the LLaMA Model (Large Language Model Meta AI)" at Berkeley EECS. The original slides, including all architecture diagrams and computation flows, are available in Details_of_the_LLaMA_Model.pdf.