Survey of Normalization Techniques

Hiva Mohammadzadeh · Stanford MSCS · UC Berkeley EECS
Tags: Deep Learning, Training, Normalization
TL;DR: Normalization keeps your network's activations (or weights) in a well-behaved range so training stays fast and stable. Batch Norm works great for CNNs with large batches. Layer Norm is the default for Transformers and RNNs. RMSNorm is the cheaper, modern drop-in that skips re-centering. Group Norm is your escape hatch when batch sizes are small. Instance Norm shines in style transfer, and Weight Norm smooths the loss landscape by normalizing weights directly. Pick the one that matches your architecture and batch regime.

Why Normalization Matters

If you have ever watched a training run diverge at epoch 3 or plateau for hours, the culprit is often un-normalized activations. I keep coming back to a handful of reasons why normalization is one of the first things I reach for:

  1. It lets you train with higher learning rates without divergence.
  2. It reduces sensitivity to weight initialization.
  3. It keeps activations and gradients in a healthy range as networks get deeper.
  4. Some variants (Batch Norm especially) add a mild regularizing effect for free.

The short version: normalization accelerates and stabilizes learning, full stop. The longer version is choosing which normalization to use, and that is what the rest of this post is about.

The Big Picture: What Normalization Does

Every normalization technique follows roughly the same template:

output = (input - mean) / std · γ + β

You compute a mean and standard deviation over some set of dimensions, normalize, then let the network learn a scale (γ) and shift (β) to recover any representation it needs. The techniques differ only in which dimensions you compute those statistics over. That single design choice changes everything about when and where a method works well.
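To make the template concrete, here is a minimal pure-Python sketch. The function name and the scalar γ/β are illustrative simplifications; real layers learn per-feature parameters:

```python
import math

def normalize(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Generic normalization template: subtract the mean, divide by the
    standard deviation (with eps for stability), then apply a learned
    scale (gamma) and shift (beta). Which values go into `x` is the only
    thing the techniques in this post disagree on."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) * gamma + beta for v in x]

out = normalize([1.0, 2.0, 3.0, 4.0])
# With gamma=1, beta=0, the output has zero mean and (roughly) unit variance.
```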

Normalization dimensions at a glance

Layer Norm

Layer Norm normalizes each individual example across all of its features. For a single sample in the batch, you compute the mean and variance over the entire [C, H, W] volume (or, in a Transformer, across the hidden dimension).

yᵢ = (xᵢ − μ) / √(σ² + ε) · γᵢ + βᵢ

where μ and σ² are computed over all features i of the single example, and γᵢ, βᵢ are learned per feature.

What I like about Layer Norm:

  1. It has no dependence on batch size -- it behaves identically at batch size 1 and 1024.
  2. Training and test time use the exact same computation, so there is no train/test mismatch.
  3. It handles variable-length sequences naturally, which is why it fits RNNs and Transformers so well.

When to reach for it: Layer Norm is the default in Transformers and works well in RNNs. If you are building anything sequence-to-sequence, start here.
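A minimal pure-Python sketch of the per-example statistics (the function name and the toy batch are illustrative):

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Layer Norm: statistics are computed over ALL features of a single
    example, so the result is independent of what else is in the batch."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) * g + b
            for v, g, b in zip(x, gamma, beta)]

# Each example in the batch is normalized on its own; scaling an example
# by 10x barely changes its normalized output.
batch = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]
ones, zeros = [1.0] * 3, [0.0] * 3
normed = [layer_norm(x, ones, zeros) for x in batch]
```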
Layer Norm: per-example normalization

Batch Norm

Batch Norm goes the other direction: instead of normalizing within a single example, it normalizes across the mini-batch dimension for each feature channel. You subtract the batch mean and divide by the batch standard deviation, then apply learned γ and β per channel.

y = (x − μc) / √(σc² + ε) · γc + βc

where μc and σc² are computed across the mini-batch for each channel c.

Batch Norm eases optimization and enables very deep networks to converge. It also serves as a regularization technique because the per-batch statistics inject noise.

The problems I keep running into:

  1. Small batches: the batch statistics become noisy estimates, and accuracy degrades sharply as batch size shrinks.
  2. Train/test mismatch: training normalizes with batch statistics, inference with running averages, and the gap can cause subtle bugs.
  3. Variable-length sequences and RNNs: maintaining per-timestep batch statistics is awkward.
  4. Distributed training: statistics must either be synchronized across devices or computed per device.

Batch Norm error vs. batch size

Batch Norm algorithm recap: Given a mini-batch B = {x1 ... xm} and learnable parameters γ, β: (1) compute mini-batch mean, (2) compute variance, (3) normalize with a small stability constant ε, (4) scale and shift. Full algorithm on page 10 of the PDF.
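Those four steps, sketched for a single channel in pure Python, along with the running statistics that inference would use. The `running` dict and the momentum value are illustrative choices, not from the slides:

```python
import math

def batch_norm_train(batch, gamma, beta, running, momentum=0.1, eps=1e-5):
    """One Batch Norm training step for a single feature/channel:
    (1) batch mean, (2) batch variance, (3) normalize with eps,
    (4) scale and shift. Running statistics are updated so that
    inference can normalize without a batch."""
    m = len(batch)
    mean = sum(batch) / m                          # (1)
    var = sum((x - mean) ** 2 for x in batch) / m  # (2)
    running["mean"] = (1 - momentum) * running["mean"] + momentum * mean
    running["var"] = (1 - momentum) * running["var"] + momentum * var
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta  # (3) + (4)
            for x in batch]

running = {"mean": 0.0, "var": 1.0}
out = batch_norm_train([1.0, 2.0, 3.0, 4.0], gamma=1.0, beta=0.0,
                       running=running)
```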

Bottom line: Batch Norm is still the champion for CNN tasks with large, fixed-size batches. Outside that sweet spot, look elsewhere.

Group Norm

Group Norm is the batch-size-independent alternative I reach for whenever Batch Norm is not viable. It divides the channels of each training example into G groups and computes mean and variance within each group.

μi = (1/m) ∑k∈Si xk
σi = √( (1/m) ∑k∈Si (xk − μi)² + ε )

where Si is the set of m positions that share i's example index and channel group.

The key insight: because statistics are computed per-example, there is zero dependence on batch size. Whether you are running batch size 2 or 64, Group Norm gives you the same behavior.

Two special cases worth memorizing:

  1. G = 1 puts all channels in one group, and Group Norm reduces to Layer Norm.
  2. G = C gives every channel its own group, and Group Norm reduces to Instance Norm.

Here is the TensorFlow implementation straight from the original paper -- it is surprisingly compact. (Two practical notes: in TF 2.x the `keep_dims` argument is spelled `keepdims`, and `gamma` and `beta` are expected to have shape [1, C, 1, 1] so they broadcast per channel.)

def GroupNorm(x, gamma, beta, G, eps=1e-5):
    # x: [N, C, H, W]; gamma, beta: [1, C, 1, 1]
    N, C, H, W = x.shape
    # split the C channels into G groups of C // G channels each
    x = tf.reshape(x, [N, G, C // G, H, W])
    # statistics over each group's channels and all spatial positions
    mean, var = tf.nn.moments(x, [2, 3, 4], keep_dims=True)
    x = (x - mean) / tf.sqrt(var + eps)
    x = tf.reshape(x, [N, C, H, W])
    return x * gamma + beta

Group Norm: channel groups within each example

RMSNorm: The Efficient Modern Choice

RMSNorm is the normalization you will find inside LLaMA, Gemma, and most recent large language models. It is an extension of Layer Norm that drops the re-centering step entirely and normalizes by the root mean square of the activations instead.

RMS(a) = √( (1/n) ∑i ai² )
āi = (ai / RMS(a)) · gi

Why I prefer it for large-scale training:

  1. Dropping the mean subtraction makes each normalization step cheaper, and that adds up over hundreds of layers.
  2. In practice it delivers quality comparable to Layer Norm on Transformers.
  3. There is only a gain g to learn -- no β shift -- so it is slightly simpler and leaner.

Practical note: If you are fine-tuning any modern LLM (LLaMA, Mistral, Gemma), you are already using RMSNorm whether you realize it or not. Understanding it helps you debug training instabilities.
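The two formulas above fit in a few lines of Python. This is a sketch; the gain vector `g` and the eps default are illustrative:

```python
import math

def rms_norm(a, g, eps=1e-6):
    """RMSNorm: scale by the root mean square only. No mean subtraction
    and no beta shift, which is what makes it cheaper than Layer Norm."""
    rms = math.sqrt(sum(v * v for v in a) / len(a) + eps)
    return [v / rms * gi for v, gi in zip(a, g)]

out = rms_norm([3.0, 4.0], [1.0, 1.0])
# RMS of [3, 4] is sqrt((9 + 16) / 2) ≈ 3.5355, so out ≈ [0.8485, 1.1314]
```

Note that the output is not zero-mean -- RMSNorm only controls the scale of the activations, not their center.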

Instance Norm and Weight Norm

These two are more specialized, but worth knowing.

Instance Norm

Instance Norm is like Layer Norm, but it normalizes each channel independently within each training example, computing statistics over the spatial positions only. Like Layer Norm, the same formula applies at test time.

ytijk = (xtijk − μti) / √(σti² + ε)

where t indexes the example, i the channel, and (j, k) the spatial position; μti and σti² are computed over that channel's spatial positions.

The main use case: it makes the network agnostic to the contrast of the original image, which is why it became the default in style transfer and image generation tasks.
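A toy sketch of that contrast invariance (the channel layout and pixel values are illustrative):

```python
import math

def instance_norm(image, eps=1e-5):
    """Instance Norm: each channel of each example is normalized on its
    own, using statistics over that channel's spatial positions only.
    `image` is a list of channels, each a flat list of pixel values."""
    out = []
    for channel in image:
        mean = sum(channel) / len(channel)
        var = sum((p - mean) ** 2 for p in channel) / len(channel)
        out.append([(p - mean) / math.sqrt(var + eps) for p in channel])
    return out

# Rescaling a channel's contrast barely changes its normalized output,
# which is exactly the property style transfer wants.
a = instance_norm([[1.0, 2.0, 3.0, 4.0]])
b = instance_norm([[10.0, 20.0, 30.0, 40.0]])
```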

Weight Norm

Weight Norm takes a completely different approach -- it normalizes the weights of the layer rather than the activations. It separates the weight vector into a magnitude and a direction:

w = (g / ||v||) · v

This decoupling gives you a smoother loss landscape and more stable training. I have found it most useful in CNN tasks, often as a complement to other normalization methods.
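The reparameterization is one line of math and barely more code (a sketch; the vectors here are illustrative):

```python
import math

def weight_norm(v, g):
    """Weight Norm: reparameterize w = (g / ||v||) * v, decoupling the
    weight vector's magnitude (the scalar g, learned directly) from its
    direction (v / ||v||)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [g / norm * x for x in v]

w = weight_norm([3.0, 4.0], g=2.0)
# ||v|| = 5, so w ≈ [1.2, 1.6], and ||w|| equals g exactly.
```

Because the magnitude of w is always g regardless of v, gradient steps on v can only rotate the weight vector, which is where the smoother optimization comes from.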

Instance Norm and Weight Norm diagrams

MNIST Convergence: What I Saw in Practice

I ran Batch Norm, Layer Norm, Instance Norm, and Group Norm on MNIST to see how they compare on a simple task:

Training & validation error curves on MNIST

When to Use What

Here is my decision process:

  1. Transformer or RNN? Use Layer Norm (or RMSNorm if you want efficiency).
  2. CNN with large batches? Use Batch Norm.
  3. CNN with small/variable batches? Use Group Norm.
  4. Style transfer or generative images? Use Instance Norm.
  5. Want a smoother loss landscape on CNNs? Try Weight Norm.
  6. Training a large language model from scratch? Use RMSNorm.

Quick Reference

| Technique | Normalizes Over | Batch Dependent? | Best For | Test Time |
|---|---|---|---|---|
| **Batch Norm** | Mini-batch (N) per channel | Yes | CNNs, large batches | Running stats |
| **Layer Norm** | All features per example [C,H,W] | No | Transformers, RNNs | Same formula |
| **Group Norm** | Channel groups per example | No | CNNs, small batches | Same formula |
| **RMSNorm** | All features (RMS only) | No | LLMs, high-variance data | Same formula |
| **Instance Norm** | Per channel, per example | No | Style transfer | Same formula |
| **Weight Norm** | Weight vectors | No | CNNs, smooth optimization | Same formula |

This post is based on my presentation "Survey of Normalization Techniques" at Berkeley EECS. The original slides, including all diagrams and the full algorithm pseudocode, are available in Normalization_Techniques.pdf.