Deep Learning
from the Inside Out
Every architecture broken down to its weight matrices, computation diagrams, and FLOPs. No hand-waving -- just the math, the shapes, and the intuition.
Foundations
Start Here The Transformer: Where It All Started
The 2017 paper that launched everything. I trace the path from RNNs through seq2seq to the Transformer, explain self-attention and multi-head attention from scratch, and compare the three architecture families that every modern LLM builds on.
Read post
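As a taste of what the post walks through, here is a minimal NumPy sketch of scaled dot-product attention, the core operation behind self-attention (toy shapes, no learned projections or multiple heads):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V -- the heart of every Transformer layer."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n_q, n_k) similarity scores
    scores -= scores.max(axis=-1, keepdims=True)    # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V                              # weighted average of value rows

# Self-attention on 4 toy token vectors of width d_k = 8: Q = K = V = x.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (4, 8)
```

In a real Transformer, Q, K, and V come from three learned linear projections of x, and this computation runs once per head.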
Building Blocks A Survey of Activation Functions in Deep Learning
Activation functions are the nonlinearity that makes deep learning work. Every major function covered -- from the classics that started it all to the modern picks powering today's transformers.
Read post
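For a flavor of the survey, a quick sketch of one classic and two still-current activations (GELU shown in its common tanh approximation; learnable variants and exact forms are in the post):

```python
import numpy as np

def sigmoid(x):   # classic: squashes to (0, 1), saturates at the tails
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):      # the workhorse: zero for negatives, identity for positives
    return np.maximum(0.0, x)

def gelu(x):      # smooth ReLU variant used in BERT and GPT (tanh approximation)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))   # [0.  0.  0.  0.5 2. ]
print(gelu(x))   # tracks ReLU for large |x|, but is smooth through zero
```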
Building Blocks Normalization Techniques: From BatchNorm to RMSNorm
Normalization keeps your network from exploding during training. Every major technique surveyed, with practical guidance on when to use each one in modern architectures.
Read post
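The core difference between the two endpoints of that survey fits in a few lines. A sketch of LayerNorm versus RMSNorm over the feature axis (learnable gain and bias omitted for clarity):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # LayerNorm: subtract the mean, divide by the std over the feature axis
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def rms_norm(x, eps=1e-5):
    # RMSNorm: skip the mean subtraction, rescale by the root mean square only
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return x / rms

x = np.array([[1.0, 2.0, 3.0, 4.0]])
print(layer_norm(x))  # zero mean, unit variance per row
print(rms_norm(x))    # unit RMS per row; the mean is not re-centered
```

Dropping the mean subtraction is exactly what makes RMSNorm cheaper, which is why it shows up in LLaMA-style architectures.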
Evaluation GLUE and SuperGLUE: How We Measure Language Understanding
Before you can improve a model, you need to measure it. Every task in both benchmarks broken down with concrete examples, dataset sizes, and evaluation metrics.
Read post
Model Architectures
Encoder-Only BERT from the Inside Out: Architecture, Matrices, and FLOPs
BERT reads text bidirectionally. I trace the data flow from input embeddings through multi-head self-attention and feed-forward layers, with exact tensor dimensions and FLOP calculations at every step.
Read post
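To preview the FLOP bookkeeping, here is a back-of-envelope count for a single encoder layer, under the usual assumptions (2 FLOPs per multiply-add; softmax, norms, and biases ignored) and BERT-base-like shapes:

```python
def transformer_layer_flops(seq, d_model, d_ff):
    """Rough FLOPs for one encoder layer: 2 FLOPs per multiply-add,
    ignoring softmax, layer norms, and biases."""
    qkv_proj   = 3 * seq * d_model * d_model   # Q, K, V projections
    attn_score = seq * seq * d_model           # Q @ K^T
    attn_mix   = seq * seq * d_model           # weights @ V
    out_proj   = seq * d_model * d_model       # attention output projection
    ffn        = 2 * seq * d_model * d_ff      # two feed-forward matmuls
    return 2 * (qkv_proj + attn_score + attn_mix + out_proj + ffn)

# BERT-base-like shapes: 512 tokens, d_model = 768, d_ff = 3072
flops = transformer_layer_flops(512, 768, 3072)
print(f"{flops:.2e}")  # 8.05e+09 per layer
```

Note the feed-forward block dominates at this sequence length; the quadratic attention terms only take over for much longer sequences.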
Decoder-Only GPT-2 Internals: How a Decoder-Only Transformer Works
GPT-2 generates text one token at a time. I explain the causal mask, walk through the full computation diagram, and compare its FLOPs against BERT's encoder architecture.
Read post
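The causal mask itself is tiny. A sketch with uniform toy scores: setting masked positions to -inf before the softmax zeroes out all attention to future tokens:

```python
import numpy as np

seq = 5
# True above the diagonal: position i may not attend to positions j > i
mask = np.triu(np.ones((seq, seq), dtype=bool), k=1)

scores = np.zeros((seq, seq))            # stand-in attention scores
scores[mask] = -np.inf                   # masked entries become -inf pre-softmax
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))
# Row i is uniform over positions 0..i and exactly zero to the right:
# row 0 -> [1.0, 0, 0, 0, 0], row 1 -> [0.5, 0.5, 0, 0, 0], ...
```

This is why a decoder-only model can be trained on all positions in parallel yet still generate strictly left to right.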
Encoder-Decoder T5: The Text-to-Text Transformer, End to End
T5 treats every NLP task as text-to-text. Encoder self-attention, decoder masked attention, and the cross-attention bridge between them -- the key piece that gives T5 the best of both worlds.
Read post
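Cross-attention is ordinary attention with one twist: queries come from the decoder while keys and values come from the encoder. A minimal sketch with toy shapes:

```python
import numpy as np

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
enc_out = rng.normal(size=(6, 8))   # 6 encoder positions (the source text)
dec_h   = rng.normal(size=(3, 8))   # 3 decoder positions (generated so far)

# Queries from the decoder, keys/values from the encoder: the output has
# one row per decoder position, each a mix of encoder information.
out = attention(dec_h, enc_out, enc_out)
print(out.shape)  # (3, 8)
```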
Efficient LLM LLaMA's Efficiency Playbook: RMSNorm, SwiGLU, and RoPE
LLaMA showed that smaller models trained on more data can beat larger ones. Three key innovations -- RMSNorm, SwiGLU, and rotary position embeddings -- traced through the full computation flow.
Read post
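Of the three, SwiGLU is the easiest to show in isolation: a SiLU-gated branch elementwise-multiplies a linear branch before the down-projection. A sketch with toy weights (the real block uses learned matrices and no biases):

```python
import numpy as np

def silu(x):                       # SiLU ("swish"): x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, W_gate, W_up, W_down):
    # SwiGLU feed-forward: gate branch * linear branch, then project back down
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down

rng = np.random.default_rng(0)
d_model, d_ff = 8, 16
x      = rng.normal(size=(4, d_model))
W_gate = rng.normal(size=(d_model, d_ff)) * 0.1
W_up   = rng.normal(size=(d_model, d_ff)) * 0.1
W_down = rng.normal(size=(d_ff, d_model)) * 0.1
print(swiglu_ffn(x, W_gate, W_up, W_down).shape)  # (4, 8)
```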
Techniques
Prompting Prompt Tuning: From Hard Prompts to Chain-of-Thought
The full evolution from expensive fine-tuning to parameter-efficient prompting: hard prompts, soft prompts, and ensembling, with a case study where MedPrompt pushed GPT-4 to 90.2% on medical QA.
Read post
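The soft-prompt idea reduces to a concatenation: a handful of trainable continuous vectors are prepended to the frozen model's token embeddings, and only those vectors are updated during training. A sketch with toy shapes (the embeddings here are random stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_soft, n_tokens = 16, 4, 10

# Soft prompt: n_soft continuous vectors -- the ONLY trainable parameters.
soft_prompt = rng.normal(size=(n_soft, d_model)) * 0.02

# Stand-in for the frozen model's embeddings of the actual input tokens.
token_embeds = rng.normal(size=(n_tokens, d_model))

# The model sees the soft prompt as extra "virtual tokens" at the front.
model_input = np.concatenate([soft_prompt, token_embeds], axis=0)
print(model_input.shape)  # (14, 16)
```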
Scaling From GPT-3 to GPT-4: Scaling Laws, Few-Shot Learning, and Foundation Models
What happens when you scale a decoder-only transformer from millions to hundreds of billions of parameters? Few-shot learning emerges, multimodal capabilities appear, and entirely new risks surface. The capstone post covering GPT-3, PaLM, GPT-4, and the foundation model debate.
Read post
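For a sense of the compute involved, the widely cited rule of thumb from the scaling-law literature puts total training compute at roughly C ≈ 6ND for N parameters and D training tokens. A sketch with illustrative GPT-3-scale numbers (175B parameters, ~300B tokens):

```python
def training_flops(n_params, n_tokens):
    # Rule of thumb: C ~= 6 * N * D
    # (forward pass ~2ND FLOPs, backward pass roughly twice that)
    return 6 * n_params * n_tokens

# GPT-3-scale illustration: 175B parameters, 300B training tokens
print(f"{training_flops(175e9, 300e9):.2e}")  # 3.15e+23
```

Numbers like this are why the post treats scale itself as the experimental variable: each order of magnitude in C is a qualitatively different training run.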