Deep Learning
from the Inside Out
Every architecture broken down to its weight matrices, computation diagrams, and FLOPs. No hand-waving -- just the math, the shapes, and the intuition.
Foundations
Start Here The Transformer: Where It All Started
The 2017 paper that launched everything. I trace the path from RNNs through seq2seq to the Transformer, explain self-attention and multi-head attention from scratch, and compare the three architecture families that every modern LLM builds on.
Read post
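As a taste of what the post walks through, here is a minimal NumPy sketch of scaled dot-product attention, the core operation behind self-attention (toy shapes, no learned projections or multiple heads):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V -- the heart of every Transformer layer."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n_q, n_k) similarity scores
    scores -= scores.max(axis=-1, keepdims=True)    # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V                              # weighted average of value rows

# Self-attention on 4 toy token vectors of width d_k = 8: Q = K = V = x.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (4, 8)
```

In a real Transformer, Q, K, and V come from three learned linear projections of x, and this computation runs once per head.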
Building Blocks A Survey of Activation Functions in Deep Learning
Activation functions are the nonlinearity that makes deep learning work. Every major function covered -- from the classics that started it all to the modern picks powering today's transformers.
Read post
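For a flavor of the survey, a quick sketch of one classic and two still-current activations (GELU shown in its common tanh approximation; learnable variants and exact forms are in the post):

```python
import numpy as np

def sigmoid(x):   # classic: squashes to (0, 1), saturates at the tails
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):      # the workhorse: zero for negatives, identity for positives
    return np.maximum(0.0, x)

def gelu(x):      # smooth ReLU variant used in BERT and GPT (tanh approximation)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))   # [0.  0.  0.  0.5 2. ]
print(gelu(x))   # tracks ReLU for large |x|, but is smooth through zero
```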
Building Blocks Normalization Techniques: From BatchNorm to RMSNorm
Normalization keeps your network from exploding during training. Every major technique surveyed, with practical guidance on when to use each one in modern architectures.
Read post
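The core difference between the two endpoints of that survey fits in a few lines. A sketch of LayerNorm versus RMSNorm over the feature axis (learnable gain and bias omitted for clarity):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # LayerNorm: subtract the mean, divide by the std over the feature axis
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def rms_norm(x, eps=1e-5):
    # RMSNorm: skip the mean subtraction, rescale by the root mean square only
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return x / rms

x = np.array([[1.0, 2.0, 3.0, 4.0]])
print(layer_norm(x))  # zero mean, unit variance per row
print(rms_norm(x))    # unit RMS per row; the mean is not re-centered
```

Dropping the mean subtraction is exactly what makes RMSNorm cheaper, which is why it shows up in LLaMA-style architectures.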
Evaluation GLUE and SuperGLUE: How We Measure Language Understanding
Before you can improve a model, you need to measure it. Every task in both benchmarks broken down with concrete examples, dataset sizes, and evaluation metrics.
Read post
Model Architectures
Encoder-Only BERT from the Inside Out: Architecture, Matrices, and FLOPs
BERT reads text bidirectionally. I trace the data flow from input embeddings through multi-head self-attention and feed-forward layers, with exact tensor dimensions and FLOP calculations at every step.
Read post
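To preview the FLOP bookkeeping, here is a back-of-envelope count for a single encoder layer, under the usual assumptions (2 FLOPs per multiply-add; softmax, norms, and biases ignored) and BERT-base-like shapes:

```python
def transformer_layer_flops(seq, d_model, d_ff):
    """Rough FLOPs for one encoder layer: 2 FLOPs per multiply-add,
    ignoring softmax, layer norms, and biases."""
    qkv_proj   = 3 * seq * d_model * d_model   # Q, K, V projections
    attn_score = seq * seq * d_model           # Q @ K^T
    attn_mix   = seq * seq * d_model           # weights @ V
    out_proj   = seq * d_model * d_model       # attention output projection
    ffn        = 2 * seq * d_model * d_ff      # two feed-forward matmuls
    return 2 * (qkv_proj + attn_score + attn_mix + out_proj + ffn)

# BERT-base-like shapes: 512 tokens, d_model = 768, d_ff = 3072
flops = transformer_layer_flops(512, 768, 3072)
print(f"{flops:.2e}")  # 8.05e+09 per layer
```

Note the feed-forward block dominates at this sequence length; the quadratic attention terms only take over for much longer sequences.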
Decoder-Only GPT-2 Internals: How a Decoder-Only Transformer Works
GPT-2 generates text one token at a time. I explain the causal mask, walk through the full computation diagram, and compare its FLOPs against BERT's encoder architecture.
Read post
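The causal mask itself is tiny. A sketch with uniform toy scores: setting masked positions to -inf before the softmax zeroes out all attention to future tokens:

```python
import numpy as np

seq = 5
# True above the diagonal: position i may not attend to positions j > i
mask = np.triu(np.ones((seq, seq), dtype=bool), k=1)

scores = np.zeros((seq, seq))            # stand-in attention scores
scores[mask] = -np.inf                   # masked entries become -inf pre-softmax
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))
# Row i is uniform over positions 0..i and exactly zero to the right:
# row 0 -> [1.0, 0, 0, 0, 0], row 1 -> [0.5, 0.5, 0, 0, 0], ...
```

This is why a decoder-only model can be trained on all positions in parallel yet still generate strictly left to right.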
Encoder-Decoder T5: The Text-to-Text Transformer, End to End
T5 treats every NLP task as text-to-text. Encoder self-attention, decoder masked attention, and the cross-attention bridge between them -- the key piece that gives T5 the best of both worlds.
Read post
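Cross-attention is ordinary attention with one twist: queries come from the decoder while keys and values come from the encoder. A minimal sketch with toy shapes:

```python
import numpy as np

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
enc_out = rng.normal(size=(6, 8))   # 6 encoder positions (the source text)
dec_h   = rng.normal(size=(3, 8))   # 3 decoder positions (generated so far)

# Queries from the decoder, keys/values from the encoder: the output has
# one row per decoder position, each a mix of encoder information.
out = attention(dec_h, enc_out, enc_out)
print(out.shape)  # (3, 8)
```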
Efficient LLM LLaMA's Efficiency Playbook: RMSNorm, SwiGLU, and RoPE
LLaMA showed that smaller models trained on more data can beat larger ones. Three key innovations -- RMSNorm, SwiGLU, and rotary position embeddings -- traced through the full computation flow.
Read post
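Of the three, SwiGLU is the easiest to show in isolation: a SiLU-gated branch elementwise-multiplies a linear branch before the down-projection. A sketch with toy weights (the real block uses learned matrices and no biases):

```python
import numpy as np

def silu(x):                       # SiLU ("swish"): x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, W_gate, W_up, W_down):
    # SwiGLU feed-forward: gate branch * linear branch, then project back down
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down

rng = np.random.default_rng(0)
d_model, d_ff = 8, 16
x      = rng.normal(size=(4, d_model))
W_gate = rng.normal(size=(d_model, d_ff)) * 0.1
W_up   = rng.normal(size=(d_model, d_ff)) * 0.1
W_down = rng.normal(size=(d_ff, d_model)) * 0.1
print(swiglu_ffn(x, W_gate, W_up, W_down).shape)  # (4, 8)
```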
Techniques
Prompting Prompt Tuning: From Hard Prompts to Chain-of-Thought
The full evolution from expensive fine-tuning to parameter-efficient prompting: hard prompts, soft prompts, and ensembling, with a case study where MedPrompt pushed GPT-4 to 90.2% on medical QA.
Read post
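The soft-prompt idea reduces to a concatenation: a handful of trainable continuous vectors are prepended to the frozen model's token embeddings, and only those vectors are updated during training. A sketch with toy shapes (the embeddings here are random stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_soft, n_tokens = 16, 4, 10

# Soft prompt: n_soft continuous vectors -- the ONLY trainable parameters.
soft_prompt = rng.normal(size=(n_soft, d_model)) * 0.02

# Stand-in for the frozen model's embeddings of the actual input tokens.
token_embeds = rng.normal(size=(n_tokens, d_model))

# The model sees the soft prompt as extra "virtual tokens" at the front.
model_input = np.concatenate([soft_prompt, token_embeds], axis=0)
print(model_input.shape)  # (14, 16)
```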
Scaling From GPT-3 to GPT-4: Scaling Laws, Few-Shot Learning, and Foundation Models
What happens when you scale a decoder-only transformer from millions to hundreds of billions of parameters? Few-shot learning emerges, multimodal capabilities appear, and entirely new risks surface. The capstone post covering GPT-3, PaLM, GPT-4, and the foundation model debate.
Read post
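For a sense of the compute involved, the widely cited rule of thumb from the scaling-law literature puts total training compute at roughly C ≈ 6ND for N parameters and D training tokens. A sketch with illustrative GPT-3-scale numbers (175B parameters, ~300B tokens):

```python
def training_flops(n_params, n_tokens):
    # Rule of thumb: C ~= 6 * N * D
    # (forward pass ~2ND FLOPs, backward pass roughly twice that)
    return 6 * n_params * n_tokens

# GPT-3-scale illustration: 175B parameters, 300B training tokens
print(f"{training_flops(175e9, 300e9):.2e}")  # 3.15e+23
```

Numbers like this are why the post treats scale itself as the experimental variable: each order of magnitude in C is a qualitatively different training run.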