Scaling Foundation Models: From GPT-3 to GPT-4 and What Comes Next
By Hiva Mohammadzadeh | Stanford MSCS · UC Berkeley EECS
The Scaling Hypothesis
There is a pattern that has held remarkably consistently in deep learning over the past several years: make the model bigger, give it more data, and new capabilities appear. Not just incremental improvements -- genuinely new behaviors that smaller models could not do at all.
This is the scaling hypothesis, and the models covered in this post are its strongest evidence. GPT-2 had 1.5 billion parameters and could generate coherent paragraphs. GPT-3 scaled to 175 billion and could translate languages, answer trivia, and write code from a few prompt examples. PaLM reached 540 billion and solved multi-step reasoning problems. GPT-4 added vision and matched human professionals on standardized exams.
The scaling trajectory from GPT-2 to GPT-4. Circle size represents parameter count. At each scale threshold, qualitatively new capabilities emerge that were absent in smaller models.
GPT-3: The Few-Shot Revolution
GPT-3 was the model that made the scaling hypothesis impossible to ignore. At 175 billion parameters -- 10x more than any previous non-sparse language model -- it demonstrated something genuinely surprising: you could get useful task performance without any task-specific training data at all.
The key insight is few-shot learning. Instead of fine-tuning on thousands of labeled examples, you provide a handful of input-output demonstrations directly in the prompt. The model generalizes from these examples and produces correct outputs on new inputs. GPT-3 did this across language translation, question answering, arithmetic, and even basic programming tasks.
GPT-3 was pre-trained on a massive corpus -- books, articles, web pages -- to predict the next word in a text sequence. Same objective as GPT-2. The difference is pure scale. And that scale unlocked three distinct modes of operation:
- Zero-shot: Give the model a task description and an input. No examples. GPT-3 generates quality output on tasks it was never explicitly trained on.
- Few-shot: Provide a few input-output examples in the prompt. No gradient updates. The model picks up the pattern and applies it to new inputs.
- Fine-tuning: Traditional adaptation with task-specific training data and gradient updates to the model's parameters.
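The first two modes can be made concrete with prompt templates. Below is a minimal sketch using English-to-French translation as an illustrative task; `build_zero_shot` and `build_few_shot` are hypothetical helper names, and the call to an actual model is omitted.

```python
# Illustrative prompt construction for GPT-3-style zero-shot and few-shot use.
# The task (English-to-French) and function names are assumptions for this
# sketch; the model call that would consume these prompts is omitted.

def build_zero_shot(text: str) -> str:
    # Task description plus the input -- no demonstrations at all.
    return f"Translate English to French:\n{text} ->"

def build_few_shot(text: str, examples: list[tuple[str, str]]) -> str:
    # A handful of input-output demonstrations, then the new input.
    # No gradient updates: the "learning" happens entirely in-context.
    demos = "\n".join(f"{src} -> {tgt}" for src, tgt in examples)
    return f"Translate English to French:\n{demos}\n{text} ->"

print(build_few_shot("bird", [("cat", "chat"), ("dog", "chien")]))
```

Fine-tuning, by contrast, has no prompt trick: it requires a labeled dataset and gradient updates to the weights themselves.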
Zero-shot: "Hello world" → "Bonjour le monde". Few-shot: cat → chat, then "bird" → "oiseau". Fine-tuning: update all weights → task-specific model.
GPT-3's three modes of operation. The breakthrough: zero-shot and few-shot require no gradient updates -- the model generalizes from the prompt alone.
The remarkable finding was how far zero-shot and few-shot could go. GPT-3 set a new state of the art on LAMBADA in the zero-shot setting and beat fine-tuned open-domain models on TriviaQA with few-shot prompting alone, while approaching fine-tuned baselines on SuperGLUE without a single gradient update. It also demonstrated open-ended generation -- producing coherent, contextually appropriate text unconstrained by specific task formats.
This changed the economics of NLP. Instead of collecting labeled data and training a model for every task, you could use a single large model and steer it with prompts. The previous post on prompt tuning covers the techniques that evolved from this realization.
PaLM: 540 Billion Parameters and the Pathways System
If GPT-3 proved scale matters, PaLM proved it matters even more than people expected. PaLM (Pathways Language Model) is a 540-billion-parameter, dense, decoder-only Transformer that pushed the frontier of what a single language model can do.
What makes PaLM architecturally interesting is not just the parameter count. It is how the model was trained and what infrastructure made it possible.
The Pathways System
PaLM was trained with the Pathways system, which enables efficiently training a single model across multiple TPU v4 Pods. This was the first large-scale use of Pathways, scaling to 6,144 TPU chips -- the largest TPU-based system configuration at the time. The system uses a hierarchical architecture with data parallelism at the Pod level across two Cloud TPU v4 Pods, combined with standard data and model parallelism within each Pod.
This is an infrastructure story as much as a model story. Training a 540-billion-parameter model requires a system that can coordinate thousands of accelerators efficiently, handle communication bottlenecks, and recover from hardware failures. Pathways solved this at unprecedented scale.
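The two-level scheme described above can be sketched in plain Python: split the batch across pods, split each pod's share across chips, compute local gradients, then average within a pod before averaging across pods. The pod and chip counts below are toy values, and the "gradient" is a stand-in scalar; real systems all-reduce tensors over high-bandwidth interconnects.

```python
# Toy illustration of hierarchical data parallelism: pod-level data
# parallelism across pods, plus data parallelism across chips within a pod.
# Numbers of pods/chips and the scalar "gradient" are illustrative only.

def shard(batch, n):
    # Split a batch into n roughly equal contiguous shards.
    k, m = divmod(len(batch), n)
    return [batch[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n)]

def local_gradient(chip_shard):
    # Stand-in for a forward/backward pass on a single chip.
    return sum(chip_shard) / len(chip_shard)

def train_step(batch, n_pods=2, chips_per_pod=4):
    pod_grads = []
    for pod_batch in shard(batch, n_pods):                 # pod-level split
        chip_grads = [local_gradient(s) for s in shard(pod_batch, chips_per_pod)]
        pod_grads.append(sum(chip_grads) / len(chip_grads))  # within-pod all-reduce
    return sum(pod_grads) / len(pod_grads)                 # cross-pod exchange

print(train_step(list(range(16))))  # equals the full-batch mean when shards are equal
```

With equal-sized shards, the average-of-averages equals the full-batch gradient, which is why the hierarchy is mathematically transparent to training while drastically reducing cross-pod communication.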
Architectural Choices
PaLM uses parallel layers to speed up training -- computing the attention and feed-forward blocks from the same layer input rather than chaining them -- and multi-query attention to speed up inference. The training objective itself is standard autoregressive next-token prediction; other refinements include SwiGLU activations and rotary position embeddings.
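The parallel-layers change is easiest to see side by side. The sketch below replaces real attention and MLP sublayers with stand-in linear maps (an assumption, to keep the structural difference visible); only the wiring matches the PaLM formulation.

```python
import numpy as np

# Sequential vs. parallel transformer block wiring, with attention and the
# MLP replaced by stand-in linear maps W_attn and W_mlp for readability.

rng = np.random.default_rng(0)
d = 8
W_attn = rng.normal(size=(d, d)) * 0.1   # stand-in for self-attention
W_mlp = rng.normal(size=(d, d)) * 0.1    # stand-in for the feed-forward block

def layernorm(x):
    mu, sigma = x.mean(-1, keepdims=True), x.std(-1, keepdims=True)
    return (x - mu) / (sigma + 1e-5)

def sequential_block(x):
    # Standard formulation: the MLP consumes the attention output,
    # so the two sublayers must run one after the other.
    x = x + layernorm(x) @ W_attn
    return x + layernorm(x) @ W_mlp

def parallel_block(x):
    # Parallel formulation: one shared layernorm, two branches computed
    # from the same input, so both matmuls can run concurrently.
    h = layernorm(x)
    return x + h @ W_attn + h @ W_mlp
```

The outputs differ slightly (the MLP no longer sees the attention update), but at large scale the quality cost was found to be negligible while the training speedup is substantial.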
Breakthrough Results
PaLM 540B did not just set new state-of-the-art numbers. It broke through on tasks that previous models struggled with fundamentally:
- Multi-step reasoning: PaLM outperformed fine-tuned state-of-the-art models on multi-step reasoning benchmarks -- surpassing them using few-shot chain-of-thought prompting alone, with no gradient updates.
- BIG-bench: On a 58-task subset of BIG-bench -- a diverse suite of over 200 tasks designed to probe model capabilities -- few-shot PaLM outperformed the average score of humans asked to solve the same tasks.
- Efficiency: Despite being roughly 3x larger than GPT-3, PaLM trained at 46.2% model FLOPs utilization -- unusually high for a model of this scale -- thanks to Pathways and the architectural optimizations above.
The reasoning result is particularly significant. Reasoning was widely considered beyond the reach of next-token prediction. PaLM showed that scale, combined with the right training recipe, could produce genuinely compositional reasoning behavior.
GPT-4: Multimodal and Human-Level
GPT-4 represents the next inflection point. It is a large-scale, multimodal model that accepts both image and text inputs and produces text outputs. The "multimodal" part is not a gimmick -- it fundamentally expands what a language model can do.
GPT-4 is a Transformer model pre-trained to predict the next token, using publicly available data and data licensed from third parties. In terms of training objective, it is the same as GPT-2 and GPT-3. But the capabilities are categorically different.
What Changed
Multimodal input. GPT-4 can accept images alongside text. This is not image captioning -- the model reasons about visual content. In one example from the GPT-4 report, it identifies that chicken nuggets have been arranged to look like a world map and explains why that is funny. This requires understanding the image content, recognizing the resemblance, and connecting it to the joke in the accompanying text.
Human-level professional performance. GPT-4 passes a simulated bar exam with a score around the top 10% of test-takers. It exhibits human-level performance across various professional and academic benchmarks -- consistently, not cherry-picked.
Longer context. GPT-4 can handle up to 25,000 words of input, enabling processing of entire documents, long conversations, and complex multi-part instructions.
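The word figure maps onto the 32k-token GPT-4 variant via the common (and rough) rule of thumb of about 0.75 English words per token; the constant below is that heuristic, not a property of any specific tokenizer.

```python
# Back-of-the-envelope check on the context-length figure above.
WORDS_PER_TOKEN = 0.75  # rough heuristic; varies by text and tokenizer

def words_that_fit(context_tokens: int) -> int:
    return int(context_tokens * WORDS_PER_TOKEN)

print(words_that_fit(32_768))  # -> 24576, close to the ~25,000-word figure
```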
Better programming. GPT-4 is significantly better at following programming instructions than GPT-3, handling more complex code generation, debugging, and explanation tasks.
Steerability. Users can instruct GPT-4 to adopt specific response styles, personas, or constraints, making it far more useful as a general-purpose tool.
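In practice, steering is typically done by prepending an instruction in the widely used role/content chat format; the sketch below shows that shape only, with the model call omitted and the helper name chosen for illustration.

```python
# Sketch of steering a chat model via a system message. The role/content
# message format is the common chat-completions convention; the function
# name is illustrative and no API call is made here.

def steered_conversation(persona: str, user_input: str) -> list[dict]:
    return [
        {"role": "system", "content": persona},      # style/persona constraint
        {"role": "user", "content": user_input},     # the actual request
    ]

msgs = steered_conversation(
    "You are a Socratic tutor. Never give the answer directly; "
    "respond only with guiding questions.",
    "Why does the derivative of x^2 equal 2x?",
)
```

The same mechanism supports output-format constraints ("respond only in JSON") and tone requirements, which is what makes a single model usable across very different applications.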
Safety alignment. Post-training alignment improves factuality and adherence to instructions. The model was trained to limit harmful responses -- not perfect, but a deliberate engineering effort to make the model safer.
Foundation Models: Opportunities
The term "foundation model" captures what these large-scale models have become: a base layer that entire applications and industries build on top of. The opportunities are real and span multiple domains.
Healthcare. Foundation models can process medical literature, assist with diagnosis, and support clinical decision-making. As I covered in the prompt tuning post, GPT-4 with MedPrompt achieved 90.2% on the MedQA medical-exam benchmark -- outperforming fine-tuned specialist models. The potential for research acceleration, drug discovery, and clinical documentation is substantial.
Education. Personalized tutoring, automated grading, and adaptive learning systems all become more feasible with models at this level. A model that can explain calculus, grade essays, and adapt its teaching style to individual students could transform access to quality education.
Finance. Sentiment analysis, automated report generation, risk assessment, and fraud detection all benefit from models that can reason about large volumes of text. Financial applications require high reliability, pushing the field toward better calibration and uncertainty estimation.
These are not speculative use cases. They are actively being deployed.
Foundation Models: Risks
Foundation models present enormous opportunities in healthcare, education, finance, and research -- but come with real risks around bias, power concentration, employment displacement, and accountability gaps.
The same capabilities that make foundation models powerful make them dangerous if deployed without care. The risks are concrete and well-documented.
Bias
Foundation models learn from their training data. If that data contains biases -- and it does, reflecting the internet and published text -- the model reproduces and sometimes amplifies them. Models have been shown to generate biased outputs across gender, race, religion, and other categories. In high-stakes domains like healthcare and criminal justice, biased outputs cause real harm.
Concentration of Power
Training a 540-billion-parameter model requires infrastructure that only a handful of organizations possess. This concentrates the ability to build and control foundation models in a small number of large tech companies. Smaller organizations, academic researchers, and developers in less-resourced countries are increasingly dependent on APIs controlled by these companies.
Employment Impact
Foundation models can automate tasks that previously required human expertise -- writing, translation, coding, analysis. The productivity gains are real, but so is the displacement. The economic impact will not be evenly distributed.
Accountability Gaps
When a foundation model produces harmful output, who is responsible? The model developer? The application builder? The end user? Current legal and regulatory frameworks do not have clear answers, and the technology is moving faster than policy.
What the Scaling Trajectory Tells Us
Zoom out and look at the progression:
| Model | Year | Parameters | Key Capability |
|---|---|---|---|
| GPT-2 | 2019 | 1.5B | Coherent text generation |
| GPT-3 | 2020 | 175B | Few-shot and zero-shot learning |
| PaLM | 2022 | 540B | Multi-step reasoning, surpassing human baselines |
| GPT-4 | 2023 | Undisclosed | Multimodal input, human-level professional performance |
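The size of each jump in the table is worth computing directly (GPT-4's parameter count is undisclosed, so its row is omitted):

```python
# Parameter-count ratios between successive models in the table above.
# GPT-4 is excluded because its parameter count was not disclosed.
params = {"GPT-2": 1.5e9, "GPT-3": 175e9, "PaLM": 540e9}

print(round(params["GPT-3"] / params["GPT-2"]))    # -> 117 (GPT-2 to GPT-3)
print(round(params["PaLM"] / params["GPT-3"], 1))  # -> 3.1 (GPT-3 to PaLM)
```

Note the asymmetry: the GPT-2 to GPT-3 jump was over 100x, while GPT-3 to PaLM was only about 3x -- yet the smaller jump still produced the reasoning breakthrough, suggesting training recipe matters alongside raw scale.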
Three patterns stand out.
First, emergent capabilities appear at scale thresholds. Few-shot learning was not a feature of GPT-2. It emerged at GPT-3 scale. Multi-step reasoning was weak in GPT-3. It became strong in PaLM. Multimodal understanding did not exist in text-only models. GPT-4 added it. Each scale jump produces capabilities that did not exist at the previous scale.
Second, the gap between few-shot and fine-tuned performance is closing. GPT-3 showed few-shot could compete with fine-tuning on some tasks. PaLM showed it could beat fine-tuned models on reasoning. The distinction between "general-purpose" and "task-specific" models is dissolving.
Third, infrastructure is becoming the bottleneck. PaLM required 6,144 TPU chips. Training runs at this scale cost millions of dollars and take weeks. The limiting factor is no longer algorithmic -- it is who can afford the compute. This has implications for who gets to participate in frontier research and who does not.
The open question is whether this trajectory continues. The optimistic view is that further scaling will produce further breakthroughs. The cautious view is that we are approaching data limitations (the internet is finite), compute limitations (chips are expensive), and fundamental limitations of next-token prediction as an objective. Both views have evidence.
What is clear is that the models we have today are already powerful enough to transform industries, displace workers, and concentrate power. The time to build robust governance, safety, and accountability frameworks is before the next capability jump, not after.
Key Takeaways
- Scale produces qualitatively new capabilities. GPT-3 at 175B parameters demonstrated few-shot learning that GPT-2 at 1.5B could not do. PaLM at 540B broke through on multi-step reasoning. These are not incremental improvements -- they are emergent behaviors.
- Few-shot learning changes the economics of NLP. Instead of collecting labeled data and fine-tuning for every task, a single large model can be steered with prompts. This collapses the cost of building task-specific systems.
- Infrastructure determines who can participate. Training PaLM required 6,144 TPU chips and the Pathways system. This level of compute is available to a handful of organizations. The scaling trajectory is concentrating frontier research in fewer and fewer hands.
- Multimodal is the next frontier. GPT-4 accepting image inputs is not a novelty feature. It signals that the next generation of foundation models will process text, images, audio, and video together. Applications built on text-only models will need to adapt.
- Foundation models are double-edged. Healthcare, education, and finance all stand to benefit enormously. But bias amplification, power concentration, employment disruption, and accountability gaps are real risks that require deliberate intervention -- not just technical fixes, but policy, governance, and institutional design.
- The gap between research and deployment is shrinking. GPT-3 was a research demonstration. GPT-4 is a product. The time between "interesting paper" and "deployed system affecting millions" is measured in months, not years.
- Alignment and safety are engineering problems, not afterthoughts. GPT-4's post-training alignment -- improving factuality, reducing harmful outputs, increasing steerability -- is as important as the pre-training. Building capable models is only half the job.
The models covered in this series -- from activation functions through BERT, GPT-2, T5, LLaMA, benchmarks, prompt tuning, and now scaling -- trace a clear arc. We went from building individual components to assembling architectures to scaling those architectures until new capabilities emerged. The question now is not "can we build more powerful models?" We can. The question is "what do we do with them?"
Based on my CS199 Supervised Independent Study at UC Berkeley and the original presentations I created in 2023.