Scaling Foundation Models: From GPT-3 to GPT-4 and What Comes Next

By Hiva Mohammadzadeh | Stanford MSCS · UC Berkeley EECS

**TL;DR:** Scale changes everything. GPT-3 showed that a 175-billion-parameter model can perform tasks it was never explicitly trained on -- just from a handful of examples in the prompt. PaLM pushed to 540 billion parameters and broke through on multi-step reasoning. GPT-4 added multimodal input and hit human-level performance on professional benchmarks like the bar exam. These are not just bigger models doing the same thing slightly better. At each scale jump, qualitatively new capabilities emerge. This post covers the scaling trajectory from GPT-3 through PaLM to GPT-4, the opportunities foundation models unlock in healthcare, education, and finance, and the real risks -- bias, power concentration, and accountability gaps -- that come with them.

The Scaling Hypothesis

There is a pattern that has held remarkably consistently in deep learning over the past several years: make the model bigger, give it more data, and new capabilities appear. Not just incremental improvements -- genuinely new behaviors that smaller models could not do at all.

This is the scaling hypothesis, and the models covered in this post are its strongest evidence. GPT-2 had 1.5 billion parameters and could generate coherent paragraphs. GPT-3 scaled to 175 billion and could translate languages, answer trivia, and write code from a few prompt examples. PaLM reached 540 billion and solved multi-step reasoning problems. GPT-4 added vision and matched human professionals on standardized exams.


The scaling trajectory from GPT-2 to GPT-4. Circle size represents parameter count. At each scale threshold, qualitatively new capabilities emerge that were absent in smaller models.

Scale is not just "more of the same." At certain thresholds, language models exhibit emergent capabilities -- behaviors that are absent in smaller models and appear suddenly as parameters increase. Few-shot learning, chain-of-thought reasoning, and code generation all emerged this way.

GPT-3: The Few-Shot Revolution

GPT-3 was the model that made the scaling hypothesis impossible to ignore. At 175 billion parameters -- 10x more than any previous non-sparse language model -- it demonstrated something genuinely surprising: you could get useful task performance without any task-specific training data at all.

The key insight is few-shot learning. Instead of fine-tuning on thousands of labeled examples, you provide a handful of input-output demonstrations directly in the prompt. The model generalizes from these examples and produces correct outputs on new inputs. GPT-3 did this across language translation, question answering, arithmetic, and even basic programming tasks.

GPT-3 was pre-trained on a massive corpus -- books, articles, web pages -- to predict the next word in a text sequence. Same objective as GPT-2. The difference is pure scale. And that scale unlocked three distinct modes of operation:

| Mode | Prompt setup | Example | Gradient updates |
|------|--------------|---------|------------------|
| Zero-shot | Task description only, no examples | Task: Translate to French. Input: "Hello world" → Output: "Bonjour le monde" | 0 |
| Few-shot | A few input-output demonstrations in the prompt | dog → chien, cat → chat, then "bird" → "oiseau" | 0 |
| Fine-tuning | Full training on 10K+ labeled pairs; all weights updated | Produces a task-specific model | Many |

GPT-3's three modes of operation. The breakthrough: zero-shot and few-shot require no gradient updates -- the model generalizes from the prompt alone.
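The two prompt-based modes come down to string construction. A minimal sketch, with no model or API involved -- the function names and toy translation pairs are illustrative, matching the examples in the table above:

```python
# Sketch of GPT-3's zero-shot and few-shot modes as prompt construction.
# Fine-tuning, by contrast, would update model weights instead of the prompt.

def zero_shot_prompt(task: str, query: str) -> str:
    """Task description only: the model must infer the format from instructions."""
    return f"{task}\nInput: {query}\nOutput:"

def few_shot_prompt(task: str, examples: list[tuple[str, str]], query: str) -> str:
    """A handful of input-output demonstrations precede the real query."""
    demos = "\n".join(f"Input: {x}\nOutput: {y}" for x, y in examples)
    return f"{task}\n{demos}\nInput: {query}\nOutput:"

print(zero_shot_prompt("Translate English to French.", "Hello world"))
print(few_shot_prompt(
    "Translate English to French.",
    [("dog", "chien"), ("cat", "chat")],
    "bird",
))
```

Either string would be sent to the model as-is; the model's continuation after the final `Output:` is the answer, with zero gradient updates in both cases.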

The remarkable finding was how far zero-shot and few-shot could go. On SuperGLUE, GPT-3's few-shot performance approached that of fine-tuned models on several tasks -- without a single gradient update. It also demonstrated open-ended generation, producing coherent, contextually appropriate text unconstrained by specific task formats.

GPT-3's core contribution was not the architecture (it is essentially a scaled-up GPT-2). It was the demonstration that scaling up greatly improves task-agnostic, few-shot performance -- sometimes reaching competitiveness with fine-tuned models that had access to thousands of labeled examples.

This changed the economics of NLP. Instead of collecting labeled data and training a model for every task, you could use a single large model and steer it with prompts. The previous post on prompt tuning covers the techniques that evolved from this realization.


PaLM: 540 Billion Parameters and the Pathways System

If GPT-3 proved scale matters, PaLM proved it matters even more than people expected. PaLM (Pathways Language Model) is a 540-billion-parameter, dense, decoder-only Transformer that pushed the frontier of what a single language model can do.

What makes PaLM architecturally interesting is not just the parameter count. It is how the model was trained and what infrastructure made it possible.

The Pathways System

PaLM was trained with the Pathways system, which enables efficiently training a single model across multiple TPU v4 Pods. This was the first large-scale use of Pathways, scaling to 6,144 TPU chips -- the largest TPU-based system configuration at the time. The system uses a hierarchical architecture with data parallelism at the Pod level across two Cloud TPU v4 Pods, combined with standard data and model parallelism within each Pod.
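The hierarchy described above can be illustrated with a toy gradient average: one gradient shard per chip, an all-reduce within each pod, then a combine across pods. Shapes and counts here are illustrative stand-ins, not PaLM's actual configuration:

```python
# Toy two-level data parallelism: average gradients within each "pod",
# then average the pod-level results across pods. Equivalent to a single
# global average, but matches the communication hierarchy of the hardware.
import numpy as np

rng = np.random.default_rng(0)
n_pods, chips_per_pod, dim = 2, 4, 3  # illustrative sizes only
grads = rng.standard_normal((n_pods, chips_per_pod, dim))  # one shard per chip

within_pod = grads.mean(axis=1)        # all-reduce inside each pod
across_pods = within_pod.mean(axis=0)  # combine across pods

# Sanity check: identical to one flat average over all chips.
assert np.allclose(across_pods, grads.reshape(-1, dim).mean(axis=0))
```

The point of the two-level structure is that intra-pod links are much faster than cross-pod links, so most of the communication volume stays local.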

This is an infrastructure story as much as a model story. Training a 540-billion-parameter model requires a system that can coordinate thousands of accelerators efficiently, handle communication bottlenecks, and recover from hardware failures. Pathways solved this at unprecedented scale.

Architectural Choices

PaLM uses parallel layers to speed up training -- computing each block's attention and feed-forward components from the same input simultaneously rather than sequentially. It uses multi-query attention, in which all attention heads share a single key and value projection, to speed up inference. Other choices include SwiGLU activations and rotary position embeddings (RoPE); the pre-training objective itself remains standard autoregressive language modeling.
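The multi-query idea fits in a few lines of NumPy: every head gets its own query projection, but all heads share one key and one value projection, which shrinks the key/value cache that dominates decoding memory. Head counts and shapes below are toy values, not PaLM's:

```python
# Minimal multi-query attention sketch: n_heads query heads, ONE shared K/V head.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_query_attention(x, Wq, Wk, Wv, n_heads):
    seq, d_model = x.shape
    d_head = d_model // n_heads
    q = (x @ Wq).reshape(seq, n_heads, d_head)  # per-head queries
    k = x @ Wk                                  # single shared key head
    v = x @ Wv                                  # single shared value head
    out = np.empty((seq, n_heads, d_head))
    for h in range(n_heads):
        scores = q[:, h, :] @ k.T / np.sqrt(d_head)
        out[:, h, :] = softmax(scores) @ v      # every head attends to the same K/V
    return out.reshape(seq, d_model)

rng = np.random.default_rng(0)
d_model, n_heads, seq = 64, 8, 10
x = rng.standard_normal((seq, d_model))
Wq = rng.standard_normal((d_model, d_model)) * 0.1
Wk = rng.standard_normal((d_model, d_model // n_heads)) * 0.1  # one K head
Wv = rng.standard_normal((d_model, d_model // n_heads)) * 0.1  # one V head
y = multi_query_attention(x, Wq, Wk, Wv, n_heads)
print(y.shape)  # (10, 64)
```

During decoding, only `k` and `v` need to be cached per token, so the cache is a factor of `n_heads` smaller than with standard multi-head attention.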

Breakthrough Results

PaLM 540B did not just set new state-of-the-art numbers. It broke through on tasks that previous models fundamentally struggled with:

- **BIG-bench.** PaLM 540B outperformed the average human rater on a majority of BIG-bench tasks.
- **Multi-step reasoning.** With chain-of-thought prompting, PaLM matched or exceeded fine-tuned state-of-the-art on arithmetic and commonsense reasoning benchmarks such as GSM8K.

The reasoning result is particularly significant. Reasoning was widely considered beyond the reach of next-token prediction. PaLM showed that scale, combined with the right training recipe, could produce genuinely compositional reasoning behavior.


GPT-4: Multimodal and Human-Level

GPT-4 represents the next inflection point. It is a large-scale, multimodal model that accepts both image and text inputs and produces text outputs. The "multimodal" part is not a gimmick -- it fundamentally expands what a language model can do.

GPT-4 is a Transformer model pre-trained to predict the next token, using publicly available data and data licensed from third parties. In terms of training objective, it is the same as GPT-2 and GPT-3. But the capabilities are categorically different.
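That shared objective is simple enough to write down as a toy computation. A minimal sketch of next-token cross-entropy, assuming a three-token vocabulary and a random bigram table standing in for the network:

```python
# Next-token prediction loss (the shared pre-training objective of GPT-2/3/4):
# minimize the negative log-probability of each token given its prefix.
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

vocab = ["hello", "world", "<eos>"]
tokens = [0, 1, 2]  # "hello world <eos>"

rng = np.random.default_rng(0)
# Toy bigram "model": a logit row for the next token given the current one.
W = rng.standard_normal((len(vocab), len(vocab)))

nll = 0.0
for prev, nxt in zip(tokens, tokens[1:]):
    p = softmax(W[prev])
    nll -= np.log(p[nxt])  # cross-entropy on the observed next token
print(f"mean NLL: {nll / (len(tokens) - 1):.3f}")
```

A real model conditions on the whole prefix rather than one token, but the loss being minimized is exactly this sum.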

What Changed

Multimodal input. GPT-4 can accept images alongside text. This is not image captioning -- the model reasons about visual content. In a well-known example from the GPT-4 technical report, it identifies that chicken nuggets are arranged to look like a world map and explains why that is funny. This requires understanding the image content, recognizing the resemblance, and connecting it to the joke in the accompanying text.

Human-level professional performance. GPT-4 passes a simulated bar exam with a score around the top 10% of test-takers. It exhibits human-level performance across various professional and academic benchmarks -- consistent, not cherry-picked.

Longer context. GPT-4 can handle up to 25,000 words of input, enabling processing of entire documents, long conversations, and complex multi-part instructions.

Better programming. GPT-4 is significantly better than GPT-3 at following programming instructions, handling more complex code generation, debugging, and explanation tasks.

Steerability. Users can instruct GPT-4 to adopt specific response styles, personas, or constraints, making it far more useful as a general-purpose tool.

Safety alignment. Post-training alignment improves factuality and adherence to instructions. The model was trained to limit harmful responses -- not perfect, but a deliberate engineering effort to make the model safer.

GPT-4 is a significant improvement over GPT-3 across nearly every measured dimension: it outperforms previous models in English, and in many other languages it exceeds even their English-language performance. It handles longer prompts and more complex tasks, and produces more reliable outputs. The multimodal capability is not an add-on -- it is a signal that the next generation of foundation models will not be text-only.

Foundation Models: Opportunities

The term "foundation model" captures what these large-scale models have become: a base layer that entire applications and industries build on top of. The opportunities are real and span multiple domains.

Healthcare. Foundation models can process medical literature, assist with diagnosis, and support clinical decision-making. As I covered in the prompt tuning post, GPT-4 with MedPrompt achieved 90.2% on medical exam questions -- outperforming fine-tuned specialist models. The potential for research acceleration, drug discovery, and clinical documentation is substantial.

Education. Personalized tutoring, automated grading, and adaptive learning systems all become more feasible with models at this level. A model that can explain calculus, grade essays, and adapt its teaching style to individual students could transform access to quality education.

Finance. Sentiment analysis, automated report generation, risk assessment, and fraud detection all benefit from models that can reason about large volumes of text. Financial applications require high reliability, pushing the field toward better calibration and uncertainty estimation.

These are not speculative use cases. They are actively being deployed.


Foundation Models: Risks


Foundation models present enormous opportunities in healthcare, education, finance, and research -- but come with real risks around bias, power concentration, employment displacement, and accountability gaps.

The same capabilities that make foundation models powerful make them dangerous if deployed without care. The risks are concrete and well-documented.

Bias

Foundation models learn from their training data. If that data contains biases -- and it does, reflecting the internet and published text -- the model reproduces and sometimes amplifies them. Models have been shown to generate biased outputs across gender, race, religion, and other categories. In high-stakes domains like healthcare and criminal justice, biased outputs cause real harm.

Concentration of Power

Training a 540-billion-parameter model requires infrastructure that only a handful of organizations possess. This concentrates the ability to build and control foundation models in a small number of large tech companies. Smaller organizations, academic researchers, and developers in less-resourced countries are increasingly dependent on APIs controlled by these companies.

Employment Impact

Foundation models can automate tasks that previously required human expertise -- writing, translation, coding, analysis. The productivity gains are real, but so is the displacement. The economic impact will not be evenly distributed.

Accountability Gaps

When a foundation model produces harmful output, who is responsible? The model developer? The application builder? The end user? Current legal and regulatory frameworks do not have clear answers, and the technology is moving faster than policy.

The research community has called for increased transparency, accountability, and collaboration among researchers, policymakers, and industry stakeholders. Responsible development of foundation models is not a separate workstream -- it needs to be embedded in the engineering process from the start.

What the Scaling Trajectory Tells Us

Zoom out and look at the progression:

| Model | Year | Parameters | Key Capability |
|-------|------|------------|----------------|
| GPT-2 | 2019 | 1.5B | Coherent text generation |
| GPT-3 | 2020 | 175B | Few-shot and zero-shot learning |
| PaLM | 2022 | 540B | Multi-step reasoning, surpassing human baselines |
| GPT-4 | 2023 | Undisclosed | Multimodal input, human-level professional performance |
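The jumps in the table can be made concrete with a line of arithmetic (GPT-4 is omitted because its parameter count is undisclosed):

```python
# Scale multipliers between consecutive models in the table above.
models = [("GPT-2", 1.5e9), ("GPT-3", 175e9), ("PaLM", 540e9)]
for (a, pa), (b, pb) in zip(models, models[1:]):
    print(f"{a} -> {b}: {pb / pa:.0f}x parameters")
# GPT-2 -> GPT-3: 117x parameters
# GPT-3 -> PaLM: 3x parameters
```

Note how uneven the jumps are: the GPT-2-to-GPT-3 leap was over a hundredfold, while PaLM's breakthrough reasoning came from only about a threefold increase plus a better training recipe.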

Three patterns stand out.

First, emergent capabilities appear at scale thresholds. Few-shot learning was not a feature of GPT-2. It emerged at GPT-3 scale. Multi-step reasoning was weak in GPT-3. It became strong in PaLM. Multimodal understanding did not exist in text-only models. GPT-4 added it. Each scale jump produces capabilities that did not exist at the previous scale.

Second, the gap between few-shot and fine-tuned performance is closing. GPT-3 showed few-shot could compete with fine-tuning on some tasks. PaLM showed it could beat fine-tuned models on reasoning. The distinction between "general-purpose" and "task-specific" models is dissolving.

Third, infrastructure is becoming the bottleneck. PaLM required 6,144 TPU chips. Training runs at this scale cost millions of dollars and take weeks. The limiting factor is no longer algorithmic -- it is who can afford the compute. This has implications for who gets to participate in frontier research and who does not.

The open question is whether this trajectory continues. The optimistic view is that further scaling will produce further breakthroughs. The cautious view is that we are approaching data limitations (the internet is finite), compute limitations (chips are expensive), and fundamental limitations of next-token prediction as an objective. Both views have evidence.

What is clear is that the models we have today are already powerful enough to transform industries, displace workers, and concentrate power. The time to build robust governance, safety, and accountability frameworks is before the next capability jump, not after.


Key Takeaways

  1. Scale produces qualitatively new capabilities. GPT-3 at 175B parameters demonstrated few-shot learning that GPT-2 at 1.5B could not do. PaLM at 540B broke through on multi-step reasoning. These are not incremental improvements -- they are emergent behaviors.

  2. Few-shot learning changes the economics of NLP. Instead of collecting labeled data and fine-tuning for every task, a single large model can be steered with prompts. This collapses the cost of building task-specific systems.

  3. Infrastructure determines who can participate. Training PaLM required 6,144 TPU chips and the Pathways system. This level of compute is available to a handful of organizations. The scaling trajectory is concentrating frontier research in fewer and fewer hands.

  4. Multimodal is the next frontier. GPT-4 accepting image inputs is not a novelty feature. It signals that the next generation of foundation models will process text, images, audio, and video together. Applications built on text-only models will need to adapt.

  5. Foundation models are double-edged. Healthcare, education, and finance all stand to benefit enormously. But bias amplification, power concentration, employment disruption, and accountability gaps are real risks that require deliberate intervention -- not just technical fixes, but policy, governance, and institutional design.

  6. The gap between research and deployment is shrinking. GPT-3 was a research demonstration. GPT-4 is a product. The time between "interesting paper" and "deployed system affecting millions" is measured in months, not years.

  7. Alignment and safety are engineering problems, not afterthoughts. GPT-4's post-training alignment -- improving factuality, reducing harmful outputs, increasing steerability -- is as important as the pre-training. Building capable models is only half the job.

The models covered in this series -- from activation functions through BERT, GPT-2, T5, LLaMA, benchmarks, prompt tuning, and now scaling -- trace a clear arc. We went from building individual components to assembling architectures to scaling those architectures until new capabilities emerged. The question now is not "can we build more powerful models?" We can. The question is "what do we do with them?"


Based on my CS199 Supervised Independent Study at UC Berkeley and the original presentations I created in 2023.