Vincent Vatter, Author at Towards Data Science

Avoidable and Unavoidable Randomness in GPT-4o

Vincent Vatter — Mon, 03 Mar 2025 12:00:00 +0000

Of course there is randomness in GPT-4o’s outputs. After all, the model samples from a probability distribution when choosing each token. But what I didn’t understand was that those very probabilities themselves are not deterministic. Even with consistent prompts, fixed seeds, and temperature set to zero, GPT-4o still introduces subtle, frustrating randomness.

There’s no fix for this, and it might not even be something OpenAI could fix if they wanted to, just so we’re clear up front about where this article is headed. Along the way, we’ll examine all the sources of randomness in GPT-4o output, which will require us to break down the sampling process to a low level. We’ll point at the issue—the probabilities vary—and critically examine OpenAI’s official guidance on determinism.

First, though, let’s talk about why determinism matters. Determinism means that the same input always produces the same output, like a mathematical function. While LLM creativity is often desirable, determinism serves crucial purposes: researchers need it for reproducible experiments, developers for verifying reported results, and prompt engineers for debugging their changes. Without it, you’re left wondering if different outputs stem from your tweaks or just the random number generator’s mood swings.

Flipping a coin

We’re going to keep things extremely simple here and prompt the most recent version of GPT-4o (gpt-4o-2024-08-06 in the API) with this:

Flip a coin. Return Heads or Tails only.

Flipping a coin with LLMs is a fascinating topic in itself (see for example Van Koevering & Kleinberg, 2024 in the references), but here, we’ll use it as a simple binary question with which to explore determinism, or the lack thereof.

This is our first attempt.

import os
from openai import OpenAI
client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

prompt = 'Flip a coin. Return Heads or Tails only.'

response = client.chat.completions.create(
    model='gpt-4o-2024-08-06',
    messages=[{'role': 'user', 'content': prompt}],
)

print(response.choices[0].message.content)

Running the code gave me Heads. Maybe you’ll get Tails, or if you’re really lucky, something far more interesting.

The code first initializes an OpenAI client with an API key set in the environment variable OPENAI_API_KEY (to avoid sharing billing credentials here). The main action happens with client.chat.completions.create, where we specify the model to use and send the prompt (as a part of a very simple conversation named messages) to the server. We get an object called response back from the server. This object contains a lot of information, as shown below, so we need to dig into it to extract GPT-4o’s actual response to the message, which is response.choices[0].message.content.

>>> response
ChatCompletion(id='chatcmpl-B48EqZBLfUWtp9H7cwnchGTJbBDwr', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='Heads', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=None))], created=1740324680, model='gpt-4o-2024-08-06', object='chat.completion', service_tier='default', system_fingerprint='fp_eb9dce56a8', usage=CompletionUsage(completion_tokens=2, prompt_tokens=18, total_tokens=20, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=0)))

Now let’s flip the coin ten times. If this were a real, fair coin, of course, we would expect roughly equal heads and tails over time thanks to the law of large numbers. But GPT-4o’s coin doesn’t work quite like that.

import os
from openai import OpenAI
client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

prompt = 'Flip a coin. Return Heads or Tails only.'

for _ in range(10):
    response = client.chat.completions.create(
        model='gpt-4o-2024-08-06',
        messages=[{'role': 'user', 'content': prompt}],
    )
    print(response.choices[0].message.content)

Running this code gave me the following output, although you might get different output, of course.

Heads
Heads
Heads
Heads
Heads
Heads
Tails
Heads
Heads
Heads

GPT-4o’s coin is clearly biased, but so are humans. Bar-Hillel, Peer, and Acquisti (2014) found that people flipping imaginary coins choose “heads” 80% of the time. Maybe GPT-4o learned that from us. But whatever the reason, we’re just using this simple example to explore determinism.

Just how biased is GPT-4o’s coin?

Let’s say we wanted to know precisely what percentage of GPT-4o coin flips land Heads.

Rather than the obvious (but expensive) approach of flipping it a million times, there’s a smarter way. For classification tasks with a small set of possible answers, we can extract token probabilities instead of generating full responses. With the right prompt, the first token carries all the necessary information, making these API calls incredibly cheap: around 30,000 calls per dollar, since each requires just 18 (cached) input tokens and 1 output token.

OpenAI gives us (natural) log probabilities. These are called logprobs in the code, and we convert them to regular probabilities by exponentiation. (We’ll discuss temperature soon, but note that exponentiating logprobs directly like this corresponds to a temperature setting of 1.0, and is how we calculate probabilities throughout this article). OpenAI lets us request logprobs for the top 20 most likely tokens, so we do that.

import os
import math
from openai import OpenAI
from tabulate import tabulate

client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

prompt = 'Flip a coin. Return Heads or Tails only.'

response = client.chat.completions.create(
    model='gpt-4o-2024-08-06',
    max_tokens=1,
    logprobs=True,
    top_logprobs=20,
    messages=[{'role': 'user', 'content': prompt}],
)

logprobs_list = response.choices[0].logprobs.content[0].top_logprobs

data = []
total_pct = 0.0

for logprob_entry in logprobs_list:
    token = logprob_entry.token
    logprob = logprob_entry.logprob
    pct = math.exp(logprob) * 100  # Convert logprob to a percentage
    total_pct += pct
    data.append([token, logprob, pct])

print(
    tabulate(
        data,
        headers=["Token", "Log Probability", "Percentage (%)"],
        tablefmt="github",
        floatfmt=("s", ".10f", ".10f")
    )
)
print(f"\nTotal probabilities: {total_pct:.6f}%")

If you run this, you’ll get something like the following output, but actual numbers will vary.

| Token     |   Log Probability |   Percentage (%) |
|-----------|-------------------|------------------|
| Heads     |     -0.0380541235 |    96.2660836887 |
| T         |     -3.2880542278 |     3.7326407467 |
| Sure      |    -12.5380544662 |     0.0003587502 |
| Head      |    -12.7880544662 |     0.0002793949 |
| Tail      |    -13.2880544662 |     0.0001694616 |
| Certainly |    -13.5380544662 |     0.0001319768 |
| "T        |    -14.2880544662 |     0.0000623414 |
| I'm       |    -14.5380544662 |     0.0000485516 |
| heads     |    -14.5380544662 |     0.0000485516 |
| Heads     |    -14.9130544662 |     0.0000333690 |
| "         |    -15.1630544662 |     0.0000259878 |
| _heads    |    -15.1630544662 |     0.0000259878 |
| tails     |    -15.5380544662 |     0.0000178611 |
| HEAD      |    -15.7880544662 |     0.0000139103 |
| TAIL      |    -16.2880535126 |     0.0000084370 |
| T         |    -16.7880535126 |     0.0000051173 |
| ```       |    -16.7880535126 |     0.0000051173 |
| Here's    |    -16.9130535126 |     0.0000045160 |
| I         |    -17.2880535126 |     0.0000031038 |
| As        |    -17.2880535126 |     0.0000031038 |

Total probabilities: 99.999970%

Looking at these probabilities, we see Heads at ≈96% and T at ≈4%. Our prompt is doing pretty well at constraining the model’s responses. Why T and not Tails? This is the tokenizer splitting Tails into T + ails, while keeping Heads as one piece, as we can see in this Python session:

>>> import tiktoken
>>> encoding = tiktoken.encoding_for_model("gpt-4o-2024-08-06")
>>> encoding.encode('Tails')
[51, 2196]
>>> encoding.decode([51])
'T'
>>> encoding.encode('Heads')
[181043]

These probabilities are not deterministic

Run the code to display the probabilities for the top 20 tokens again, and you’ll likely get different numbers. Here’s what I got on a second running.

| Token     |   Log Probability |   Percentage (%) |
|-----------|-------------------|------------------|
| Heads     |     -0.0110520627 |    98.9008786933 |
| T         |     -4.5110521317 |     1.0986894433 |
| Certainly |    -14.0110521317 |     0.0000822389 |
| Head      |    -14.2610521317 |     0.0000640477 |
| Sure      |    -14.2610521317 |     0.0000640477 |
| Tail      |    -14.3860521317 |     0.0000565219 |
| heads     |    -15.3860521317 |     0.0000207933 |
| Heads     |    -15.5110521317 |     0.0000183500 |
| ```       |    -15.5110521317 |     0.0000183500 |
| _heads    |    -15.6360521317 |     0.0000161938 |
| tails     |    -15.6360521317 |     0.0000161938 |
| I'm       |    -15.8860521317 |     0.0000126117 |
| "T        |    -15.8860521317 |     0.0000126117 |
| As        |    -16.3860511780 |     0.0000076494 |
| "         |    -16.5110511780 |     0.0000067506 |
| HEAD      |    -16.6360511780 |     0.0000059574 |
| TAIL      |    -16.7610511780 |     0.0000052574 |
| Here's    |    -16.7610511780 |     0.0000052574 |
| ``        |    -17.1360511780 |     0.0000036133 |
| T         |    -17.6360511780 |     0.0000021916 |

Total probabilities: 99.999987%

In their cookbook, OpenAI offers the following advice on receiving “mostly identical” outputs:

If the seed, request parameters, and system_fingerprint all match across your requests, then model outputs will mostly be identical. There is a small chance that responses differ even when request parameters and system_fingerprint match, due to the inherent non-determinism of our models.

They also give “mostly identical” advice in the reproducible outputs section of their documentation.

The request parameters that could affect randomness are temperature and seed. OpenAI also suggests we track system_fingerprint, because differences here might cause differences in output. We’ll examine each of these below, but spoiler: none of them will fix or even explain this non-determinism.

Temperature, and why it won’t fix this

Temperature controls how random the model’s responses are. Low temperatures (<0.5) make it robotic and predictable, medium temperatures (0.7–1.3) allow some creativity, and high temperatures (>1.5) produce gibberish. Temperature is often called the “creativity parameter”, but this is an oversimplification. In their analysis, Peeperkorn, Kouwenhoven, Brown, and Jordanous (2024) evaluated LLM outputs across four dimensions of creativity: novelty (originality), coherence (logical consistency), cohesion (how well the text flows), and typicality (how well it fits expected patterns). They observed that:

temperature is weakly correlated with novelty, and unsurprisingly, moderately correlated with incoherence, but there is no relationship with either cohesion or typicality.

But, this is beside the point for coin flipping. Under the hood, the log probabilities are divided by the temperature before they’re renormalized and exponentiated to be converted to probabilities. This creates a non-linear effect: temperature=0.5 squares the probabilities, making likely tokens dominate, while temperature=2.0 applies a square root, flattening the distribution.

What about temperature=0.0? Instead of breaking math dividing by zero, the model simply picks the highest-probability token. Sounds deterministic, right? Not quite. Here’s the catch: temperature only comes into play after the log probabilities are computed, when we convert them to probabilities.

In summary: if the logprobs aren’t deterministic, setting temperature to 0.0 won’t make the model deterministic.

In fact, since we’re just asking the model for the raw logprobs directly rather than generating full responses, the temperature setting doesn’t come into play in our code at all.

Seeds, and why they won’t fix this

After temperature is used to compute probabilities, the model samples from these probabilities to pick the next token. OpenAI gives us a little control over the sampling process by letting us set the seed parameter for the random number generator. In an ideal world, setting a seed would give us determinism at any temperature. But seeds only affect sampling, not the log probabilities before sampling.

In summary: if the logprobs aren’t deterministic, setting a seed won’t make the model deterministic.

In fact, seed only matters with non-zero temperatures. With temperature=0.0, the model is always choosing the highest probability token regardless of the seed. Again, since we’re just asking the model for the raw logprobs directly rather than sampling, neither of these settings can help us achieve determinism.

System fingerprints, our last hope

The system_fingerprint identifies the current combination of model weights, infrastructure, and configuration options in OpenAI’s backend. At least, that’s what OpenAI tells us. Variations in system fingerprints might indeed explain variations in logprobs. Except that they don’t, as we will verify below.

Nothing can get you determinism

Let’s confirm what we’ve been building toward. We’ll run the same request 10 times with every safeguard in place. Even though neither of these parameters should matter for what we’re doing, you can never be too safe, so we’ll set temperature=0.0 and seed=42. And to see if infrastructure differences explain our varying logprobs, we’ll print system_fingerprint. Here’s the code:

import os
import math
from openai import OpenAI
from tabulate import tabulate
from tqdm import tqdm

client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

prompt = 'Flip a coin. Return Heads or Tails only.'

data = []

for _ in tqdm(range(10), desc='Generating responses'):
    response = client.chat.completions.create(
        model='gpt-4o-2024-08-06',
        temperature=0.0,
        seed=42,
        max_tokens=1,
        logprobs=True,
        top_logprobs=20,
        messages=[{'role': 'user', 'content': prompt}],
    )

    fingerprint = response.system_fingerprint
    logprobs_list = response.choices[0].logprobs.content[0].top_logprobs
    heads_logprob = next(
        entry.logprob for entry in logprobs_list if entry.token == 'Heads'
    )
    pct = math.exp(heads_logprob) * 100
    data.append([fingerprint, heads_logprob, f"{pct:.10f}%"])

headers = ["Fingerprint", "Logprob", "Probability"]
print(tabulate(data, headers=headers, tablefmt="pipe"))

Running this 10 times, here are the logprobs and probabilities for the token Heads:

| Fingerprint   |    Logprob | Probability    |
|---------------|------------|----------------|
| fp_f9f4fb6dbf | -0.0380541 | 96.2660836887% |
| fp_f9f4fb6dbf | -0.0380541 | 96.2660836887% |
| fp_f9f4fb6dbf | -0.0380541 | 96.2660836887% |
| fp_f9f4fb6dbf | -0.0380541 | 96.2660836887% |
| fp_f9f4fb6dbf | -0.160339  | 85.1854886858% |
| fp_f9f4fb6dbf | -0.0380541 | 96.2660836887% |
| fp_f9f4fb6dbf | -0.0110521 | 98.9008786933% |
| fp_f9f4fb6dbf | -0.0380541 | 96.2660836887% |
| fp_f9f4fb6dbf | -0.0380541 | 96.2660836887% |
| fp_f9f4fb6dbf | -0.0380541 | 96.2660836887% |

Mixture-of-experts makes determinism impossible

OpenAI is decidedly not open about the architecture behind GPT-4o. However, it’s widely believed that GPT-4o uses a mixture-of-experts (MoE) architecture with either 8 or 16 experts.

According to a paper by Google DeepMind researchers Puigcerver, Riquelme, Mustafa, and Houlsby (hat tip to user elmstedt on the OpenAI forum), mixture-of-experts architectures may add an unavoidable level of non-determinism:

Under capacity constraints, all Sparse MoE approaches route tokens in groups of a fixed size and enforce (or encourage) balance within the group. When groups contain tokens from different sequences or inputs, these tokens compete for available spots in expert buffers. Therefore, the model is no longer deterministic at the sequence-level, but only at the batch-level.

In other words, when your prompt (a sequence of tokens, in the quote above) reaches OpenAI’s servers, it gets batched with a group of other prompts (OpenAI isn’t open about how many other prompts). Each prompt in the batch is then routed to an “expert” within the model. However, since only so many prompts can be routed to the same expert, the expert your prompt gets routed to will depend on all the other prompts in the batch.

This “competition” for experts introduces a real-world randomness completely beyond our control.

Non-determinism beyond mixture-of-experts

While non-determinism may be inherent to real-world mixture-of-experts models, that does not seem to be the only source of non-determinism in OpenAI’s models.

Making a few changes to our code above (switching to gpt-3.5-turbo-0125, looking for the token He since GPT-3.5’s tokenizer splits “Heads” differently, and ignoring system_fingerprint because this model doesn’t have it) reveals that GPT-3.5-turbo also exhibits non-deterministic logprobs:

|     Logprob | Probability    |
|-------------|----------------|
| -0.00278289 | 99.7220983436% |
| -0.00415331 | 99.5855302068% |
| -0.00258838 | 99.7414961980% |
| -0.00204034 | 99.7961735289% |
| -0.00240277 | 99.7600117933% |
| -0.00204034 | 99.7961735289% |
| -0.00204034 | 99.7961735289% |
| -0.00258838 | 99.7414961980% |
| -0.00351419 | 99.6491976144% |
| -0.00201214 | 99.7989878007% |

No one is claiming that GPT-3.5-turbo uses a mixture-of-experts architecture. Thus, there must be additional factors beyond mixture-of-experts contributing to this non-determinism.

What 10,000 GPT-4o coin flip probabilities tell us

To better understand the patterns and magnitude of this non-determinism, I conducted a more extensive experiment with GPT-4o, performing 10,000 “coin flips” while recording the probability assigned to “Heads” in each case.

The results reveal something fascinating. Across 10,000 API calls with identical parameters, GPT-4o produced not just a few different probability values, but 42 distinct probabilities. If the mixture-of-experts hypothesis were the complete explanation for non-determinism in GPT-4o, we might expect to see one distinct probability for each expert. But GPT-4o is believed to have either 8 or 16 experts, not 42.

In the output below, I clustered these probabilities, ensuring that each cluster was separated from the others by 0.01 (as a raw percentage). This groups the output into 12 clusters.

Probability          Count           Fingerprints
------------------------------------------------------------------
85.1854379113%       5               fp_eb9dce56a8, fp_f9f4fb6dbf
85.1854455275%       74              fp_eb9dce56a8, fp_f9f4fb6dbf
85.1854886858%       180             fp_eb9dce56a8, fp_f9f4fb6dbf
------------------------------------------------------------------
88.0662448207%       31              fp_eb9dce56a8, fp_f9f4fb6dbf
88.0678628883%       2               fp_f9f4fb6dbf
------------------------------------------------------------------
92.3997629747%       1               fp_eb9dce56a8
92.3997733012%       4               fp_eb9dce56a8
92.3997836277%       3               fp_eb9dce56a8
------------------------------------------------------------------
92.4128943690%       1               fp_f9f4fb6dbf
92.4129143363%       21              fp_eb9dce56a8, fp_f9f4fb6dbf
92.4129246643%       8               fp_eb9dce56a8, fp_f9f4fb6dbf
------------------------------------------------------------------
93.9906837191%       4               fp_eb9dce56a8
------------------------------------------------------------------
95.2569999350%       36              fp_eb9dce56a8
------------------------------------------------------------------
96.2660836887%       3391            fp_eb9dce56a8, fp_f9f4fb6dbf
96.2661285161%       2636            fp_eb9dce56a8, fp_f9f4fb6dbf
------------------------------------------------------------------
97.0674551052%       1               fp_eb9dce56a8
97.0674778863%       3               fp_eb9dce56a8
97.0675003058%       4               fp_eb9dce56a8
97.0675116963%       1               fp_eb9dce56a8
97.0680739932%       19              fp_eb9dce56a8, fp_f9f4fb6dbf
97.0681293191%       6               fp_eb9dce56a8, fp_f9f4fb6dbf
97.0681521003%       74              fp_eb9dce56a8, fp_f9f4fb6dbf
97.0682421405%       4               fp_eb9dce56a8
------------------------------------------------------------------
97.7008960695%       1               fp_f9f4fb6dbf
97.7011122645%       3               fp_eb9dce56a8
97.7011462953%       3               fp_eb9dce56a8
97.7018178132%       1               fp_eb9dce56a8
------------------------------------------------------------------
98.2006069902%       426             fp_eb9dce56a8, fp_f9f4fb6dbf
98.2006876548%       6               fp_f9f4fb6dbf
98.2007107019%       1               fp_eb9dce56a8
98.2009525133%       5               fp_eb9dce56a8
98.2009751945%       1               fp_eb9dce56a8
98.2009867181%       1               fp_eb9dce56a8
------------------------------------------------------------------
98.5930987656%       3               fp_eb9dce56a8, fp_f9f4fb6dbf
98.5931104270%       235             fp_eb9dce56a8, fp_f9f4fb6dbf
98.5931222721%       4               fp_eb9dce56a8, fp_f9f4fb6dbf
98.5931340253%       9               fp_eb9dce56a8
98.5931571644%       159             fp_eb9dce56a8, fp_f9f4fb6dbf
98.5931805790%       384             fp_eb9dce56a8
------------------------------------------------------------------
98.9008436920%       95              fp_eb9dce56a8, fp_f9f4fb6dbf
98.9008550214%       362             fp_eb9dce56a8, fp_f9f4fb6dbf
98.9008786933%       1792            fp_eb9dce56a8, fp_f9f4fb6dbf

(With a threshold of 0.001 there are 13 clusters, and with a threshold of 0.0001 there are 17 clusters.)

As the chart above demonstrates, this multitude of results cannot be explained by system_fingerprint values. Across all 10,000 calls, I received only two different system fingerprints: 4488 results with fp_f9f4fb6dbf and 5512 with fp_eb9dce56a8, and for the most part the two system fingerprints returned the same sets probabilities, rather than each fingerprint producing its own distinct set of probabilities.

It could be that these 12 clusters of probabilities represent 12 different experts. Even assuming that, the variations within the clusters remain puzzling. These don’t seem likely to be simple rounding errors, because they are too systematic and consistent. Take the giant cluster at around 96.266% with two distinct probabilities representing over half of our coin flips. The difference between these two probabilities, 0.0000448274%, is tiny but persistent.

Conclusion: Non-determinism is baked in

There is an underlying randomness in the log probabilities returned by all currently available non-thinking OpenAI models: GPT-4o, GPT-4o-mini, and the two flavors of GPT-3.5-turbo. Because this non-determinism is baked into the log probabilities, there’s no way for a user to get around it. Temperature and seed values have no effect, and system fingerprints don’t explain it.

While mixture-of-experts architectures inherently introduce some randomness in the competition for experts, the non-determinism in GPT-4o seems to go far beyond this, and the non-determinism in GPT-3.5-turbo can’t be explained by this at all, because GPT-3.5-turbo isn’t a mixture-of-experts model.

While we can’t verify this claim any more because the model isn’t being served, this behaviour wasn’t seen with GPT-3, according to user _j on the OpenAI forum:

It is a symptom that was not seen on prior GPT-3 AI models where across hundreds of trials to investigate sampling, you never had to doubt that logprobs would be the same. Even if you found a top-2 answer that returned exactly the same logprob value via the API, you would never see them switch position or return different values.

This suggests that whatever is causing this randomness first emerged in either GPT-3.5 or GPT-3.5-turbo.

But regardless of when it emerged, this non-determinism is a serious obstacle to understanding these models. If you want to study a model—how it generalizes, how it biases responses, how it assigns probabilities to different tokens—you need consistency. but as we’ve seen, even when we lock down every knob OpenAI lets us touch, we still can’t get an answer to the simplest possible question: “what is the probability that GPT-4o says a coin lands heads?”

Worse, while mixture-of-experts explains some of this non-determinism, there are clearly other, hidden sources of randomness that we can’t see, control, or understand. In an ideal world, the API would provide more transparency by telling us which expert processed our request or by offering additional parameters to control this routing process. Without such visibility, we’re left guessing at the true nature of the variability.

References

Bar-Hillel, M., Peer, E., & Acquisti, A. (2014). “Heads or tails?” – A reachability bias in binary choice. Journal of Experimental Psychology: Learning, Memory, and Cognition, 40(6), 1656–1663. https://doi.org/10.1037/xlm0000005.

Peeperkorn, M., Kouwenhoven, T., Brown, D., & Jordanous, A. (2024). Is temperature the creativity parameter of Large Language Models?. In The 15th International Conference on Computational Creativity (ICCC’24). arXiv:2405.00492.

Puigcerver, J., Riquelme, C., Mustafa, B., & Houlsby, N. (2024). From sparse to soft mixtures of experts. In The Twelfth International Conference on Learning Representations (ICLR 2024). https://openreview.net/forum?id=jxpsAj7ltE. arXiv:2308.00951.Van Koevering, K., & Kleinberg, J. (2024). How random is random? Evaluating the Randomness and humanness of LLMs’ coin flips. arXiv:2406.00092.

The post Avoidable and Unavoidable Randomness in GPT-4o appeared first on Towards Data Science.

Training Language Models with Textbook-Quality Synthetic Data

Vincent Vatter — Fri, 30 Jun 2023 06:43:18 +0000

Image by Dall-E 3

Microsoft Research just released a paper adding new fuel to the ongoing debates about the role of data in model training, specifically touching on the roles of data quality and synthetic data. While the paper’s focus is on training models to write Python code, its implications go far beyond coding. The insights from this work can serve as a valuable case study for language model projects in myriad contexts.

The models in Textbooks Are All You Need don’t owe their success to any ground-breaking design or training methods. In fact, the authors state that "our model architecture and training methods are fairly conventional." Instead, innovation lies in the training data. To quote from the paper:

"We hypothesize that such high quality data dramatically improves the learning efficiency of language models for code as they provide clear, self-contained, instructive, and balanced examples of coding concepts and skills."

The value of data quality is, in some ways, a given – it’s hard to imagine anyone advocating for training on lower-quality data when there’s an equal amount of better-quality data at hand. But opinions on the relative importance of data quality have seen a notable shift over the past few years.

Back in 2020, the OpenAI paper Scaling Laws for Neural Language Models positioned model size as the most important factor: "optimally compute-efficient training involves training very large models on a relatively modest amount of data". Then in 2022, DeepMind’s Chinchilla paper, Training Compute-Optimal Large Language Models, argued that data size was equally critical: "current large language models are significantly undertrained". But now, in 2023, the spotlight is shifting towards data quality. This shift is underlined by a section in the recently leaked Google memo titled We Have No Moat, which declared, "Data quality scales better than data size".

The Textbooks Are All You Need paper analyzed here is just one highlight of this larger movement. Another noteworthy example is LIMA: Less is More for Alignment, which shows how a small but high-quality dataset can be used to achieve impressive results in model alignment.

The utility of synthetic data – data generated by models themselves – has been a topic of much debate. Attempts to train smaller models on the output of larger models, such as in the creation of Alpaca and Vicuna, have met with skepticism. Critics often point to arguments such as those in the Berkeley paper The False Promise of Imitating Proprietary LLMs, which states that "model imitation is a false promise: there exists a substantial capabilities gap between open and closed LMs that, with current methods, can only be bridged using an unwieldy amount of imitation data or by using more capable base LMs".

However, Textbooks Are All You Need challenges this perspective, demonstrating that the output of larger models can be utilized for purposes beyond mere imitation. Remarkably, the paper’s small model even manages to outperform the large model that generated the synthetic data it was trained on. This observation prompts a tantalizing question: Could the performance of large models be enhanced by training them on their own output?

The results

Before delving into the training data used to train the models, let’s glance at the results they achieve. The three models in the paper are phi-1-base, phi-1, and phi-1-small. Notably, these models aren’t just compact in terms of parameters, they’re also trained on limited data. Given this, their performance is nothing short of astonishing.

Evaluation of selected models on the HumanEval benchmark. Source: adapted from Textbooks Are All You Need.

The scores here are on OpenAI’s HumanEval benchmark, introduced in their paper Evaluating Large Language Models Trained on Code. In the problems in this benchmark, the model is told a function signature and docstring, and asked to write the body of the function. To illustrate, consider the following example drawn from the HumanEval paper, where the model is given the following signature and docstring.

Source: Evaluating Large Language Models Trained on Code.

For this problem, we hope the model would generate something like this:

Source: Evaluating Large Language Models Trained on Code.

However, the model is not evaluated based on producing this exact string (that would require the model to solve the problem in the same way and with the same variable names as the solution), but rather whatever body the model produces is evaluated on several unit tests (on average, 7.7 unit tests per problem, each test consisting of a choice of parameters for the function and the expected output that the generated code needs to match). The code is then deemed to be correct if it passes all of the unit tests. The pass@1 metric in the table above is merely the percentage of generated function bodies that pass all of the unit tests. The more general pass@k metrics allow models to general k samples, and consider it a success if any one of those samples passes all of the unit tests.

The models in the paper were trained on data from three different sources. The first, The Stack+, is a 35B-token, deduplicated version of The Stack, together with code from StackOverflow, and restricted to Python. However, it’s important to note that phi-1 and its variants are not trained on this source. Instead, these models are trained on CodeTextbook, a textbook-quality 6B-token filtered selection from The Stack+ together with a 1B-token synthetic component, and CodeExercises, a 180M-token synthetic set of exercises and solutions mirroring the problem style found in the HumanEval dataset. The effects are shown in the figure below.

HumanEval results after training on various sources. Image from Textbooks Are All You Need.

Here we see 9 models with varying parameters trained on varying subsets of this data. The models in light green in this chart are trained only on CodeTextbook, and not on The Stack+, so it is evident that CodeTextbook is a better source. The fine-tuning on CodeExercises that the models in dark green received makes an even bigger difference.

Three of the models in the chart are named:

phi-1-base is a 1.3B parameter model (pre)trained with "about 8 passes" over the 7B tokens of CodeTextbook. This amounts to about 50B tokens of training data, and took took 4 days on 8 A100s.
phi-1 is the result of fine-tuning phi-1-base on the 180M tokens of CodeExercises. This fine-tuning took 7 hours on 8 A100s.
phi-1-small is made using a similar process as phi-1, but with a 350M parameter model design and apparently about 11 passes over the CodeTextbook. It takes about 2 days to train on 8 A100s.

The filtered part of CodeTextbook (6B tokens)

For this part of CodeTextbook, they started with a 35B-token deduplicated and Python-restricted copy of The Stack together with code from StackOverflow referred to as Stack+ in the chart above. Then they filtered down to a 6B-token textbook-quality subset.

To do this filtering, GPT-4 is first used to determine the educational value of about 0.3% of the entire 35B-token dataset (100M tokens). The prompt used is "determine its educational value for a student whose goal is to learn basic coding concepts".

It’s not explicitly stated why GPT-4 was chosen over GPT-3.5 for this step, since GPT-3.5 is used for all other stages of the process. However, considering the task is classifying "only" 100M tokens, the use of GPT-4 is not overly expensive will certainly yield more accurate results.

Next, these annotations are used to train another model (a random forest classifier) to classify the rest of the dataset as high or low educational value. Subsequently, this classifier is used to filter the original dataset to a 6B-token dataset of high educational quality.

The synthetic part of CodeTextbook (1B tokens)

This is where things get more interesting, as the authors use GPT-3.5 to generate synthetic high quality "Python textbooks".

There is some precedent for using LLMs to generate synthetic data used to train smaller models. In an earlier Microsoft Research paper, TinyStories: How Small Can Language Models Be and Still Speak Coherent English?, the goal is to train small language models (1M to 33M parameters) to write intelligible stories at the level of toddlers, and the dataset consists entirely of stories written by GPT-3.5 and GPT-4. Quoting from the TinyStories paper:

"The main challenge in using large language models for producing training data is generating a dataset that is sufficiently diverse: prompting those models to produce stories, even if the temperature of generation is set to a high value, will still produce a very repetitive dataset, whose diversity is very far from what is required for training a language model that has a comparable "understanding" of language to that of children."

The trick TinyStories uses to diversify synthetic data is to choose three random words (a noun, a verb, and an adjective) and a small number of "story features" for each prompt. For example, one of their prompts is the following.

Source: TinyStories: How Small Can Language Models Be and Still Speak Coherent English?

Unfortunately, Microsoft Research doesn’t give us nearly as many details about their trick for generating a diverse collection of textbook-quality text, and the project does not appear to have released any code or data for us to investigate. They do say that they target the content to be "topics that prompt reasoning and basic algorithmic skills", and that they provide constraints on the topics and on the audience of the textbook. Below is their example of a typical response to one of their prompts, quoted from the paper.

Source: Textbooks Are All You Need.

Needless to say, it would be interesting to know a lot more about this step of the process. What are the specific prompts? How are the topics chosen? What audience(s?) is GPT-3.5 told to write for? It would also be interesting to inspect CodeTextbook, but the data has not been released.

CodeExercises (180M tokens)

The final piece of the training data for phi-1 and phi-1-small (though not for phi-1-base) is a set of exercises and solutions that mirror the format of the HumanEval benchmark problems. Once again, this data is entirely synthetic and produced by GPT-3.5. The authors say that diversity in the outputs was achieved by constraining the function names. While the exact meaning of this is not clear to me, it might entail another model generate a list of function names and signatures first, and then prompting GPT-3.5 to generate the corresponding docstring and body. The authors provide an example of a typical output, quoted below.

Source: Textbooks Are All You Need.

The authors refer to this dataset as small because it contains only 180M tokens. However, if the example above is representative, then CodeExercises contains on the order of one million exercises and solutions.

It’s fair to be suspicious that CodeExercises is simply stumbling onto the same functions as are in the HumanEval benchmark, leading to phi-1 being fine-tuned on solutions to the very exercises it is tested on. The authors devote considerable space (all of Section 5) to arguing against this concern. They first contend that there is limited similarity between CodeExercises and HumanEval. Secondly, they argue that even when exercises in CodeExercises that bear a slight resemblance to those in HumanEval are pruned (where resemblance is measured in terms of embedding distance), models trained on the pruned datasets remain impressive.

Cost

The focus of the paper, and of this deep dive into the paper, has been on data quality. However, it’s enlightening to consider what it would cost to duplicate the experiment today, at least to consider the relative costs of its individual components.

Filtering. The process of filtering The Stack+ involved using GPT-4 to determine the educational value of 100,000 files, or about 100M input tokens. Ignoring the output tokens (which would be minimal) and using today’s price of $0.03 / 1K input tokens, this would cost about $3,000.
Synthesizing. CodeTextbook and CodeExercises together contain about 1280M tokens of GPT-3.5-generated text. At today’s price of $0.002 / 1K output tokens, creating this data would cost a little over $2,500.
Training. The phi-1 model was trained for 1090 hours. At today’s price of about $1/hour for an A100, this would amount to about $1,000. The 350M-parameter phi-1-small could be trained for $400.

Approximately $6,500 of compute went into the creation of phi-1.

The authors speculate that using GPT-4 for the synthesizing would be a lot better: "we also believe that significant gains could be achieved by using GPT-4 to generate the synthetic data instead of GPT-3.5, as we noticed that GPT-3.5 data has a high error rate." But, these costs show why they didn’t. At 30 times the price of GPT-3.5, it would cost about $75,000 to generate the synthetic portion of CodeTextbook and CodeExercises with GPT-4.

Conclusion

The results from Textbooks Are All You Need are very impressive, especially given the smaller size of the models and the limited training data they were given. This paper is one more piece of evidence that data quality can make up for data quantity and model size.

The discussion around synthetic data will undoubtedly persist. The concept is appealing – if we don’t have high-quality data readily available, could we just synthesize it? Textbooks Are All You Need teases some promising possibilities in this area. Still, it’s not the perfect experiment we might dream of, given that only about 1B of the 7B tokens in CodeTextbook were synthetically created. But it’s worth pointing out that the other 6B tokens were filtered synthetically.

Training on entirely synthetic data has shown some exciting results in the field of image processing. The Google Research study, StableRep: Synthetic Images from Text-to-Image Models Make Strong Visual Representation Learners, takes a text-to-image model and trains it entirely on synthetic data produced by Stable Diffusion. The outcomes they report match or surpass the performance of Stable Diffusion itself.

A similar approach was taken with the TinyStories paper, which relied only on synthetic data for training. But, the models it used were very small. What if larger language models were trained in the same way? The potential this presents is exciting, and it will no doubt be the focus of numerous studies in the future.

References

Chen, M., Tworek, J., Jun, H., Yuan, Q., de Oliveira Pinto, H. P., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., Ryder, N., Pavlov, M., Power, A., Kaiser, L., Bavarian, M., Winter, C., Tillet, P., Such, F. P., Cummings, D., Plappert, M., Chantzis, F., Barnes, E., Herbert-Voss, A., Guss, W. H., Nichol, A., Paino, A., Tezak, N., Tang, J., Babuschkin, I., Balaji, S., Jain, S., Saunders, W., Hesse, C., Carr, A. N., Leike, J., Achiam, J., Misra, V., Morikawa, E., Radford, A., Knight, M., Brundage, M., Murati, M., Mayer, K., Welinder, P., McGrew, B., Amodei, D., McCandlish, S., Sutskever, I., and Zaremba, W. (2021). Evaluating large language models trained on code. arXiv:2107.03374.

Eldan, R. and Li, Y. (2023). TinyStories: How small can language models be and still speak coherent English? arXiv:2305.07759.

Gudibande, A., Wallace, E., Snell, C., Geng, X., Liu, H., Abbeel, P., Levine, S., and Song, D. (2023). The false promise of imitating proprietary LLMs. arXiv:2305.15717.

Gunasekar, S., Zhang, Y., Aneja, J., Mendes, C. C. T., Giorno, A. D., Gopi, S., Javaheripi, M., Kau mann, P., de Rosa, G., Saarikivi, O., Salim, A., Shah, S., Behl, H. S., Wang, X., Bubeck, S., Eldan, R., Kalai, A. T., Lee, Y. T., and Li, Y. (2023). Textbooks are all you need. arXiv:2306.11644.

Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., de Las Casas, D., Hendricks, L. A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., van den Driessche, G., Damoc, B., Guy, A., Osin- dero, S., Simonyan, K., Elsen, E., Rae, J. W., Vinyals, O., and Sifre, L. (2022). Training compute-optimal large language models. arXiv:2203.15556.

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. (2020). Scaling laws for neural language models. arXiv:2001.08361. Tian, Y., Fan, L., Isola, P., Chang, H., and Krishnan, D. (2023). StableRep: Synthetic images from text-to-image models make strong visual representa- tion learners. arXiv:2306.00984. Zhou, C., Liu, P., Xu, P., Iyer, S., Sun, J., Mao, Y., Ma, X., Efrat, A., Yu, P., Yu, L., Zhang, S., Ghosh, G., Lewis, M., Zettlemoyer, L., and Levy, O. (2023). LIMA: Less is more for alignment. arXiv:2305.11206.

The post Training Language Models with Textbook-Quality Synthetic Data appeared first on Towards Data Science.