
Training Language Models with Textbook-Quality Synthetic Data

An exploration of Microsoft Research's paper 'Textbooks Are All You Need'

Image by Dall-E 3

Microsoft Research just released a paper adding new fuel to the ongoing debates about the role of data in model training, specifically touching on the roles of data quality and synthetic data. While the paper’s focus is on training models to write Python code, its implications go far beyond coding. The insights from this work can serve as a valuable case study for language model projects in myriad contexts.

The models in Textbooks Are All You Need don’t owe their success to any ground-breaking design or training methods. In fact, the authors state that "our model architecture and training methods are fairly conventional." Instead, the innovation lies in the training data. To quote from the paper:

"We hypothesize that such high quality data dramatically improves the learning efficiency of language models for code as they provide clear, self-contained, instructive, and balanced examples of coding concepts and skills."


The value of data quality is, in some ways, a given – it’s hard to imagine anyone advocating for training on lower-quality data when there’s an equal amount of better-quality data at hand. But opinions on the relative importance of data quality have seen a notable shift over the past few years.

Back in 2020, the OpenAI paper Scaling Laws for Neural Language Models positioned model size as the most important factor: "optimally compute-efficient training involves training very large models on a relatively modest amount of data". Then in 2022, DeepMind’s Chinchilla paper, Training Compute-Optimal Large Language Models, argued that data size was equally critical: "current large language models are significantly undertrained". But now, in 2023, the spotlight is shifting towards data quality. This shift is underlined by a section in the recently leaked Google memo titled We Have No Moat, which declared, "Data quality scales better than data size".

The Textbooks Are All You Need paper analyzed here is just one highlight of this larger movement. Another noteworthy example is LIMA: Less is More for Alignment, which shows how a small but high-quality dataset can be used to achieve impressive results in model alignment.


The utility of synthetic data – data generated by models themselves – has been a topic of much debate. Attempts to train smaller models on the output of larger models, such as in the creation of Alpaca and Vicuna, have met with skepticism. Critics often point to arguments such as those in the Berkeley paper The False Promise of Imitating Proprietary LLMs, which states that "model imitation is a false promise: there exists a substantial capabilities gap between open and closed LMs that, with current methods, can only be bridged using an unwieldy amount of imitation data or by using more capable base LMs".

However, Textbooks Are All You Need challenges this perspective, demonstrating that the output of larger models can be utilized for purposes beyond mere imitation. Remarkably, the paper’s small model even manages to outperform the large model that generated the synthetic data it was trained on. This observation prompts a tantalizing question: Could the performance of large models be enhanced by training them on their own output?

The results

Before delving into the data used to train the models, let’s glance at the results they achieve. The three models in the paper are phi-1-base, phi-1, and phi-1-small. Notably, these models aren’t just compact in terms of parameters; they’re also trained on limited data. Given this, their performance is nothing short of astonishing.

Evaluation of selected models on the HumanEval benchmark. Source: adapted from Textbooks Are All You Need.

The scores here are on OpenAI’s HumanEval benchmark, introduced in their paper Evaluating Large Language Models Trained on Code. In each problem of this benchmark, the model is given a function signature and docstring and asked to write the body of the function. To illustrate, consider the following example drawn from the HumanEval paper, where the model is given the following signature and docstring.

Source: Evaluating Large Language Models Trained on Code.
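For illustration, a prompt of this kind looks roughly like the following (the incr_list function used here is one of the examples given in the HumanEval paper, though not necessarily the one shown in the figure above):

def incr_list(l: list):
    """Return list with elements incremented by 1.
    >>> incr_list([1, 2, 3])
    [2, 3, 4]
    >>> incr_list([5, 3, 5, 2, 3, 3, 9, 0, 123])
    [6, 4, 6, 3, 4, 4, 10, 1, 124]
    """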

For this problem, we hope the model would generate something like this:

Source: Evaluating Large Language Models Trained on Code.
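For the illustrative incr_list prompt above, a passing body could be as simple as the following single line:

    return [i + 1 for i in l]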

However, the model is not evaluated on producing this exact string (that would require it to solve the problem in the same way and with the same variable names as the reference solution). Instead, whatever body the model produces is run against several unit tests (on average, 7.7 unit tests per problem, each consisting of a choice of parameters for the function and the expected output that the generated code needs to match). The code is deemed correct if it passes all of the unit tests. The pass@1 metric in the table above is simply the percentage of generated function bodies that pass all of the unit tests. The more general pass@k metric allows a model to generate k samples and counts it a success if any one of those samples passes all of the unit tests.
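To make the metric concrete, here is a toy sketch of how pass@k could be computed from a table of unit-test outcomes. This is only an illustration of the definition above, not OpenAI's actual harness, which executes the generated code in a sandbox and uses an unbiased statistical estimator for pass@k.

import numpy as np

def pass_at_k(results: np.ndarray, k: int) -> float:
    # results[i, j] is True if sample j for problem i passed all unit tests.
    # A problem counts as solved if any of its first k samples passes.
    solved = results[:, :k].any(axis=1)
    return float(solved.mean())

# Toy example: 3 problems, 5 generated samples each.
results = np.array([
    [True,  False, False, True,  False],   # solved on the 1st sample
    [False, False, True,  False, False],   # solved only on the 3rd sample
    [False, False, False, False, False],   # never solved
])
print(pass_at_k(results, k=1))  # 0.33... (pass@1)
print(pass_at_k(results, k=5))  # 0.66... (pass@5)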

The models in the paper were trained on data from three different sources. The first, The Stack+, is a 35B-token, deduplicated version of The Stack, together with code from StackOverflow, and restricted to Python. However, it’s important to note that phi-1 and its variants are not trained on this source. Instead, these models are trained on CodeTextbook, a textbook-quality 6B-token filtered selection from The Stack+ together with a 1B-token synthetic component, and CodeExercises, a 180M-token synthetic set of exercises and solutions mirroring the problem style found in the HumanEval dataset. The effects are shown in the figure below.

HumanEval results after training on various sources. Image from Textbooks Are All You Need.

Here we see 9 models with varying parameters trained on varying subsets of this data. The models in light green in this chart are trained only on CodeTextbook, and not on The Stack+, so it is evident that CodeTextbook is a better source. The fine-tuning on CodeExercises that the models in dark green received makes an even bigger difference.

Three of the models in the chart are named:

  • phi-1-base is a 1.3B parameter model (pre)trained with "about 8 passes" over the 7B tokens of CodeTextbook. This amounts to about 50B tokens of training data, and took 4 days on 8 A100s.
  • phi-1 is the result of fine-tuning phi-1-base on the 180M tokens of CodeExercises. This fine-tuning took 7 hours on 8 A100s.
  • phi-1-small is made using a similar process as phi-1, but with a 350M parameter model design and apparently about 11 passes over CodeTextbook. It took about 2 days to train on 8 A100s.

The filtered part of CodeTextbook (6B tokens)

For this part of CodeTextbook, the authors started with a 35B-token, deduplicated, Python-restricted copy of The Stack together with code from StackOverflow, referred to as The Stack+ in the chart above. They then filtered this down to a 6B-token, textbook-quality subset.

To do this filtering, GPT-4 is first used to determine the educational value of about 0.3% of the entire 35B-token dataset (100M tokens). The prompt used is "determine its educational value for a student whose goal is to learn basic coding concepts".

It’s not explicitly stated why GPT-4 was chosen over GPT-3.5 for this step, since GPT-3.5 is used for all other stages of the process. However, considering the task is classifying "only" 100M tokens, the use of GPT-4 is not overly expensive and will certainly yield more accurate results.

Next, these annotations are used to train another model (a random forest classifier) to classify the rest of the dataset as high or low educational value. Subsequently, this classifier is used to filter the original dataset to a 6B-token dataset of high educational quality.
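A minimal sketch of this two-stage filtering pipeline might look like the following. The feature extraction, label format, and hyperparameters are my own assumptions for illustration; the paper only tells us that GPT-4 annotations were used to train a random forest classifier, which then filters the full dataset.

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import HashingVectorizer

# Stand-in for a real code-embedding model (an assumption on my part; the
# paper does not spell out exactly which features the classifier sees).
vectorizer = HashingVectorizer(n_features=256)

# Stage 1 (assumed format): GPT-4 has labeled a small sample of files as
# high (1) or low (0) educational value.
labeled_code = ["def mean(xs):\n    return sum(xs) / len(xs)", "x=1;x=2;x=3;x=4"]
labels = [1, 0]

# Stage 2: train a cheap classifier on those annotations.
clf = RandomForestClassifier(n_estimators=100)
clf.fit(vectorizer.transform(labeled_code), labels)

# Stage 3: apply the classifier to the full corpus and keep only what it
# predicts to be of high educational value.
full_corpus = ["def square(x):\n    return x * x", "a=[];a+=[1];a+=[2]"]
textbook_quality = [code for code in full_corpus
                    if clf.predict(vectorizer.transform([code]))[0] == 1]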

The synthetic part of CodeTextbook (1B tokens)

This is where things get more interesting, as the authors use GPT-3.5 to generate synthetic high quality "Python textbooks".

There is some precedent for using LLMs to generate synthetic data used to train smaller models. In an earlier Microsoft Research paper, TinyStories: How Small Can Language Models Be and Still Speak Coherent English?, the goal is to train small language models (1M to 33M parameters) to write intelligible stories at the level of toddlers, and the dataset consists entirely of stories written by GPT-3.5 and GPT-4. Quoting from the TinyStories paper:

"The main challenge in using large language models for producing training data is generating a dataset that is sufficiently diverse: prompting those models to produce stories, even if the temperature of generation is set to a high value, will still produce a very repetitive dataset, whose diversity is very far from what is required for training a language model that has a comparable "understanding" of language to that of children."

The trick TinyStories uses to diversify synthetic data is to choose three random words (a noun, a verb, and an adjective) and a small number of "story features" for each prompt. For example, one of their prompts is the following.

Source: TinyStories: How Small Can Language Models Be and Still Speak Coherent English?
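A sketch of this diversification trick might look like the following; the word lists, feature list, and prompt wording here are illustrative stand-ins, not the paper’s actual ones.

import random

nouns = ["dog", "rocket", "garden", "umbrella"]
verbs = ["jump", "whisper", "build", "chase"]
adjectives = ["tiny", "brave", "shiny", "grumpy"]
story_features = ["a dialogue", "a plot twist", "a bad ending", "a moral lesson"]

def make_prompt() -> str:
    # Each prompt gets its own random words and story features, which pushes
    # the generated dataset toward much greater diversity.
    words = [random.choice(nouns), random.choice(verbs), random.choice(adjectives)]
    features = random.sample(story_features, k=2)
    return (
        "Write a short story for a 3-year-old. The story should use the words "
        + ", ".join(words)
        + " and should contain " + " and ".join(features) + "."
    )

print(make_prompt())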

Unfortunately, Microsoft Research doesn’t give us nearly as many details about their trick for generating a diverse collection of textbook-quality text, and the project does not appear to have released any code or data for us to investigate. They do say that they target the content to be "topics that prompt reasoning and basic algorithmic skills", and that they provide constraints on the topics and on the audience of the textbook. Below is their example of a typical response to one of their prompts, quoted from the paper.

Source: Textbooks Are All You Need.

Needless to say, it would be interesting to know a lot more about this step of the process. What are the specific prompts? How are the topics chosen? What audience(s?) is GPT-3.5 told to write for? It would also be interesting to inspect CodeTextbook, but the data has not been released.

CodeExercises (180M tokens)

The final piece of the training data for phi-1 and phi-1-small (though not for phi-1-base) is a set of exercises and solutions that mirror the format of the HumanEval benchmark problems. Once again, this data is entirely synthetic and produced by GPT-3.5. The authors say that diversity in the outputs was achieved by constraining the function names. While the exact meaning of this is not clear to me, it might entail generating a list of function names and signatures first, and then prompting GPT-3.5 to produce the corresponding docstring and body for each one. The authors provide an example of a typical output, quoted below.

Source: Textbooks Are All You Need.
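To make that speculation concrete, a two-step generation loop might look roughly like the following. This is purely a guess at how "constraining the function names" could work; none of these prompts or helpers come from the paper, and call_gpt35 is a placeholder rather than a real API call.

def call_gpt35(prompt: str) -> str:
    # Placeholder for a chat-completion call to GPT-3.5; returns a canned
    # signature so the sketch runs end to end.
    return "def example_function(x: int) -> int:"

def generate_exercises(n_batches: int) -> list[str]:
    exercises = []
    for _ in range(n_batches):
        # Step 1 (speculative): ask for a batch of diverse function names
        # and signatures covering varied basic algorithmic topics.
        signatures = call_gpt35(
            "List 20 diverse Python function signatures covering different "
            "basic algorithmic topics, one per line."
        )
        # Step 2 (speculative): for each signature, ask for a docstring and
        # a correct body, in the style of a coding exercise with solution.
        for signature in signatures.splitlines():
            exercises.append(call_gpt35(
                "Write a docstring and solution for this function:\n" + signature
            ))
    return exercises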

The authors refer to this dataset as small because it contains only 180M tokens. However, if the example above is representative, then CodeExercises contains on the order of one million exercises and solutions.

It’s fair to be suspicious that CodeExercises is simply stumbling onto the same functions as those in the HumanEval benchmark, leading to phi-1 being fine-tuned on solutions to the very exercises it is tested on. The authors devote considerable space (all of Section 5) to arguing against this concern. First, they contend that there is limited similarity between CodeExercises and HumanEval. Second, they argue that even when exercises in CodeExercises that bear a slight resemblance to those in HumanEval are pruned (where resemblance is measured in terms of embedding distance), models trained on the pruned datasets remain impressive.
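A rough sketch of that pruning step, under my own assumptions about the details (the paper measures resemblance with embeddings, but the exact embedding model and threshold used here are illustrative):

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def prune(exercise_embeddings, humaneval_embeddings, threshold=0.9):
    # Keep only the CodeExercises items whose embedding is not too close to
    # any HumanEval problem; the 0.9 threshold is an arbitrary illustration.
    kept = []
    for i, e in enumerate(exercise_embeddings):
        if all(cosine_similarity(e, h) < threshold for h in humaneval_embeddings):
            kept.append(i)
    return kept

# Toy usage with random vectors standing in for real code embeddings.
rng = np.random.default_rng(0)
exercise_embeddings = rng.normal(size=(5, 8))
humaneval_embeddings = rng.normal(size=(3, 8))
print(prune(exercise_embeddings, humaneval_embeddings))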

Cost

The focus of the paper, and of this deep dive into the paper, has been on data quality. However, it’s enlightening to consider what it would cost to duplicate the experiment today, at least to consider the relative costs of its individual components.

  • Filtering. The process of filtering The Stack+ involved using GPT-4 to determine the educational value of 100,000 files, or about 100M input tokens. Ignoring the output tokens (which would be minimal) and using today’s price of $0.03 / 1K input tokens, this would cost about $3,000.
  • Synthesizing. CodeTextbook and CodeExercises together contain about 1280M tokens of GPT-3.5-generated text. At today’s price of $0.002 / 1K output tokens, creating this data would cost a little over $2,500.
  • Training. The phi-1 model was trained for 1090 hours. At today’s price of about $1/hour for an A100, this would amount to about $1,000. The 350M-parameter phi-1-small could be trained for $400.

Approximately $6,500 of compute went into the creation of phi-1.
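The arithmetic behind these estimates is simple enough to lay out explicitly, using the prices quoted above (all figures are rough):

# Rough reproduction of the cost estimates above.
filtering = 100e6 / 1000 * 0.03    # 100M GPT-4 input tokens      -> ~$3,000
synthesis = 1280e6 / 1000 * 0.002  # ~1.28B GPT-3.5 output tokens -> ~$2,560
training  = 1090 * 1.0             # ~1090 A100-hours at ~$1/hour -> ~$1,090
print(round(filtering + synthesis + training))  # about $6,650, i.e. roughly $6,500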

The authors speculate that using GPT-4 for the synthesizing would be a lot better: "we also believe that significant gains could be achieved by using GPT-4 to generate the synthetic data instead of GPT-3.5, as we noticed that GPT-3.5 data has a high error rate." But these costs show why they didn’t: at 30 times the price of GPT-3.5, it would cost about $75,000 to generate the synthetic portion of CodeTextbook and CodeExercises with GPT-4.

Conclusion

The results from Textbooks Are All You Need are very impressive, especially given the smaller size of the models and the limited training data they were given. This paper is one more piece of evidence that data quality can make up for data quantity and model size.

The discussion around synthetic data will undoubtedly persist. The concept is appealing – if we don’t have high-quality data readily available, could we just synthesize it? Textbooks Are All You Need teases some promising possibilities in this area. Still, it’s not the perfect experiment we might dream of, given that only about 1B of the 7B tokens in CodeTextbook were synthetically created. But it’s worth pointing out that the other 6B tokens were filtered by a classifier trained on model-generated quality labels.

Training on entirely synthetic data has shown some exciting results in the field of image processing. The Google Research study StableRep: Synthetic Images from Text-to-Image Models Make Strong Visual Representation Learners trains visual representation models entirely on synthetic images produced by Stable Diffusion, and the representations they report match or surpass those learned from the corresponding real images.

A similar approach was taken in the TinyStories paper, which relied only on synthetic data for training, but the models it trained were very small. What if larger language models were trained in the same way? The potential this presents is exciting, and it will no doubt be the focus of numerous studies in the future.

References

Chen, M., Tworek, J., Jun, H., Yuan, Q., de Oliveira Pinto, H. P., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., Ryder, N., Pavlov, M., Power, A., Kaiser, L., Bavarian, M., Winter, C., Tillet, P., Such, F. P., Cummings, D., Plappert, M., Chantzis, F., Barnes, E., Herbert-Voss, A., Guss, W. H., Nichol, A., Paino, A., Tezak, N., Tang, J., Babuschkin, I., Balaji, S., Jain, S., Saunders, W., Hesse, C., Carr, A. N., Leike, J., Achiam, J., Misra, V., Morikawa, E., Radford, A., Knight, M., Brundage, M., Murati, M., Mayer, K., Welinder, P., McGrew, B., Amodei, D., McCandlish, S., Sutskever, I., and Zaremba, W. (2021). Evaluating large language models trained on code. arXiv:2107.03374.

Eldan, R. and Li, Y. (2023). TinyStories: How small can language models be and still speak coherent English? arXiv:2305.07759.

Gudibande, A., Wallace, E., Snell, C., Geng, X., Liu, H., Abbeel, P., Levine, S., and Song, D. (2023). The false promise of imitating proprietary LLMs. arXiv:2305.15717.

Gunasekar, S., Zhang, Y., Aneja, J., Mendes, C. C. T., Giorno, A. D., Gopi, S., Javaheripi, M., Kauffmann, P., de Rosa, G., Saarikivi, O., Salim, A., Shah, S., Behl, H. S., Wang, X., Bubeck, S., Eldan, R., Kalai, A. T., Lee, Y. T., and Li, Y. (2023). Textbooks are all you need. arXiv:2306.11644.

Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., de Las Casas, D., Hendricks, L. A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., van den Driessche, G., Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen, E., Rae, J. W., Vinyals, O., and Sifre, L. (2022). Training compute-optimal large language models. arXiv:2203.15556.

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. (2020). Scaling laws for neural language models. arXiv:2001.08361.

Tian, Y., Fan, L., Isola, P., Chang, H., and Krishnan, D. (2023). StableRep: Synthetic images from text-to-image models make strong visual representation learners. arXiv:2306.00984.

Zhou, C., Liu, P., Xu, P., Iyer, S., Sun, J., Mao, Y., Ma, X., Efrat, A., Yu, P., Yu, L., Zhang, S., Ghosh, G., Lewis, M., Zettlemoyer, L., and Levy, O. (2023). LIMA: Less is more for alignment. arXiv:2305.11206.

