Scaling language models under a data constraint

Tue July 16th 2024, 4:30pm
Sloan 380C
Zitong Yang, Stanford Statistics

Improvements to language models via scaling have relied upon ever-larger pretraining datasets collected from the internet. However, this paradigm has its limits: the large but finite amount of internet data is rapidly being consumed, and a significant fraction of the world's written information lies in proprietary and domain-specific texts that are too small for language model pretraining.

In this work, we study the problem of data-constrained scaling: we aim to create language models that can operate in compute-rich, data-constrained regimes and acquire knowledge from as little data as a few books. With so little data, standard language model training objectives yield little to no learning; we show that a data augmentation approach based on synthetic data enables strong scaling and high-performance models. The core idea of our approach is to extract salient entities from the source text and then generate diverse synthetic text by drawing connections between different entities.
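As a rough illustration of the entity-based idea described above, the following sketch extracts entities from a small source text and builds one generation prompt per entity pair. It is not the method presented in the talk: the capitalized-word extractor, the prompt wording, and the names `extract_entities`, `build_prompts`, `augment`, and `generate` are all assumptions made for illustration, and `generate` stands in for whatever text generator one would actually use.

```python
import itertools
import re
from typing import Callable, List, Tuple


def extract_entities(text: str) -> List[str]:
    """Toy stand-in for entity extraction (an assumption, not the talk's method):
    unique capitalized words, in order of first appearance."""
    seen: List[str] = []
    for word in re.findall(r"\b[A-Z][a-z]+\b", text):
        if word not in seen:
            seen.append(word)
    return seen


def build_prompts(text: str, entities: List[str]) -> List[Tuple[Tuple[str, str], str]]:
    """Build one prompt per unordered entity pair, asking how the two relate."""
    prompts = []
    for a, b in itertools.combinations(entities, 2):
        # Hypothetical prompt wording, chosen only for this sketch.
        prompt = f"Using only the text below, explain how {a} relates to {b}.\n\n{text}"
        prompts.append(((a, b), prompt))
    return prompts


def augment(text: str, generate: Callable[[str], str]) -> List[str]:
    """Return one synthetic passage per entity pair.
    `generate` is a hypothetical callable wrapping any text generator."""
    entities = extract_entities(text)
    return [generate(prompt) for _, prompt in build_prompts(text, entities)]
```

The number of prompts grows quadratically with the number of entities, which is one way a small corpus could yield a much larger and more varied synthetic corpus.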

Empirically, we show that our method enables the model to learn from a small corpus in three different settings: books Q&A, summarization, and online lecture Q&A. In these settings, we demonstrate that our approach scales favorably with no additional compute cost at inference time. Mathematically, we build a simple graph-based model that captures the core intuition behind our algorithm and show numerically that it delivers meaningful scaling trends that are predictive of the behavior of our method.

This talk is based on joint work with Neil Band, Emmanuel Candès, and Tatsunori Hashimoto.