Misclassification: The hidden source of hallucinations in language models

Date
Tuesday, April 29, 2025, 4:30 pm
Location
CoDa E160
Speaker
Adam Kalai, OpenAI

Pretrained language models (LMs) generate plausible-sounding errors, so-called hallucinations, at surprisingly high rates in certain domains. We cast the problem of generating valid responses as a binary classification task: distinguishing acceptable outputs from plausible but erroneous ones. Even in standard supervised learning, where both positive and negative examples are available, some classification errors are inevitable; LMs pretrained on positive examples alone should therefore be at least as error-prone (though post-training may mitigate these errors).
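
As an illustration of this classification framing (a sketch only, not the construction from the talk), one can think of a validity classifier over candidate responses; its misclassification rate is the quantity that the results below relate to hallucination rates. The function names and toy data here are hypothetical.

# Illustrative sketch: treating "is this response acceptable?" as binary
# classification. The classifier and labeled data are hypothetical placeholders.
from typing import Callable, Iterable, Tuple

def misclassification_rate(
    is_acceptable: Callable[[str], bool],           # classifier: True = predicted acceptable
    labeled_responses: Iterable[Tuple[str, bool]],  # (response, ground-truth acceptability)
) -> float:
    """Fraction of responses on which the classifier disagrees with the label."""
    pairs = list(labeled_responses)
    errors = sum(is_acceptable(r) != label for r, label in pairs)
    return errors / len(pairs)

# Toy usage with a crude keyword-based classifier (purely illustrative):
toy_data = [("Paris is the capital of France.", True),
            ("Paris is the capital of Italy.", False)]
print(misclassification_rate(lambda r: "France" in r, toy_data))  # 0.0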

Drawing on familiar concepts from statistics and machine learning (e.g., sample complexity, computational constraints, and noisy data), we show that these classical causes of misclassification likewise drive hallucinations. Formally, we prove two results. First, we establish a lower bound: the hallucination rate is at least roughly twice the misclassification rate. Second, we apply this bound to the case of "independent facts" and show that the hallucination rate is at least the fraction of facts that appear exactly once in the training data. Both theorems assume only a minimal "weak calibration" property that pretrained LMs should satisfy (so post-training mitigations that reduce these errors must therefore also violate weak calibration).
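
The quantity in the second result is, concretely, the singleton fraction of the training facts (a Good-Turing-style estimate). Below is a minimal sketch of computing that fraction, under the assumption that the training data can be viewed as a list of atomic facts; the helper name and toy data are not from the talk.

# Illustrative sketch: the fraction of training observations whose fact appears
# exactly once, i.e., the quantity lower-bounding the hallucination rate in the
# "independent facts" setting. Names and data are hypothetical.
from collections import Counter

def singleton_fraction(training_facts: list[str]) -> float:
    """Number of facts seen exactly once, divided by the number of observations."""
    counts = Counter(training_facts)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(training_facts)

# Toy example: 3 of the 5 observed facts appear exactly once, so the
# sketched lower bound on the hallucination rate would be 0.6.
facts = ["A born 1950", "B born 1951", "C born 1952",
         "D born 1953", "D born 1953"]
print(singleton_fraction(facts))  # 0.6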

This is joint work with Santosh Vempala.