Generalized data thinning using sufficient statistics

Date: Tue October 1st 2024, 4:30pm
Location: Sloan 380Y
Speaker: Jacob Bien, USC

Sample splitting is one of the most tried-and-true tools in the statistician's toolbox. It breaks a data set into two independent parts, allowing one to perform valid inference after an exploratory analysis or after training a model. Several recent papers (Rasines & Young 2022, Leiner et al. 2022, Neufeld et al. 2023) have provided alternatives to sample splitting, which the authors show to be attractive in situations where sample splitting is not possible. Their methods proceed very differently from sample splitting, and yet, remarkably, they also produce two statistically independent data sets from the original. In this talk, we set out to better understand what makes these approaches work and to clarify how broadly such approaches may apply. In particular, we show that sufficiency is the key underlying principle enabling these new approaches. This insight leads naturally to a new framework, which we call generalized data thinning. This generalization unifies traditional sample splitting with these new approaches, revealing them all to be applications of the same procedure. Finally, the framework reveals situations in which splitting into independent parts is impossible.
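
The abstract does not include a worked example. As a rough illustration of what it means to produce two independent data sets from a single observation, the sketch below uses Poisson thinning, a standard special case chosen here for illustration rather than taken from the talk. If X ~ Poisson(lam) and X1 given X is Binomial(X, eps), then X1 and X2 = X - X1 are independent Poisson variables with means eps * lam and (1 - eps) * lam. The function names, the seed, and the values of lam and eps are all assumptions made for this example.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method (suitable for small lam)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def thin_poisson(x, eps, rng):
    """Split a count x into (x1, x2), where x1 | x ~ Binomial(x, eps) and x2 = x - x1."""
    x1 = sum(1 for _ in range(x) if rng.random() < eps)
    return x1, x - x1


def correlation(a, b):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b)) / n
    var_a = sum((u - mean_a) ** 2 for u in a) / n
    var_b = sum((v - mean_b) ** 2 for v in b) / n
    return cov / math.sqrt(var_a * var_b)


# Illustrative settings (assumptions, not values from the talk).
rng = random.Random(0)
lam, eps, n = 5.0, 0.3, 20000

counts = [sample_poisson(lam, rng) for _ in range(n)]
parts = [thin_poisson(x, eps, rng) for x in counts]
x1 = [p[0] for p in parts]
x2 = [p[1] for p in parts]

# The two parts always add back up to the original count; their sample means
# should sit near eps * lam and (1 - eps) * lam, and their sample correlation
# should be near zero.
print("mean of X1:", sum(x1) / n)
print("mean of X2:", sum(x2) / n)
print("correlation(X1, X2):", correlation(x1, x2))
```

A near-zero sample correlation is only a sanity check consistent with independence, not a proof of it. The independence itself follows from the Poisson-binomial thinning identity stated above, and it relies on the data actually being Poisson.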