How much data is sufficient to learn high-performing algorithms?

Tue July 20th 2021, 4:30pm
Ellen Vitercik, Carnegie Mellon University

Algorithms often have tunable parameters that have a considerable impact on their runtime and solution quality. A growing body of research has demonstrated that data-driven algorithm design can lead to significant gains in runtime and solution quality. Data-driven algorithm design uses a training set of problem instances sampled from an unknown, application-specific distribution and returns a parameter setting with strong average performance on the training set. We provide a broadly applicable theory for deriving generalization guarantees for data-driven algorithm design, which bound the difference between the algorithm's expected performance and its average performance over the training set.
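As a rough illustration of this setup (not the speaker's implementation), the sketch below tunes one parameter of a greedy knapsack heuristic on sampled training instances, then compares its training average with its average on fresh instances. The heuristic, the parameter name `rho`, the uniform instance distribution, and the candidate grid are all assumptions chosen for the example.

```python
import random

def greedy_knapsack(values, sizes, capacity, rho):
    """Pack items in decreasing order of value / size**rho; return the total value packed."""
    order = sorted(range(len(values)),
                   key=lambda i: values[i] / sizes[i] ** rho,
                   reverse=True)
    total, remaining = 0.0, capacity
    for i in order:
        if sizes[i] <= remaining:
            total += values[i]
            remaining -= sizes[i]
    return total

def sample_instance(rng, n_items=10, capacity=20.0):
    """Hypothetical instance distribution: uniform values and sizes."""
    values = [rng.uniform(1, 10) for _ in range(n_items)]
    sizes = [rng.uniform(1, 10) for _ in range(n_items)]
    return values, sizes, capacity

def average_performance(instances, rho):
    """Average value packed by the heuristic over a set of instances."""
    return sum(greedy_knapsack(v, s, c, rho) for v, s, c in instances) / len(instances)

def tune(training_set, candidates):
    """Return the candidate parameter with the best average training performance."""
    return max(candidates, key=lambda rho: average_performance(training_set, rho))

rng = random.Random(0)
training_set = [sample_instance(rng) for _ in range(200)]
fresh_set = [sample_instance(rng) for _ in range(2000)]  # stand-in for the expectation
candidates = [k / 10 for k in range(31)]                 # rho in {0.0, 0.1, ..., 3.0}

best_rho = tune(training_set, candidates)
train_avg = average_performance(training_set, best_rho)
fresh_avg = average_performance(fresh_set, best_rho)
print(f"rho={best_rho}, train={train_avg:.2f}, fresh={fresh_avg:.2f}, "
      f"gap={abs(train_avg - fresh_avg):.2f}")
```

The printed gap is an empirical stand-in for the quantity a generalization guarantee bounds: the difference between average performance on the training set and expected performance on the underlying distribution.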

The challenge is that for many combinatorial algorithms, performance is a volatile function of the parameters: slightly perturbing the parameters can cause a cascade of changes in the algorithm's behavior. Prior research has proved generalization bounds through case-by-case analyses of parameterized greedy algorithms, clustering algorithms, integer programming algorithms, and selling mechanisms. We uncover a unifying structure that yields very general guarantees while still recovering the bounds from prior research. Our guarantees apply whenever an algorithm's performance is a piecewise-constant, piecewise-linear, or, more generally, piecewise-structured function of its parameters. As we demonstrate, our theory also implies novel bounds for dynamic programming algorithms used in computational biology and voting mechanisms from economics.
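To make the piecewise-constant case concrete, here is a minimal sketch using a hypothetical greedy knapsack heuristic that ranks items by value / size**rho. This heuristic and the toy instance are assumptions for illustration, not taken from the talk. The greedy order can only change at values of rho where two items tie, so on a fixed instance the performance is constant between consecutive tie points.

```python
import math
from itertools import combinations

def greedy_knapsack(values, sizes, capacity, rho):
    """Pack items in decreasing order of value / size**rho; return the total value packed."""
    order = sorted(range(len(values)),
                   key=lambda i: values[i] / sizes[i] ** rho,
                   reverse=True)
    total, remaining = 0.0, capacity
    for i in order:
        if sizes[i] <= remaining:
            total += values[i]
            remaining -= sizes[i]
    return total

def breakpoints(values, sizes):
    """Values of rho at which two items have equal scores.

    Solving v_i / s_i**rho == v_j / s_j**rho gives
    rho = ln(v_i / v_j) / ln(s_i / s_j), defined when s_i != s_j.
    """
    points = set()
    for i, j in combinations(range(len(values)), 2):
        if sizes[i] != sizes[j]:
            points.add(math.log(values[i] / values[j]) / math.log(sizes[i] / sizes[j]))
    return sorted(points)

def pieces(values, sizes, capacity, lo, hi):
    """Split [lo, hi] at the breakpoints and evaluate performance at each midpoint."""
    cuts = [p for p in breakpoints(values, sizes) if lo < p < hi]
    edges = [lo] + cuts + [hi]
    return [((a, b), greedy_knapsack(values, sizes, capacity, (a + b) / 2))
            for a, b in zip(edges, edges[1:])]

# Toy instance (hypothetical numbers).
values = [10.0, 6.0, 4.0, 1.0]
sizes = [8.0, 4.0, 2.0, 1.0]
capacity = 8.0

for (a, b), perf in pieces(values, sizes, capacity, 0.0, 3.0):
    print(f"rho in ({a:.3f}, {b:.3f}): value packed = {perf}")
```

With n items there are at most n(n-1)/2 such tie points, so the number of pieces is small even though the performance can jump abruptly when rho crosses one of them.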

This talk is based on joint work with Nina Balcan, Dan DeBlasio, Travis Dick, Carl Kingsford, and Tuomas Sandholm.

Zoom Recording [SUNet/SSO authentication required]