The science of measurement: Interpretability and reward hacking in ML

Tuesday, November 9, 2021, 4:30 PM
SBJC at Stanford / Online
Jacob Steinhardt, UC Berkeley

In machine learning, we are obsessed with datasets and metrics: progress in areas as diverse as natural language understanding, object recognition, and reinforcement learning is tracked by numerical scores on agreed-upon benchmarks. However, other ways of measuring ML models are underappreciated, and can unlock important insights.

In this talk, I'll show how empirical measurements can ground otherwise qualitative discussions, through two case studies: interpretability and reward hacking. For interpretability, we will see that some recently popular methods underperform older baselines when measured systematically. For reward hacking, measurement reveals phase transitions with respect to model size, which could pose challenges for ensuring the safety of RL agents.

Beyond these specific observations, I will tie measurement to historical trends in science, and draw lessons from the success of biology and physics in the mid-20th century.

Zoom Recording [SUNet/SSO authentication required]