How to tell the difference between machine learning and (bio)statistics

Date
Tue April 8th 2025, 4:30pm
Location
CoDa E160
Speaker
Mike Baiocchi, Stanford SoM and Statistics
Jordan Rodu, University of Virginia

We'll start this talk by discussing two studies: (i) a randomized trial evaluating a sexual assault prevention program in Nairobi, Kenya, and (ii) a remote detection operation to find and disrupt labor trafficking in the Amazon rainforest. Both are "data science" projects, but they work in wildly different ways. What makes them so different? For a long time in (bio)statistics we had only two fundamental ways of reasoning with data: warranted reasoning (e.g., randomized trials) and model reasoning (e.g., linear models). In the 1980s a new, extraordinarily productive way of reasoning about algorithms emerged: "outcome reasoning." Outcome reasoning has come to dominate areas of data science, but it has been under-discussed and its impact under-appreciated. In this talk we will discuss its current use (i.e., as "the common task framework") and its limitations. We will then discuss a way to extend this type of reasoning to assessing algorithms for deployment (i.e., "in the real world"). We developed this new framework so that both technical and non-technical people can discuss and identify the key features of their prediction problem.

This talk is based on the speakers' 2023 Observational Studies paper, "When black box algorithms are (not) appropriate".