Inference with strategic agents and machine-learning predictions: Two modern inferential challenges
First, we consider the relationship between a regulator such as the FDA (the principal) and a pharmaceutical company (the agent). The pharmaceutical company wishes to sell a product to make a profit, and the FDA wishes to ensure that only efficacious drugs are released to the public. The efficacy of the drug is not known to the FDA, so the pharmaceutical company must run a costly trial to prove efficacy to the FDA. Critically, the statistical protocol used to establish efficacy affects the behavior of a strategic, self-interested pharmaceutical company: a lower standard of statistical evidence incentivizes the company to run more trials for drugs that are less likely to be effective. The interaction between the statistical protocol and the incentives of the pharmaceutical company is crucial to understanding this system and designing protocols with high social utility. In this work, we discuss how the principal and agent can enter into a contract with payoffs based on statistical evidence in a way that is robust to strategizing by the agent.
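The incentive effect described above can be illustrated with a toy calculation. The sketch below is a simplified model and not the contract studied in this work: it assumes a risk-neutral agent who knows the prior probability that the drug works, a fixed trial cost, a fixed reward on approval, and a test with a given significance level and power. The function name and all numbers are hypothetical.

```python
def min_prior_to_run_trial(alpha, power, reward, cost):
    """Smallest prior probability of efficacy at which a trial is profitable.

    Toy model (an assumption, not from the source): approval happens with
    probability `power` if the drug works and `alpha` if it does not, so the
    expected profit is reward * (prior*power + (1 - prior)*alpha) - cost.
    A returned value above 1 means the trial is never worth running.
    """
    if power <= alpha:
        raise ValueError("power must exceed alpha for the test to be informative")
    # Solve expected profit >= 0 for the prior, then floor at zero.
    threshold = (cost / reward - alpha) / (power - alpha)
    return max(0.0, threshold)


# Hypothetical numbers: approval is worth 10 units, a trial costs 1 unit.
for alpha in (0.01, 0.05, 0.10):
    t = min_prior_to_run_trial(alpha, power=0.8, reward=10.0, cost=1.0)
    print(f"alpha={alpha:.2f}: agent runs a trial when prior >= {t:.3f}")
```

In this toy model, raising the significance level lowers the prior at which a trial becomes profitable, so a looser evidentiary standard draws in less promising drugs.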
Second, machine-learning algorithms are increasingly used as a component in scientific research. One enticing use case is to partially replace data that are expensive or time-consuming to collect directly with machine-learning predictions. A prominent example is protein structure prediction: researchers have painstakingly measured the structure of tens of thousands of proteins with expensive X-ray crystallography and now have machine-learning predictions for hundreds of millions more. However, this style of data collection presents a new challenge: the machine-learning predictions are sometimes wrong, so how can a researcher draw reliable conclusions? We introduce a framework to give valid confidence intervals and p-values based on data imputed via machine learning. The key principle is to measure the imperfections of the machine-learning predictions and correct the final confidence intervals or p-values accordingly. Our technique applies to many functionals of interest, such as means, quantiles, and linear and logistic regression coefficients.
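For the simplest functional, a mean, the measure-and-correct principle can be sketched as follows. This is a minimal illustration under stated assumptions and not necessarily the exact procedure of the framework: it assumes a small labeled sample with both true values and predictions, a large independent sample with predictions only, and a normal approximation for the interval. The function name and the synthetic data are hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean, variance


def corrected_mean_ci(labels, preds_labeled, preds_unlabeled, alpha=0.05):
    """Confidence interval for a mean using predictions plus a bias correction."""
    n, big_n = len(labels), len(preds_unlabeled)
    # Measure the prediction error on the sample where the truth is known.
    residuals = [y - f for y, f in zip(labels, preds_labeled)]
    # Mean of the predictions, shifted by the average measured error.
    estimate = fmean(preds_unlabeled) + fmean(residuals)
    # The two samples are assumed independent, so their variances add.
    se = math.sqrt(variance(preds_unlabeled) / big_n + variance(residuals) / n)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate - z * se, estimate + z * se


# Synthetic demo: true mean is 2.0, predictions are biased upward by 0.5.
rng = random.Random(0)
draw_truth = lambda: rng.gauss(2.0, 1.0)
predict = lambda y: y + 0.5 + rng.gauss(0.0, 0.3)

labels = [draw_truth() for _ in range(200)]
preds_labeled = [predict(y) for y in labels]
preds_unlabeled = [predict(draw_truth()) for _ in range(20000)]

lo, hi = corrected_mean_ci(labels, preds_labeled, preds_unlabeled)
naive = fmean(preds_unlabeled)  # ignores prediction error entirely
print(f"naive mean of predictions: {naive:.3f}")
print(f"corrected 95% interval: ({lo:.3f}, {hi:.3f})")
```

The naive average inherits the bias of the predictions, while the corrected interval is centered on an estimate that subtracts the error measured on the labeled sample, and its width accounts for the uncertainty in that measurement.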