Veridical data science with a case study to seek genetic drivers of a heart disease

Tue November 15th 2022, 4:30pm
Sloan 380C
Bin Yu, UC Berkeley

"AI is like nuclear energy–both promising and dangerous."
Bill Gates, 2019

Data science is a pillar of AI and has driven most of the recent cutting-edge discoveries in biomedical research and beyond. Human judgment calls arise at every step of the data science life cycle, e.g., in problem formulation and in the choice of data cleaning methods, predictive algorithms, and data perturbations. Such judgment calls are often responsible for the "dangers" of AI.

To mitigate these dangers, this talk introduces a framework based on three core principles: Predictability, Computability, and Stability (PCS). The PCS framework unifies and expands on the ideas and best practices of statistics and machine learning. It emphasizes reality checks through predictability and takes full account of the sources of uncertainty across the whole data science life cycle, including those that stem from human judgment calls such as choices made in data curation and cleaning. PCS consists of a workflow and documentation, and is supported by our software package veridical-flow (v-flow). We illustrate the usefulness of PCS through the development of iterative random forests (iRF) for predictable and stable discovery of non-linear interactions (in collaboration with the Brown Lab at LBNL and Berkeley Statistics). Finally, in the pursuit of genetic drivers of a heart disease called hypertrophic cardiomyopathy (HCM), a CZ Biohub project in collaboration with the Ashley Lab at Stanford Medical School and others, we use iRF and UK Biobank data to recommend gene-gene interaction targets for knock-down experiments. We then analyze the resulting experimental data, which show promising findings about genetic drivers of HCM.