Pythonic statistics

Date

Tue March 29th 2022, 4:30pm

Location

Sequoia 200

Speaker

Dennis Sun, Cal Poly & Google

Python is quickly becoming a standard language for data science. In this talk, I will explore why, and also discuss some of Python's shortcomings from a statistician's perspective. I will then discuss several open-source software projects that I have worked on to make Python more friendly to statisticians.

The first project, Symbulate, is a library for specifying probability simulations. It treats random variables and random processes as first-class citizens that can be composed using standard operations. I will demonstrate how we have used Symbulate in the classroom to reinforce probability concepts and correct common misconceptions. This is joint work with Kevin Ross.
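As a rough illustration of what "first-class" random variables composed with standard operations could look like, here is a minimal toy sketch in plain Python. It is not Symbulate's actual API: the `ProbabilitySpace` and `RV` classes, the `sim` method, and the dice example are all hypothetical names invented for this sketch. The idea shown is only that a random variable can be modeled as a function on the outcomes of a probability space, so that `+` and `*` build new random variables.

```python
import operator
import random


class ProbabilitySpace:
    """Hypothetical toy class: wraps a function that draws one outcome from an RNG."""

    def __init__(self, draw):
        self.draw = draw


class RV:
    """Hypothetical toy random variable: a function applied to outcomes of a space."""

    def __init__(self, space, fn=lambda outcome: outcome):
        self.space = space
        self.fn = fn

    def _combine(self, other, op):
        # Combining two RVs requires a shared probability space, so both
        # functions are evaluated on the same outcome.
        if isinstance(other, RV):
            if other.space is not self.space:
                raise ValueError("RVs must be defined on the same probability space")
            return RV(self.space, lambda w: op(self.fn(w), other.fn(w)))
        # Otherwise treat `other` as a constant.
        return RV(self.space, lambda w: op(self.fn(w), other))

    def __add__(self, other):
        return self._combine(other, operator.add)

    def __mul__(self, other):
        return self._combine(other, operator.mul)

    def sim(self, n, seed=0):
        """Simulate n realizations by drawing outcomes and applying the function."""
        rng = random.Random(seed)
        return [self.fn(self.space.draw(rng)) for _ in range(n)]


# Outcome = a pair of fair six-sided die rolls.
two_dice = ProbabilitySpace(lambda rng: (rng.randint(1, 6), rng.randint(1, 6)))
first = RV(two_dice, lambda w: w[0])
second = RV(two_dice, lambda w: w[1])

total = first + second    # a new random variable, built with ordinary "+"
doubled = first + first   # same outcome used twice, so every value is even

samples = total.sim(10_000)
mean_total = sum(samples) / len(samples)
```

Because both operands are evaluated on the same outcome, `first + first` behaves like twice the first roll rather than the sum of two independent rolls, which is one of the distinctions a simulation-based treatment of probability can make concrete.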

The second project, Meterstick, is a library built on top of a "grammar of data analysis". It is a declarative language for specifying the kinds of data science analyses that arise frequently in industry. I will discuss some of the technical challenges, especially in translating this grammar into backend languages like SQL. This is joint work with Xunmo Yang, Taylor Pospisil, and Omkar Muralidharan.
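To make the idea of a declarative analysis specification that is translated into SQL more concrete, here is a small toy sketch in plain Python. It is not Meterstick's actual API or grammar: the `Metric` class and the `sum_of` and `compile_sql` names are hypothetical, as are the `events` table and its columns. The sketch only shows the general pattern of describing a metric as an object and generating a query string from it.

```python
class Metric:
    """Hypothetical toy metric: a SQL aggregate expression plus an output column name."""

    def __init__(self, expr, name):
        self.expr = expr
        self.name = name

    def __truediv__(self, other):
        # Composing two metrics yields a new metric describing their ratio.
        return Metric(
            f"({self.expr}) / ({other.expr})",
            f"{self.name}_per_{other.name}",
        )

    def compile_sql(self, table, split_by=()):
        """Translate the metric description into a single SQL query string."""
        columns = list(split_by) + [f"{self.expr} AS {self.name}"]
        query = f"SELECT {', '.join(columns)} FROM {table}"
        if split_by:
            query += f" GROUP BY {', '.join(split_by)}"
        return query


def sum_of(column):
    """Hypothetical helper: the sum of a column, described rather than computed."""
    return Metric(f"SUM({column})", f"sum_{column}")


# Describe *what* to compute; the SQL text is generated from the description.
click_rate = sum_of("clicks") / sum_of("impressions")
query = click_rate.compile_sql("events", split_by=["country"])
print(query)
```

Even this toy version hints at where translation gets harder: a real backend has to deal with details such as integer division, name quoting, and analyses that do not reduce to a single aggregate expression.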