# Statistics Data Science Curriculum (2024-25)

This focused MS track is developed within the structure of the current MS in Statistics and reflects new trends in data science and analytics. Upon successful completion of the Data Science MS degree, students will be prepared to continue to a related doctoral program or to work as data science professionals in industry. Completing the MS degree is not a direct path for admission to the PhD program in Statistics.

*This program is not an online degree program.*

## Coursework

The Data Science track develops strong mathematical, statistical, computational and programming skills, and provides a fundamental data science education through general and focused elective requirements drawn from courses in data science and other areas of interest.

As defined in the general Graduate Student Requirements, students must maintain a grade point average (GPA) of 3.0 or better, and classes must be taken at the 200 level or higher. Students satisfying the course requirements of the Data Science track do not satisfy the other course requirements for the MS in Statistics.

The total number of units in the degree is 45, the majority of which must be taken for a letter grade.

Students must submit an approved Master's Program Proposal, signed by the master's advisor, to the student services officer by the end of the first quarter of the master's degree program. A revised program proposal must be filed whenever there are changes to a student's previously approved program proposal.

There is no thesis requirement.

### Prerequisites

Stronger preparation is assumed, so students have more time to focus on computing and large-scale methods.

- Multivariable calculus and linear algebra at the level of MATH 51
- Programming at the level of CS 106B
- Intermediate statistics (multiple regression and ANOVA, possibly without linear algebra) at the level of STATS 191
- Introductory probability at the level of STATS 118

Prerequisites are enforced for undergraduate students. Anterequisites are enforced for all students.

## 2024-25 Data Science Curriculum

Students must demonstrate breadth of knowledge in the field by completing courses in these core areas.

- **Statistics core courses** (16 units, letter graded)
- **Computational methods for data science** (6 units, letter graded)
- **Applied machine learning & AI** (9 units, letter graded)
- **Data Science electives**
- **Practical requirement** (3 units, may be taken pass/fail)

### Statistics Core Courses - 5 courses

**(16 units; letter graded)**

**STATS 200: Introduction to Statistics Theory** (4 units)
- Advanced students may replace this with STATS 300A.

**STATS 203: Regression Models and ANOVA** (3 units)
- Advanced students may replace this with STATS 305A.

**STATS 209: Introduction to Causal Inference** or **STATS 263: Design of Experiments**
- Advanced students may replace this with ECON 272 or CS 288.

**STATS 217: Introduction to Stochastic Processes I**
- Advanced students may replace this with STATS 218 or 219.

**STATS 202/216: Introduction to Statistical Learning** (3 units)
- Advanced students may replace this with STATS 315A.

It is strongly recommended to take these classes in the following sequence during Year 1:

1. Autumn – STATS 200

2. Winter – STATS 203 + STATS 217

### Computational Methods for Data Science - 2 courses

**(6 units, letter graded)**

Students should take two classes from this menu. Students may also petition to use other classes that are focused on optimization, scientific computing and/or large-scale data analyses for this requirement.

**CME 213: Introduction to parallel computing using MPI, OpenMP, and CUDA**

This class will give hands-on experience with programming multicore processors, graphics processing units (GPU), and parallel computers. The focus will be on the message passing interface (MPI, parallel clusters) and the compute unified device architecture (CUDA, GPU). Topics will include multithreaded programs, GPU computing, computer cluster programming, C++ threads, OpenMP, CUDA, and MPI. **Pre-requisites include C++, templates, debugging, UNIX, makefile, numerical algorithms (differential equations, linear algebra). Terms: Spr | Units: 3**

**CME 302: Numerical Linear Algebra**

Solution of linear systems, accuracy, stability, LU, Cholesky, QR, least squares problems, singular value decomposition, eigenvalue computation, iterative methods, Krylov subspace, Lanczos and Arnoldi processes, conjugate gradient, GMRES, direct methods for sparse matrices. **Prerequisites: CME 108/Math 114 and one of Math 104 or Math 113. Terms: Aut | Units: 3**

**CME/EE 364A: Convex Optimization I**

Convex sets, functions, and optimization problems. The basics of convex analysis and theory of convex programming: optimality conditions, duality theory, theorems of alternative, and applications. Least-squares, linear and quadratic programs, semidefinite programming, and geometric programming. Numerical algorithms for smooth and equality constrained problems; interior-point methods for inequality constrained problems. Applications to signal processing, communications, control, analog and digital circuit design, computational geometry, statistics, machine learning, and mechanical engineering. **Prerequisite: linear algebra such as EE263, basic probability. Terms: Win | Units: 3**

**CS 229S: Systems for Machine Learning**

Deep learning and neural networks are being increasingly adopted across industries. They are now used to serve billions of users across applications such as search, knowledge discovery, and productivity assistants. As models become more capable and intelligent, this trend of large-scale adoption will continue to grow rapidly. **Terms: Aut**

**CS 245: Principles of Data-Intensive Systems**

Most important computer applications need to reliably manage and manipulate datasets. This course covers the architecture of modern data storage and processing systems, including relational databases, cluster computing frameworks, streaming systems and machine learning systems. Topics include storage management, query optimization, transactions, concurrency, fault recovery, and parallel processing, with a focus on the key design ideas shared across many types of data-intensive systems. **Prerequisites: CS 145, 161.** (Last offered Winter 2022)

**CS 246: Mining Massive Data Sets**

The availability of massive datasets is revolutionizing science and industry. This course discusses data mining and machine learning algorithms for analyzing very large amounts of data. Topics include: Big data systems (Hadoop, Spark); Link Analysis (PageRank, spam detection); Similarity search (locality-sensitive hashing, shingling, min-hashing); Stream data processing; Recommender Systems; Analysis of social-network graphs; Association rules; Dimensionality reduction (UV, SVD, and CUR decompositions); Algorithms for large-scale mining (clustering, nearest-neighbor search); Large-scale machine learning (decision tree ensembles); Multi-armed bandit; Computational advertising. **Prerequisites: At least one of CS107 or CS145. Terms: Win | Units: 3-4**

**CS 273C: Cloud Computing for Biology and Healthcare**

Big Data is radically transforming healthcare. To provide real-time personalized healthcare, we need hardware and software solutions that can efficiently store and process large-scale biomedical datasets. In this class, students will learn the concepts of cloud computing and parallel systems architecture. This class prepares students to understand how to design parallel programs for computationally intensive medical applications and how to run these applications on computing frameworks such as cloud computing and high-performance computing (HPC) systems. **Prerequisites: familiarity with programming in Python and R. Terms: Spr | Units: 3**

**MS&E 236/CS 225: Machine Learning for Discrete Optimization**

Machine learning has become a powerful tool for discrete optimization. This is because, in practice, we often have ample data about the application domain, data that can be used to optimize algorithmic performance, ranging from runtime to solution quality. This course covers how machine learning can be used within the discrete optimization pipeline from many perspectives, including how to design novel combinatorial algorithms with machine-learned modules and configure existing algorithms' parameters to optimize performance. Topics will include both applied machinery (such as graph neural networks and reinforcement learning) as well as theoretical tools for providing provable guarantees. **Terms: Spr**

### Applied Machine Learning & AI - 3 courses

**(9 units, letter graded)**

Students should take 3 classes from this menu. Students may also petition to use other classes that fit the requirement.

- **STATS 220: Machine Learning Methods for Neural Data Analysis**
- **STATS 232: Machine Learning for Sequence Modeling**
- **STATS 256: Modern Statistics for Modern Biology (TBD)**
- **STATS 315A: Modern Applied Statistics: Learning**
- **STATS 315B: Modern Applied Statistics: Learning II (TBA)**
- **CS 221: Artificial Intelligence: Principles and Techniques**
- **CS 224N: Natural Language Processing with Deep Learning**
- **CS 224R: Deep Reinforcement Learning**
- **CS 224W: Machine Learning with Graphs**
- **CS 228: Probabilistic Graphical Models: Principles and Techniques**
- **CS 229/STATS 229: Machine Learning**
- **CS 234: Reinforcement Learning**
- **CS 235: Computational Methods for Biomedical Image Analysis and Interpretation**
- **CS 236: Deep Generative Models**

### Data Science Electives

**(9 units, may be taken pass/fail)**

In consultation with the student's program advisor, the student selects courses within the realm of data science to fulfill the remaining coursework required for the degree (200+ level).

### Practical Component or Capstone Project

Students are required to take a **minimum of 3 units** of practical component, which may include any combination of:

A capstone project, supervised by a faculty member and approved by the student's advisor. The capstone project should be computational in nature. Students should submit a one-page proposal, supported by the faculty member and sent to the student's Data Science advisor for approval (at least one quarter prior to the start of the project).

**Statistical Consulting Workshop (STATS 390) (repeatable)**
- This class requires mastery of Statistics at the (graduate) level necessary to provide consultation to fellow members of the university.
- Students attend weekly lectures on Fridays to discuss consulting cases and various statistical techniques that arise frequently in consulting.

**Consulting Workshop on Biomedical Data Science (BIODS 232)**

**Xplore Projects (CME 291)**
- Units: 3 | Repeatable 2 times (up to 6 units total)
- Enrollment by application only.

**Stanford ML group - AI for Health Care Bootcamp (CS 199, CS 399, etc.)**

**Independent Study (STATS 299)**
- In consultation with your advisor, independent study/directed reading with permission of statistics faculty (repeatable).

**Research in The Computational Neuroscience Laboratory (PSYCH 399)**
- Enrollment by application only.
- Students interested in research at the intersection of Artificial Intelligence, Statistics, Medical Image Analysis, and Neuroscience are invited to collaborate with the group on multiple projects. The research can start with an independent study in the CNS lab (part of the School of Medicine). The projects are mainly unpaid and are done in exchange for independent research study credits.

### Data Science Proposal Form

This is a formal agreement with the department and your faculty advisor to fulfill the degree requirements with a proposed course list. The proposal form may be revised before the student's final quarter.

The program proposal should represent a coherent program of study that includes components to synthesize material covered and to allow for some degree of depth. Depending on the field of study and department interests, such a component could be a project, a long paper, a final examination, a sequencing of course work or seminars, or a research requirement.

Reach out to your faculty advisor to discuss questions or concerns about course placement, content, or career direction while completing your proposal form, and at any point during your time at Stanford.

The program proposal must be approved by your advisor and program officer (cgates [at] stanford.edu) before the end of your first quarter.

Substitutions of required courses must be approved by the student's faculty advisor before the planned enrollment quarter.

Students should send their final program form to the department/program SSO (cgates [at] stanford.edu) in the quarter before they can be cleared for degree conferral.