CSEE4121 - Computer Systems for Data Science
Spring ’26, Columbia University
Data scientists and engineers increasingly have access to a powerful and broad
range of systems for big data analysis and machine learning at scale, from
databases and large-scale analytics to distributed machine learning
frameworks.
The goal of this class is to give data scientists and engineers who work
with big data a better understanding of how the systems they will be using
are built. It will also give them a better understanding of the real-world
performance, availability and scalability challenges that arise when using and
deploying these systems at scale. In the course we will cover foundational ideas
in designing these systems, while focusing on specific popular systems that
students are likely to encounter at work or when doing research. The class will
include two written homeworks and two programming assignments. All of the
assignments will be done individually. In this course we will answer the
following questions:
This class will be recorded, and there is no requirement for physical attendance.
See the course calendar for the office hour schedule.
Students are expected to have solid programming experience in Python. This class is intended to be accessible to data scientists who do not necessarily have a background in databases, operating systems or distributed systems.
Wednesdays, 4:10 PM – 6:40 PM
Midterm: March 4th, 2026, 4:10 PM – 6:40 PM (Study guide)
Final: April 29th, 2026, 4:10 PM – 6:40 PM
5% Homework 1 (Programming - SQL)
5% Homework 2 (Written)
10% Homework 3 (Programming - indexing and filtering data structures)
5% Homework 4 (Written)
25% In-Person Midterm
50% In-Person Final
Late submissions are not accepted and will receive a grade of 0. You will have plenty of time to complete each assignment, so manage your time accordingly.
All assignments must be done individually. We enforce this policy when checking the assignments, using a code similarity system.
There is no textbook. Slides and assignments will be uploaded here.
| Week | Topic | Homework |
|---|---|---|
| 1 | Introduction (Slides) | |
| 2 | Infrastructure for Big Data (Slides, Video) | |
| 3 | Relational Data Model (Slides, Demo, Video) | Homework 1 out |
| 4 | Transactions, Petflix (Slides, Podcast, Video) | |
| 5 | Database techniques, Networking (Slides) | Homework 2 out |
| 6 | Networking, Partitioning, 2PC (Slides, Video) | Homework 1 due |
| 7 | Midterm (Video) | Homework 2 due |
| 8 | MapReduce and Spark (Slides, Video) | |
| 9 | Spring Break | |
| 10 | Spark, Security and Privacy (Slides, Video) | Homework 3 out |
| 11 | Security and Privacy (Slides, Video) | |
| 12 | Single node ML (Slides, Colab, Video) | Homework 4 out |
| 13 | Distributed ML | |
| 14 | Distributed ML | Homework 3 due |
| 15 | Final Exam | Homework 4 due |