CSEE4121 - Computer Systems for Data Science
Spring ’25, Columbia University
Data scientists and engineers increasingly have access to a broad range of
powerful systems for conducting big data analysis and machine learning at
scale, from databases and large-scale analytics to distributed machine learning
frameworks.
The goal of this class is to give data scientists and engineers who work
with big data a better understanding of how the systems they will be using
are built. It will also give them a better understanding of the real-world
performance, availability, and scalability challenges of using and deploying
these systems at scale. The course covers foundational ideas in designing
these systems, while focusing on specific popular systems that students are
likely to encounter at work or in research. The class will include two written
homework assignments and two programming assignments. All of the assignments
will be done individually. In this course we will answer the following questions:
The class will be split into two sections with identical content, both delivered by the same lecturer (Asaf Cidon) and served by the same TA team. The class will also be recorded, and physical attendance is not required.
Asaf Cidon
Students are expected to have solid programming experience in Python. This class is intended to be accessible to data scientists who do not necessarily have a background in databases, operating systems, or distributed systems.
Section 1: Thursdays 10:10 AM – 12:40 PM
Section 2: Thursdays 1:10 PM – 3:40 PM
5% Programming Homework 1 (SQL)
5% Written Homework 1
10% Programming Homework 2 (indexing and filtering data structures)
5% Written Homework 2
20% “Take Home” Midterm
55% In Person Final
Late submissions are not accepted and will receive a grade of 0. You will have plenty of time to submit your assignments, so exercise proper time management.
All assignments will be done individually. We will enforce this policy when checking the assignments (we use a code similarity system).
No textbook.
Week | Topic | Homework |
---|---|---|
1 | Introduction | |
2 | Infrastructure for Big Data | Programming HW 1 out |
3 | Relational Data Model | |
4 | Transactions and Logging | Written HW 1 out |
5 | Storage/memory hierarchy | Programming HW 1 due |
6 | Indexing | Written HW 1 due, Programming HW 2 out |
7 | Midterm | |
8 | Challenges in Scaling | |
9 | Spring Break | |
10 | Analytics | Programming HW 2 due |
11 | ML Single Node | Written HW 2 out |
12 | Distributed ML | |
13 | Security and Privacy | |
14 | Guest Lecture: Junaid Ahmed, VP Engineering, Observability, Datadog (5:30 PM for both sections; in person, live on Zoom, and recorded) | Written HW 2 due |
15 | Final Exam | |