CSEE4121 - Computer Systems for Data Science

Spring ’26, Columbia University


Course Overview

Data scientists and engineers increasingly have access to a broad range of powerful systems for big data analysis and machine learning at scale, from databases and large-scale analytics to distributed machine learning frameworks.

The goal of this class is to give data scientists and engineers who work with big data a better understanding of how the systems they will be using are built. It will also give them a better understanding of the real-world performance, availability, and scalability challenges of using and deploying these systems at scale. The course covers foundational ideas in designing these systems while focusing on specific popular systems that students are likely to encounter at work or in research. The class will include two written homeworks and two programming assignments, all of which will be done individually. In this course we will answer the following questions:

This class will be recorded, and there is no requirement for physical attendance.

Instructor

Waqar Aqeel

TAs

See course calendar for office hour schedule.

Ed

Link

Prerequisites

Students are expected to have solid programming experience in Python. This class is intended to be accessible for data scientists who do not necessarily have a background in databases, operating systems or distributed systems.

Time

Wednesdays 4:10 PM – 6:40 PM

Exams

Midterm: March 4th, 2026, 4:10 PM - 6:40 PM (Study guide)
Final: April 29th, 2026, 4:10 PM - 6:40 PM

Grade Breakdown

5% Homework 1 (Programming - SQL)
5% Homework 2 (Written)
10% Homework 3 (Programming - indexing and filtering data structures)
5% Homework 4 (Written)
25% In-Person Midterm
50% In-Person Final

Strict Late Submission Policy

Late submissions are not accepted and will receive a grade of 0. You will have plenty of time to submit your assignments, so exercise proper time management.

Collaboration/Copying Policy

All assignments will be done individually. We will enforce this policy when checking the assignments (we use a code similarity system).

Course Materials

There is no textbook. Slides and assignments will be uploaded here.

Schedule (this is a work in progress, and is likely to change)

Course Calendar

Lecture stream

Week  Topic                                           Homework
1     Introduction (Slides)
2     Infrastructure for Big Data (Slides, Video)
3     Relational Data Model (Slides, Demo, Video)     Homework 1 out
4     Transactions, Petflix (Slides, Podcast, Video)
5     Database techniques, Networking (Slides)        Homework 2 out
6     Networking, Partitioning, 2PC (Slides, Video)   Homework 1 due
7     Midterm (Video)                                 Homework 2 due
8     MapReduce and Spark (Slides, Video)
9     Spring Break
10    Spark, Security and Privacy (Slides, Video)     Homework 3 out
11    Security and Privacy (Slides, Video)
12    Single node ML (Slides, Colab, Video)           Homework 4 out
13    Distributed ML
14    Distributed ML                                  Homework 3 due
15    Final Exam                                      Homework 4 due