CSEE4121 - Computer Systems for Data Science

Spring ’25, Columbia University

Course Overview

Data scientists and engineers increasingly have access to a broad range of powerful systems for conducting big data analysis and machine learning at scale, from databases and large-scale analytics engines to distributed machine learning frameworks.

The goal of this class is to give data scientists and engineers who work with big data a better understanding of how the systems they use are built. It will also give them a better understanding of the real-world performance, availability and scalability challenges that arise when using and deploying these systems at scale. The course covers foundational ideas in designing these systems, while focusing on specific popular systems that students are likely to encounter at work or when doing research. The class will include two written homework assignments and two programming assignments. All assignments will be done individually. In this course we will answer the following questions:

How are popular big data systems designed and architected?
How should we think about the performance, scale and reliability of big data systems?
How do these systems remain available and avoid losing data despite frequent server and hardware failures?
How should we reason about issues like security, privacy and data quality when conducting analysis on large data sets?

The class will be split into two sections, which will have identical content, and will be delivered by the same lecturer (Asaf Cidon) and served by the same TA team. The class will also be recorded, and there is no requirement for physical attendance.

Instructor

Asaf Cidon

TAs

Office Hour Calendar

Yuhong Zhong (Head TA): yz@cs.columbia.edu
Triyasha Ghosh Dastidar: tg2936@columbia.edu
Vahab Jabrayilov: vj2267@columbia.edu
Hans Shen: ys3524@columbia.edu
Harry Wang: hw2886@columbia.edu
Tal Zussman: tz2294@columbia.edu

Ed

Link

Prerequisites

Students are expected to have solid programming experience in Python. This class is intended to be accessible for data scientists who do not necessarily have a background in databases, operating systems or distributed systems.

Time

Section 1: Thursdays 10:10 AM – 12:40 PM
Section 2: Thursdays 1:10 PM – 3:40 PM

Final Exam

Date: Friday, May 2, 2025
Time: 10:10 AM – 12:10 PM
Location: International Affairs Building, IAB 417 - Altschul

Grade Breakdown

5% Programming Homework 1 (SQL)
5% Written Homework 1
10% Programming Homework 2 (indexing and filtering data structures)
5% Written Homework 2
20% “Take Home” Midterm
55% In Person Final

Strict Late Submission Policy

There will be no late submissions. Late submissions will receive a grade of 0. You will have plenty of time to submit your assignments, so exercise proper time management.

Collaboration/Copying Policy

All assignments will be done individually. We will enforce this policy when checking the assignments (we use a code similarity system).

Course Materials

No textbook.

Schedule (this is a work in progress, and is likely to change)

Week	Topic	Homework
1	Introduction (1/23, Slides)
2	Infrastructure for Big Data (1/30)
3	Relational Data Model (2/6)	Programming Homework 1 out (2/3)
4	Transactions and Logging (2/13)
5	Storage/memory hierarchy (2/20)	Written homework 1 out
6	Indexing (2/27)
7	Midterm	Written homework 1 due (3/3, 4:59:59 PM), Programming HW 1 due (3/6, 4:59:59 PM)
8	Challenges in Scaling (3/13)
9	Spring Break
10	Analytics (3/27)	Programming HW 2 out
11	ML Single Node (4/3)	Written HW 2 out
12	Distributed ML (4/10)
13	Security and Privacy (4/17)
14	Guest Lecture: Junaid Ahmed, VP Engineering, Observability, Datadog (Friday 4/25, 2:30 PM, for both sections; in person in Chandler 402, live on Zoom, and recorded). No regular lecture on Thursday.	Programming HW 2 due (4/25, 5:59:59 PM)
15	Final Exam (5/2, 10:10 AM – 12:40 PM)	Written HW 2 due (4/30, 1:59:59 PM)