CSEE4121 - Computer Systems for Data Science
Spring ’24, Columbia University
Data scientists and engineers increasingly have access to a broad range of
powerful systems for conducting big data analysis and machine learning at
scale, from databases and large-scale analytics to distributed machine learning
frameworks.
The goal of this class is to give data scientists and engineers who work
with big data a better understanding of the foundations of how the systems they
will be using are built. It will also give them a better understanding of the
real-world performance, availability, and scalability challenges that arise when using and
deploying these systems at scale. In the course we will cover foundational ideas
in designing these systems, while focusing on specific popular systems that
students are likely to encounter at work or when doing research. The class will
include some written homework and programming assignments. One of the
programming assignments will be done in pairs, and the rest will be done
individually. In this course we will answer the following questions:
Sambit Sahu - ss3876@columbia.edu
OH: CEPSR (Schapiro Building) 7W51 Thursday 6:00pm-6:50pm
Mooizz Abdul - ma4496@columbia.edu | Nipun Navin Agarwal - nna2132@columbia.edu |
Mohini Mangesh Bhave - mb5157@columbia.edu | Samhit Chowdary Bhogavalli - sb4845@columbia.edu |
Tanisha Bisht - tb3061@columbia.edu | Ajit Sharma Kasturi - ak5055@columbia.edu |
Anvith Pabba - ap4450@columbia.edu
Students are expected to have solid programming experience in Python or an equivalent programming language. This class is intended to be accessible to data scientists who do not necessarily have a background in databases, operating systems, or distributed systems.
Week | Topic | Homework
---|---|---
1 | Introduction (Slides) |
2 | Relational Data Model (Slides) | Programming Homework 1 released (February 1, 2024)
3 | Relational Data Model |
4 | Transactions and Logging (Slides) |
5 | Storage/memory hierarchy (Slides) | Written Homework 1 released (February 15, 2024) (on Gradescope)
6 | Indices and Bloom filters | Programming Homework 1 due (February 22, 2024, 4:59:59 PM)
7 | Distributed file systems (Slides) | Written Homework 1 due (February 29, 2024, 4:59:59 PM)
8 | Midterm on March 7, 2024 (all material up to Topic 4, not including RocksDB) |
9 | Spring Break |
10 | MapReduce and stragglers (Slides) |
11 | Spark and distributed analytics |
12 | Caching (Slides) | Programming Homework 2 released (April 4, 2024)
13 | Machine Learning (Slides) |
14 | Security (Slides) | Written Homework 2 released (April 11, 2024)
15 | Data Quality and Review | Programming Homework 2 due (April 25, 2024)
16 | | Written Homework 2 due (May 2, 2024)
17 | Final Exam (May 9, 2024) |
20% Programming Homework 1
10% Written Homework 1
20% Programming Homework 2
10% Written Homework 2
15% Midterm
25% Final
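As a rough illustration of how these weights combine, here is a minimal sketch. The function name `course_grade` and the assumption that every component is scored on a 0-100 scale are hypothetical and not part of the course policy. Late penalties, rounding, and any curve are not modeled.

```python
# Weights (in percent) taken from the grading breakdown above.
WEIGHTS = {
    "Programming Homework 1": 20,
    "Written Homework 1": 10,
    "Programming Homework 2": 20,
    "Written Homework 2": 10,
    "Midterm": 15,
    "Final": 25,
}


def course_grade(scores):
    """Return the weighted course grade, assuming each score is on a 0-100 scale.

    `scores` maps each component name in WEIGHTS to a score. This helper is
    hypothetical; it only shows the weighted-average arithmetic.
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


example = {
    "Programming Homework 1": 90,
    "Written Homework 1": 80,
    "Programming Homework 2": 85,
    "Written Homework 2": 70,
    "Midterm": 75,
    "Final": 88,
}
print(course_grade(example))  # 83.25
```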
Each student has a total of 3 late days for the entire semester. After all late days are used, a late submission incurs a 5% penalty if submitted within 24 hours of the deadline, a 10% penalty within 48 hours, and a 20% penalty within 72 hours. No submissions will be accepted more than 72 hours after the deadline.
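The penalty tiers can be sketched as a small lookup. This is only an illustration under stated assumptions: the function name `late_penalty_percent` is hypothetical, the student is assumed to have no late days remaining, and each "within N hours" boundary is treated as inclusive. How late days are counted and applied is not specified here, so it is not modeled.

```python
def late_penalty_percent(hours_late):
    """Return the penalty in percent, or None if the submission is not accepted.

    Assumes all late days are already used and that each "within N hours"
    boundary is inclusive (both are assumptions, not stated policy).
    """
    if hours_late <= 0:
        return 0  # submitted on time
    if hours_late <= 24:
        return 5
    if hours_late <= 48:
        return 10
    if hours_late <= 72:
        return 20
    return None  # more than 72 hours late: not accepted


for hours in (0, 3, 30, 60, 80):
    print(hours, late_penalty_percent(hours))
```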
Programming assignment 1 and the written assignments will be done individually. Programming assignment 2 will be done in pairs. You may not copy answers or code. We will enforce this policy when checking the assignments (we use a code similarity system).
No textbook.