CSEE4121 - Computer Systems for Data Science
Spring ’22, Columbia University
Programming Homework 1 | Programming Homework 2
Data scientists and engineers increasingly have access to a powerful and broad range of systems
for conducting big data analysis and machine learning at scale, from databases and large-scale
analytics engines to distributed machine learning frameworks.
The goal of this class is to give data scientists and engineers who work with big data a better
understanding of how the systems they will be using are built. It will also
give them a better understanding of the real-world performance, availability, and scalability
challenges that arise when using and deploying these systems at scale. The course covers
foundational ideas in designing these systems while focusing on specific popular systems that
students are likely to encounter at work or when doing research. The class includes
written homework and programming assignments. One of the programming assignments will be
done in pairs, and the rest will be done individually.
In this course we will answer the following questions:
Asaf Cidon and Sambit Sahu
OH: By appointment only
Asaf Cidon - Fridays 10:10 AM - 12:40 PM | 501 Northwest Corner Building
Sambit Sahu - Thursdays 7:00 PM - 9:30 PM | 402 Chandler
Please refer to Ed for office hours.
Rahul Chaudhari | Shantanu Jain
Wei Hao | Koushik Roy
Aashish Arora | Manisha Rajkumar
Harshitha Malireddi | Ruchika Goel
Joy Parikh | Suvansh Dutta
Sai Karthik Ammanamanchi | Zhejian Jin
Gaurav Sinha
The Ed link has been posted on CourseWorks.
Students are expected to have solid programming experience in Python or an equivalent programming language. This class is intended to be accessible to data scientists who do not necessarily have a background in databases, operating systems, or distributed systems.
Week | Topic | Homework
---|---|---
1 | Introduction (Slides) |
2 | Relational Data Model (Slides) | Programming Homework 1 released (February 1, 2022)
3 | Relational Data Model |
4 | Transactions and Logging (Slides) | Written Homework 1 released
5 | Storage/memory hierarchy (Slides) |
6 | Indices and bloom filters | Programming Homework 1 due (February 25, 2022 4:59:59 PM)
7 | Distributed file systems (Slides) | Written Homework 1 due (March 6, 2022 4:59:59 PM)
8 | Midterm (all material up to Topic 4, not including RocksDB) |
9 | Spring Break |
10 | MapReduce and stragglers (Slides) |
11 | Spark and distributed analytics | Programming Homework 2 released
12 | Caching (Slides) |
13 | Machine Learning (Slides) | Written Homework 2 out
14 | Security (Slides) |
15 | Data Quality and Review | Programming Homework 2 due, Written Homework 2 due (April 29, 2022 4:59:59 PM)
16 | Final Exam: May 6 |
20% Programming Homework 1
10% Written Homework 1
20% Programming Homework 2
10% Written Homework 2
15% Midterm
25% Final
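As an illustration of how these weights combine, here is a minimal Python sketch. The function name `course_grade` and the input format (a dictionary of component scores on a 0-100 scale) are assumptions made for this example, not part of the course's actual grading tooling.

```python
# Weights copied from the grading breakdown above (percent of the final grade).
WEIGHTS = {
    "Programming Homework 1": 20,
    "Written Homework 1": 10,
    "Programming Homework 2": 20,
    "Written Homework 2": 10,
    "Midterm": 15,
    "Final": 25,
}


def course_grade(scores):
    """Return the weighted course grade on a 0-100 scale.

    `scores` is a hypothetical input format: a dict mapping each
    component name in WEIGHTS to a score between 0 and 100.
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    # Each component contributes (weight / 100) * score.
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


example_scores = {
    "Programming Homework 1": 90,
    "Written Homework 1": 80,
    "Programming Homework 2": 100,
    "Written Homework 2": 70,
    "Midterm": 60,
    "Final": 80,
}
print(course_grade(example_scores))  # 82.0
```

The weights sum to 100, so a perfect score on every component yields 100.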
Each student has a total of 3 late days for the entire semester. After all late days are used, there is a 5% penalty for submissions within 24 hours of the deadline, a 10% penalty for submissions within 48 hours, and a 20% penalty for submissions within 72 hours. No submissions will be accepted more than 72 hours after the deadline.
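The penalty tiers can be sketched in Python as follows. This is one possible reading of the policy, not an official calculator: it assumes the student has no late days remaining, that lateness is measured in hours from the original deadline, and that each boundary (exactly 24, 48, or 72 hours) falls in the lower tier. The function name `late_penalty_percent` is hypothetical.

```python
def late_penalty_percent(hours_late):
    """Return the penalty (in percent) for a late submission.

    Assumes no late days remain and that `hours_late` is measured from
    the original deadline. Returns None when the submission would not
    be accepted (more than 72 hours late).
    """
    if hours_late <= 0:
        return 0   # on time: no penalty
    if hours_late <= 24:
        return 5   # within 24 hours of the deadline
    if hours_late <= 48:
        return 10  # within 48 hours of the deadline
    if hours_late <= 72:
        return 20  # within 72 hours of the deadline
    return None    # not accepted


for hours in (0, 10, 30, 60, 80):
    print(hours, late_penalty_percent(hours))
```

How late days interact with these tiers (for example, whether a late day covers a full 24-hour period) is not spelled out in the policy, so the sketch leaves that out.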
Programming assignment 1 and the written assignments will be done individually. Programming assignment 2 will be done in pairs. You may not copy answers or code. We will enforce this policy when checking the assignments (we use a code similarity system).
No textbook.