Computer Systems for Data Science — Midterm Study Guide


Table of Contents

  1. Performance Concepts
  2. Data Centers
  3. Relational Model & SQL Basics
  4. Protocol Buffers & API Schema
  5. Indexing & Query Optimization
  6. Transactions & ACID
  7. Concurrency Control & Locking
  8. Indexing Data Structures & Bloom Filters
  9. Caching
  10. Partitioning & Sharding
  11. Replication
  12. Distributed File Systems
  13. Two-Phase Commit (2PC)
  14. TCP/IP Networking Model
  15. Distributed Systems Architecture

1. Performance Concepts

Metrics

Example: Boeing 747 vs. Concorde

Plane DC→Paris Speed Passengers Throughput (pmph)
Boeing 747 6.5 hrs 610 mph 470 286,700
Concorde 3 hrs 1350 mph 132 178,200

Concorde has lower latency; 747 has higher throughput.

Amdahl’s Law

Predicts speedup when you improve part of a system.

Formula:

Speedup = 1 / ((1 - p) + p/s)

Where:

Maximum speedup (if improved part takes zero time): 1 / (1 - p)

Example: Speed up 10% of queries by 2x

Latency Numbers Every Programmer Should Know

Operation Latency Order of Magnitude
L1 cache reference 1 ns nanoseconds
L2 cache reference 4 ns nanoseconds
Main memory access 100 ns nanoseconds
SSD random read 20 µs (20,000 ns) microseconds
HDD seek 4 ms (4,000,000 ns) milliseconds
Datacenter network round-trip 500 µs microseconds
Internet round-trip (same coast) 30 ms milliseconds
Internet round-trip (cross-country) 100 ms milliseconds

2. Data Centers

Evolution

Physical Structure

Network Topology

  1. Top-of-Rack (ToR) Switch: Connects machines within a rack
  2. End-of-Row Router: Aggregates racks in a row
  3. Core Router: Connects datacenter to outside world

Higher in hierarchy → higher throughput, but also higher latency for across communication. Multipath Routing provides extra capacity and redundancy.

Power & Cooling

Power Usage Effectiveness (PUE):

PUE = Total Facility Power / IT Equipment Power

Power is ~25% of monthly operating cost and often the limiting factor.

Cooling methods:

Backup Power: Batteries for short glitches → diesel generators for longer outages

Datacenter Location Factors

AI Datacenters vs. Traditional

Similarities: Same rack/row topology, cooling still critical

Differences:

Common Failure Modes (per year for new datacenter)

Reliability must come from software!


3. Relational Model & SQL Basics

Core Concepts

Data Types

Type Description
INT32, INT64 Integers (32 or 64 bit)
FLOAT32, FLOAT64 Floating point numbers
CHAR(n) Fixed-length string (padded)
VARCHAR(n) Variable-length string (up to n)
DATE, TIMESTAMP Date/time values
BOOLEAN True/false

Set vs. Multiset vs. List

Collection Ordered? Duplicates?
Set No No
Multiset (Bag) No Yes
List Yes Yes

SQL uses multisets (duplicates allowed, order not guaranteed) because removing duplicates requires expensive sorting/hashing.

Keys and Constraints

Data Independence

Basic Query Structure

SELECT columns      -- what to return (projection)
FROM table          -- where to get data
WHERE condition     -- which rows to include (selection/filter)
GROUP BY column     -- how to group rows
HAVING condition    -- filter on groups (after aggregation)
ORDER BY column     -- how to sort results
LIMIT n             -- how many rows to return

Aggregate Functions

-- WRONG: WHERE with aggregate
SELECT dept, AVG(salary) FROM employees WHERE AVG(salary) > 50000 GROUP BY dept;

-- CORRECT: HAVING with aggregate
SELECT dept, AVG(salary) FROM employees GROUP BY dept HAVING AVG(salary) > 50000;

JOIN Types

CROSS JOIN (Cartesian Product):

SELECT * FROM A, B    -- every row of A paired with every row of B

INNER JOIN:

SELECT * FROM A INNER JOIN B ON A.key = B.key

Returns only rows where both sides have matching values.

LEFT JOIN:

SELECT * FROM A LEFT JOIN B ON A.key = B.key

Returns all rows from A, with NULLs for B columns when no match.

RIGHT JOIN: Like LEFT JOIN, but keeps all rows from B.

FULL OUTER JOIN: Keeps all rows from both tables.

Subqueries

IN: Value in result set

SELECT * FROM employees WHERE dept_id IN (SELECT id FROM departments WHERE location = 'NYC');

EXISTS: Result set is non-empty

SELECT * FROM employees e WHERE EXISTS (SELECT 1 FROM projects p WHERE p.lead = e.id);

ALL: Comparison with all values

SELECT * FROM employees WHERE salary > ALL (SELECT salary FROM employees WHERE dept = 'Sales');

ANY: Comparison with any value

SELECT * FROM employees WHERE salary > ANY (SELECT salary FROM employees WHERE dept = 'Sales');

4. Protocol Buffers & API Schema

API Schema

Data Serialization Formats

Format Type Pros Cons
JSON Text Human-readable, flexible Verbose, slow parsing
XML Text Structured, validated Very verbose
Protocol Buffers Binary Compact, fast, typed Not human-readable

Protocol Buffers (Protobuf)

API schema definition language from Google.

Example:

message Person {
  string name = 1;
  int32 id = 2;
  string email = 3;
  
  enum PhoneType {
    MOBILE = 0;
    HOME = 1;
    WORK = 2;
  }
  
  message PhoneNumber {
    string number = 1;
    PhoneType type = 2;
  }
  
  repeated PhoneNumber phones = 4;
}

Why Protobuf is Smaller

Serialization

Backward Compatibility

Never mark fields as required in Protobuf because:


5. Indexing & Query Optimization

What is an Index?

A separate data structure that speeds up lookups on a column. Like an index in a book — instead of scanning every page, you look up the topic and jump directly.

Creating Indexes

CREATE INDEX idx_name ON table(column);
CREATE INDEX idx_multi ON table(col1, col2);  -- composite index

Index Trade-offs

Benefits:

Costs:

When Indexes Help

When Indexes Don’t Help

Query Planner

Look for:

Things That Prevent Index Use


6. Transactions & ACID

What is a Transaction?

A sequence of operations that forms a single logical unit of work.

START TRANSACTION;
  UPDATE accounts SET balance = balance - 100 WHERE id = 1;
  UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;

ACID Properties

Property Meaning Mechanism
Atomicity All or nothing; partial failures are rolled back Write-ahead logging
Consistency Database moves from one valid state to another Constraints, triggers
Isolation Concurrent transactions don’t interfere Locking, MVCC
Durability Committed changes survive crashes Write-ahead logging, replication

Write-Ahead Logging (WAL)

OLTP vs. OLAP

Characteristic OLTP OLAP
Stands for Online Transaction Processing Online Analytical Processing
Operations Read/write, real-time updates Read-only, batch writes
Query size Small, targeted Large, scanning/aggregating
Concurrency Many concurrent users Fewer, longer queries
Examples MySQL, Postgres, Redis BigQuery, Snowflake, Redshift
Use cases Banking, gaming, user settings Business reports, analytics

Isolation Anomalies

Anomaly Description Example
Dirty Read Read uncommitted data T1 writes, T2 reads, T1 aborts
Lost Update Concurrent writes, one lost T1 and T2 both read X=10, both write X=X+1, result is 11 not 12
Unrepeatable Read Same row, different values T1 reads X, T2 updates X and commits, T1 reads X again
Phantom Read Row set changes between reads T1 counts rows, T2 inserts row, T1 counts again

Isolation Levels

Level Dirty Reads Lost Updates Unrepeatable Reads Phantom Reads
Read Uncommitted Possible Possible Possible Possible
Read Committed Prevented Possible Possible Possible
Repeatable Read Prevented Prevented Prevented Possible
Serializable Prevented Prevented Prevented Prevented

Most databases default to Read Committed for performance.

Serializability

A schedule is serializable if its effect equals some serial execution of the same transactions.

Conflict serializability: Build a conflict graph; if acyclic, schedule is conflict serializable.

Conflict types:


7. Concurrency Control & Locking

Why Concurrency?

Lock Types

Request \ Held None S X
S Grant Grant Wait
X Grant Wait Wait

Two-Phase Locking (2PL)

Rule: A transaction cannot acquire new locks after releasing any lock.

Two phases:

  1. Growing phase: Acquire locks (no releases)
  2. Shrinking phase: Release locks (no acquires)

2PL guarantees conflict serializability (if it completes).

Strict 2PL

Rule: Hold ALL locks until COMMIT or ABORT.

Benefits:

Deadlocks

A cycle of transactions waiting for each other’s locks.

Example:

Detection: Waits-for graph; cycle = deadlock

Resolution: Abort one transaction (usually youngest or one with least work)

Prevention strategies:


8. Indexing Data Structures & Bloom Filters

B-Tree / B+ Tree

Most common index structure.

Properties:

Dense vs. Sparse Indexes

Type Description When to Use
Dense Entry for every search key value Required for secondary indexes
Sparse Entry for some values only Only works when data is sorted by search key

Why must secondary indexes be dense? Secondary indexes point to records that aren’t physically sorted by that column, so you can’t interpolate between entries.

Bloom Filters

A probabilistic data structure for checking set membership very fast.

Structure: Array of m bits, initialized to 0; k hash functions

Insert: Hash item k times, set those bit positions to 1

Lookup: Hash item k times, check if ALL positions are 1

Properties:

Use case: Quickly filter out definite non-members before expensive lookup.

Trade-off: Larger filter (more bits) = fewer false positives but more memory


9. Caching

Why Cache?

Avoid repeated expensive computation or data fetching.

Cache Concepts

Cache Problems

  1. Assignment: Where to store data and how?
  2. Consistency: What happens when data is updated or deleted?
  3. Eviction: What to remove when cache is full?

Eviction Policies

Policy Description Pros Cons
FIFO First in, first out Simple Doesn’t consider access patterns
LRU Least recently used Good for temporal locality Overhead to track recency
LFU Least frequently used Good for popularity Doesn’t adapt to changes
OPT Evict item used furthest in future Optimal (theoretical) Requires future knowledge

OPT (Belady’s algorithm) is optimal but impractical — useful as benchmark.

Cache Consistency

When underlying data changes:


10. Partitioning & Sharding

Why Partition?

Partitioning Schemes

Round-Robin:

Hash Partitioning:

Range Partitioning:

Execution Skew

Consistent Hashing

Solves the redistribution problem.

Concept: Place servers on a ring (0 to 1). Each key hashes to a point on the ring and goes to the next server clockwise.

Adding a node: Only the new node’s immediate neighbor transfers data.

Removing a node: Only its neighbor absorbs its data.

Benefits:

Used by: Cassandra, DynamoDB, Akamai


11. Replication

Why Replicate?

Replication Factor

Typically replicate to 3 nodes. Why 3?

Master-Replica (Primary-Copy)

Replication Locations

Consistency Challenges


12. Distributed File Systems

What is a File System?

Provides logical concept of files with paths; supports read, write, create, delete.

Distributed File System

File system spanning multiple machines with partitioning and replication.

HDFS (Hadoop Distributed File System)

NameNode as single point of failure:

File Systems vs. Databases


13. Two-Phase Commit (2PC)

The Problem

Distributed transaction spans multiple servers. Must ensure atomicity: commit at ALL servers or abort at ALL servers. Cannot have partial commits.

Roles

Protocol Phases

Phase 1: Prepare

  1. Coordinator sends PREPARE to all participants
  2. Each participant:
    • Writes PREPARE record to local log
    • If can commit: vote YES
    • If cannot: vote NO

Phase 2: Commit/Abort

  1. Coordinator collects votes:
    • ALL yes → send COMMIT to all
    • ANY no → send ABORT to all
  2. Each participant writes COMMIT or ABORT to log
  3. Participants acknowledge to coordinator

Failure Handling

Uses write-ahead logging at each node.

Participant crash before voting: Coordinator times out, aborts transaction.

Participant crash after voting YES: On recovery, check log and ask coordinator for decision.

Coordinator crash after collecting votes: Participants may be “blocked” waiting for decision.

Fail-Stop vs. Byzantine

2PC assumes fail-stop model.


14. TCP/IP Networking Model

The 5-Layer Model

Layer Name Function Protocols/Examples
5 Application Software communication HTTP, HTTPS, FTP, SMTP, SSH, DNS
4 Transport App-to-app delivery, reliability TCP, UDP, QUIC
3 Network Global routing IPv4, IPv6, BGP
2 Data Link Local delivery Ethernet, MAC, ARP, DHCP
1 Physical Bits to signals Fiber, copper, WiFi, 5G

Layer 1: Physical

Layer 3: Network

Layer 4: Transport

TCP vs. UDP:

Feature TCP UDP
Connection Required (handshake) None
Reliability Guaranteed delivery Best-effort
Ordering Preserved Not guaranteed
Speed Slower Faster
Use cases HTTP, SSH, FTP DNS, video streaming, games

TCP guarantees:

QUIC: Modern protocol that runs over UDP but provides TCP-like reliability with less handshake overhead. Used by Google, YouTube, Meta.

Layer 5: Application


15. Distributed Systems Architecture

The 8-Layer Stack (from global to local)

One way to divide and understand a global scale system:

  1. DNS Routing: Map domain name to nearest healthy datacenter
  2. API Gateway: Authenticate, route requests, circuit breaker
  3. Microservices: Parallel fan-out to multiple services
  4. Data & Caching: Cache hits vs. database queries
  5. CDN: Video served from ISP-local boxes
  6. Pods & Replication: Leader election, consensus
  7. VMs & Containers: Isolation and orchestration
  8. Cores & Threads: Thread scheduling, parallelism

DNS Routing

API Gateway

Caching Layer

Consistency Spectrum

Level Behavior Use Case
Eventual Write returns before replication; replicas catch up eventually Analytics, view counts
Session User sees own writes; others may see stale data briefly Watch lists, profiles
Strong Write synchronously replicates everywhere Billing, payments

CDN (Content Delivery Network)

Adaptive Bitrate Streaming

Concurrency vs. Parallelism

VMs vs. Containers

Aspect VM Container
Isolation Own OS kernel Shared kernel
Overhead Higher (full OS) Lower (just process)
Security Stronger (hypervisor boundary) Weaker (kernel bug affects all)

Kubernetes: Orchestrates containers — schedules, scales, heals, deploys. Thundering herd: Sudden burst of incoming request that defies usual patterns. Retry storm: Failing clients all keep retrying, increasing load and cascading failures.


Good luck on the exam!