Treat this guide as a summary and outline of the material covered on the
final exam. A deeper understanding of the recurring systems themes will pay off in
exam performance, and the case studies mentioned in class (or encountered elsewhere)
generally help internalize the concepts.
Students are expected to know basic SQL and Python in order to understand and
interpret code used in scenario-based questions. Completing Homework 3 and the
in-class workshops should help.
1. Batch Processing at Scale
1.1 MapReduce
The programming model is deliberately less expressive than SQL. Users write only a map function and a reduce
function, and the framework handles parallelism, fault tolerance, and scale.
Execution proceeds in three stages. Mappers process input splits in parallel. The framework shuffles
intermediate output so that all values sharing a key end up at the same reducer. Reducers then process each key
independently.
Word count is the canonical example. When a single pass is not enough, jobs are chained, as when computing
global percentages requires a second job that knows the total.
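A minimal sketch of the two user-supplied functions for word count; map_fn and
reduce_fn are illustrative names, and the framework that splits input, shuffles
by key, and invokes them is assumed rather than shown.

    def map_fn(_key, line):
        # Emit (word, 1) for every word in one line of the input split.
        for word in line.split():
            yield (word, 1)

    def reduce_fn(word, counts):
        # After the shuffle, all counts for one word arrive together.
        yield (word, sum(counts))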
MapReduce runs on top of HDFS in a master and worker architecture. The scheduler tries to place each map task on
a node that already holds its input block, since the network is the cluster bottleneck and moving compute to data
is cheaper than the reverse.
Failures are treated as routine. A worker that misses heartbeats has its tasks reassigned, and re-execution is
safe because map and reduce functions are deterministic. The master is a single point of failure but can be
recovered from an execution log.
Stragglers dominate latency because the job finishes only when the slowest task does. Speculative execution
launches a duplicate of any apparently slow task and takes whichever copy finishes first. This is also why the
99th percentile latency matters more than the average.
1.2 Spark
Spark exists because MapReduce writes intermediate results to disk between every stage. Iterative and
interactive workloads pay this cost over and over. Spark keeps the working set in cluster memory and amortizes the
load cost across many operations.
The core abstraction is the Resilient Distributed Dataset, an immutable partitioned collection formed from
stable storage or other RDDs. The vocabulary of operations is much richer than MapReduce, including map, filter,
groupBy, and join.
Transformations are lazy and only build a plan. Actions trigger execution. Laziness is what allows the optimizer
to look at a whole chain of operations and plan across them, which eager evaluation would prevent.
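A small PySpark sketch of this behavior, assuming a local SparkSession; the
input path events.txt is hypothetical. Every line but the last only extends
the plan.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("lazy-demo").getOrCreate()

    rdd = spark.sparkContext.textFile("events.txt")        # hypothetical input
    errors = rdd.filter(lambda line: "ERROR" in line)      # transformation: plan only
    pairs = errors.map(lambda line: (line.split()[0], 1))  # still nothing has run
    counts = pairs.reduceByKey(lambda a, b: a + b)         # still a plan

    print(counts.take(5))   # action: the whole chain executes here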
Fault tolerance comes from lineage rather than replication. Each RDD records the operations that produced it, so
a lost partition is recomputed rather than restored from a copy. Recovery is slower than fetching a replica but
normal execution carries no replication overhead.
The DAG scheduler groups operations into stages, pipelines what it can, and avoids shuffles when partitioning
permits.
DataFrames and Spark SQL sit on top of RDDs and present a relational view with named columns. Knowing the schema
enables more aggressive optimization than raw RDD code allows. UDFs serve as an escape hatch for custom row logic
at the cost of some optimization opportunities.
2. Stream Processing
Stream processing sits between OLTP and OLAP. It is not OLTP because updates are not in-place row writes, and it
is not OLAP because results cannot wait for a batch window. Typical use cases include surge pricing, fraud
detection, personalization, and real-time dashboards.
Spark Streaming treats the stream as a series of very small batch jobs, on the order of half a second each.
Because each micro-batch is a normal deterministic Spark job, it inherits all the fault tolerance of the batch
engine without separate machinery. Other systems such as Flink and Kinesis make different tradeoffs but follow the
same general pattern of reusing batch-like recovery for streaming workloads.
The programming model is the DStream, a discretized stream that behaves like a series of RDDs. Most operations
resemble their batch counterparts. The distinctive ones are windowed operations such as countByWindow and
reduceByKeyAndWindow.
Windowing is most efficient when computed incrementally. A sliding window of the last ten minutes, updated every
second, should add the new batch and subtract the expired batch rather than recompute from scratch.
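A hedged PySpark DStream sketch of the incremental form; the socket source and
checkpoint directory are placeholders. The inverse function is what lets the
window be updated rather than rebuilt.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="window-demo")
    ssc = StreamingContext(sc, batchDuration=1)   # 1-second micro-batches
    ssc.checkpoint("checkpoints")                 # required for the inverse form

    words = ssc.socketTextStream("localhost", 9999).flatMap(str.split)
    pairs = words.map(lambda w: (w, 1))

    counts = pairs.reduceByKeyAndWindow(
        lambda a, b: a + b,   # fold in the batch entering the window
        lambda a, b: a - b,   # subtract the batch leaving the window
        600, 1,               # window length and slide interval, in seconds
    )
    counts.pprint()
    ssc.start()
    ssc.awaitTermination()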
3. Security Foundations
3.1 Goals and Adversaries
Security is usually framed around four goals.
Confidentiality keeps data from leaking.
Integrity prevents tampering.
Availability keeps the system reachable.
Authenticity provides confidence about who sent a message or signed a document.
Any security claim depends on an adversary. A defense that stops a casual outsider may be trivial for a
nation-state, and a system designed to protect against a cloud provider is very different from one designed to
protect against a competitor.
End-to-end security is a strict requirement. The weakest layer determines the overall posture. Encrypted disks
do not help if the password is phished, and HTTPS does not help if the server logs requests in plaintext.
3.2 Cryptographic Primitives
Symmetric encryption uses a single shared key for both encryption and decryption. AES is the dominant
algorithm and is fast enough for bulk data, but the parties have to share the key first.
Asymmetric encryption uses a public and private key pair. RSA is the canonical example, based on the hardness of
factoring large integers. It is much slower than symmetric encryption, so in practice it is used to exchange a
symmetric key rather than to encrypt the data itself.
Hashing is a one-way function that produces a fixed-size digest, used as a building block for integrity checks
and signatures.
A digital signature is the hash of a message encrypted with the sender's private key. It provides integrity and
authenticity but not confidentiality, since the message itself is typically sent in the clear.
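A short sketch using Python's cryptography library; the message is illustrative,
and any vetted library with an equivalent API works.

    from cryptography.hazmat.primitives import hashes
    from cryptography.hazmat.primitives.asymmetric import padding, rsa

    private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
    pss = padding.PSS(mgf=padding.MGF1(hashes.SHA256()),
                      salt_length=padding.PSS.MAX_LENGTH)
    message = b"wire $100 to account 42"

    # Sign: hash the message and bind the digest to the private key.
    signature = private_key.sign(message, pss, hashes.SHA256())

    # Anyone holding the public key can verify; a tampered message or
    # signature raises InvalidSignature.
    private_key.public_key().verify(signature, message, pss, hashes.SHA256())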
Two practical rules apply. Never invent your own crypto, and use vetted libraries. Even correct primitives in
the wrong configuration are insecure, which is where most real failures come from.
3.3 Compliance
Compliance and security are not the same thing. A system can pass a HIPAA or GDPR audit and still be trivial to
breach, and a well-engineered system can fail an audit on paperwork grounds. Both matter for different reasons.
The major frameworks are:
HIPAA for healthcare data
GDPR for EU residents
PCI for payment cards
FERPA for student records
The principles that recur across them are worth following on engineering merit alone. These
include encrypting personal data at rest and in transit, keeping immutable audit logs, restricting access on a
need-to-know basis, supporting deletion requests, and respecting data residency rules.
4. Attacks on Data Systems
4.1 Injection
Injection attacks happen whenever untrusted input is interpreted as code rather than data. SQL injection is the
canonical case. The application builds a query by string concatenation, and a value such as
' OR 1=1 -- turns an authentication query into one that returns every user. The same trick scales to
exfiltration with UNION, modification with UPDATE, and destruction with DROP TABLE.
The structural fix is parameterized queries. The query template and its parameters are sent to the database
separately, and parameters are bound as data that is never parsed as SQL. Input validation, least-privilege
accounts, and web application firewalls add depth but do not substitute for the structural fix.
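A minimal sqlite3 contrast between the two styles; the users table and the
values are hypothetical.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, pw_hash TEXT)")
    name = "' OR 1=1 --"   # hostile input

    # VULNERABLE: concatenation lets the input rewrite the query itself,
    # collapsing the WHERE clause into a tautology.
    # conn.execute("SELECT * FROM users WHERE name = '" + name + "'")

    # SAFE: the template and the values travel separately; ? placeholders
    # are bound as data and never parsed as SQL.
    rows = conn.execute("SELECT * FROM users WHERE name = ?", (name,)).fetchall()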
The pattern generalizes. NoSQL injection, command injection, log injection, and cross-site scripting are all
variations on mixing untrusted input with code at an interpreter boundary.
Remote code execution is a related attack in which the attacker succeeds in executing their own code on
the server, though not necessarily through user input. Buffer overflows are a common vector.
4.2 Credentials, Phishing, and Denial of Service
Credential attacks exploit human behavior. Credential stuffing relies on password reuse across services. Brute
force and dictionary attacks try many guesses, and they are slowed by rate limiting, lockouts, and CAPTCHAs.
Multi-factor authentication is the real defense, since a stolen password becomes insufficient on its own.
Phishing sidesteps technical controls by attacking the human directly. Spear phishing uses personal context to
be more convincing and is the entry point for most major breaches. The most effective defense is hardware-backed
MFA, which verifies the domain and so resists fake login pages in a way that SMS or TOTP codes do not.
Denial of service comes in three flavors. Volumetric attacks saturate bandwidth, protocol attacks exhaust
resources such as the TCP connection table through SYN floods, and application-layer attacks send individually
plausible but expensive requests.
Distributed denial of service uses botnets, often built from compromised IoT
devices, and is harder to defend against because the traffic comes from everywhere. The Mirai botnet that brought
down Dyn in 2016 is the textbook example. Mitigations include CDNs that absorb traffic at the edge, traffic
scrubbing services, rate limiting per source, and auto-scaling, though the last can become an expensive contest.
5. Authentication and Authorization
Authentication asks who you are. Authorization asks what you are allowed to do. The two are independent, and
many breaches exploit the gap between them. A common failure is an authenticated user with too many permissions,
and the inverse failure is an unauthenticated user reaching a resource that should have required auth, as with a
public S3 bucket.
Authentication is described through three factors. Something you know is the weakest, since passwords are
guessed, phished, and reused. Something you have, such as a hardware key, is stronger because it requires physical
access. Something you are, meaning biometrics, is convenient but cannot be changed if compromised.
Multi-factor
authentication combines two or more of these and blocks the great majority of automated credential attacks. Among
MFA options, hardware keys and passkeys are stronger than SMS or TOTP, since the latter can still be phished by a
convincing fake login page.
Passwords should never be stored in plaintext. Storing a bare hash is insufficient because rainbow tables let
attackers reverse common passwords. The correct pattern is a salted hash computed by a deliberately slow function
such as bcrypt, scrypt, or argon2. A login that takes a quarter of a second is imperceptible to the user but
devastating to a brute-force attacker.
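A sketch with the bcrypt library; the passwords are placeholders. Note that the
salt and the work factor are embedded in the stored value itself.

    import bcrypt

    password = b"correct horse battery staple"

    # Registration: gensalt() picks a random per-user salt; the work factor
    # makes each hash deliberately slow.
    stored = bcrypt.hashpw(password, bcrypt.gensalt(rounds=12))

    # Login: checkpw re-hashes the attempt with the embedded salt and
    # compares the results.
    assert bcrypt.checkpw(password, stored)
    assert not bcrypt.checkpw(b"hunter2", stored)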
Sessions are managed in two main styles. Session cookies keep state on the server and are easy to revoke. JWTs
are signed self-contained tokens that scale well in distributed systems but are hard to revoke before they expire.
API keys are a programmatic variant and require their own hygiene, including avoiding source-control commits,
scoping to least privilege, and rotating periodically.
Single sign-on routes authentication through a central identity provider, simplifying onboarding, offboarding,
and uniform MFA enforcement.
OAuth 2.0 enables delegated authorization, letting a third-party application act on a
user's behalf with a scoped, revocable token rather than receiving the user's password.
Service-to-service authentication uses dedicated service accounts, cloud IAM roles that cloud workloads inherit
automatically, and mutual TLS in zero-trust architectures. The recurring failure mode is hardcoded credentials in
scripts or config files.
6. The Equifax Breach
Equifax is one of the three major US credit bureaus and held financial data on roughly 800 million people.
The 2017 breach exposed personal data of about 147 million Americans, including Social Security numbers,
birth dates, addresses, driver's license numbers, and credit card numbers.
The entry point was an Apache Struts vulnerability. A patch was available the same day it was disclosed in March
2017. Internal teams were told to apply it, and a vulnerability scan a few days later reported no issues. The
attackers exploited the still-unpatched system in May, then spent 76 days inside the network exfiltrating data
before unusual traffic was noticed at the end of July. Public disclosure followed in September.
Six layered failures combined to make the breach possible. The Struts patch was not applied promptly. The
vulnerability scan failed to cover the affected system. An internal traffic-inspection tool had an expired SSL
certificate for nineteen months and so could not see the exfiltration. The network was flat, with no segmentation
to limit lateral movement. PII was stored unencrypted at rest. Database credentials were kept in plaintext config
files, which let the attackers jump from the initial compromised host to
the databases.
Each one of these failures is common on its own.
The breach was so severe because every layer that should have caught it
was simultaneously broken, revealing systematic neglect. Defense in depth
would have worked if the layers actually worked.
7. Access Control in Databases
Least privilege is the organizing principle. Every user and process should have only the permissions needed,
since the blast radius of a breach equals whatever the compromised account could touch. The principle is harder to
follow than to state because of convenience pressure, permission creep, and a lack of regular reviews.
Role-based access control (RBAC) introduces a layer of indirection between users and permissions. Permissions
are
granted to named roles such as analyst, engineer, or admin, and users are assigned to roles. Onboarding,
offboarding, and policy changes all happen at the role level, and the system is straightforward to audit.
Attribute-based access control (ABAC) instead writes policies that depend on attributes of the user, the
resource, and
the environment. A typical policy might allow finance employees to read confidential records only from the
corporate network during business hours.
ABAC is more expressive than RBAC but harder to audit, so most real
systems are hybrids that use RBAC for the common case and ABAC for context-sensitive exceptions.
Databases offer several enforcement mechanisms at different granularities. Views restrict which columns a user
can see. Row-level security restricts which rows are visible within the same table, useful for multi-tenant or
regional access. Column-level encryption stores sensitive values as ciphertext, providing defense in depth at the
cost of harder querying and key management. Dynamic data masking transforms displayed values based on the viewer
rather than encrypting the underlying storage.
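A hedged sketch of two of these mechanisms in PostgreSQL via psycopg2; the
orders table, its columns, and the app.tenant_id setting are all hypothetical.

    import psycopg2

    conn = psycopg2.connect("dbname=app")   # hypothetical connection string
    cur = conn.cursor()

    # Column restriction via a view: analysts see orders without card data.
    cur.execute("""
        CREATE VIEW orders_for_analysts AS
        SELECT order_id, region, total FROM orders
    """)

    # Row restriction: each tenant sees only its own rows of the shared table.
    cur.execute("ALTER TABLE orders ENABLE ROW LEVEL SECURITY")
    cur.execute("""
        CREATE POLICY tenant_isolation ON orders
        USING (tenant_id = current_setting('app.tenant_id'))
    """)
    conn.commit()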
Real-world failures are usually human. Shared service accounts erase accountability, permissions accumulate
without review, and overly strict controls produce shadow IT in which analysts move data to personal spreadsheets.
The right balance is regular access reviews, automated provisioning tied to HR systems, and audit logging, with
cultural commitment to least privilege rather than additional layers of technical enforcement.
8. Anonymization and Privacy Models
Removing direct identifiers such as names and Social Security numbers does not anonymize a dataset.
Quasi-identifiers, the fields that do not identify on their own but combine into fingerprints, do most of the
identifying work. Sweeney's 1997 result showed that about 87 percent of Americans can be uniquely identified by
ZIP code, date of birth, and gender. She demonstrated this by linking an anonymized Massachusetts medical dataset
against voter rolls and identifying the governor's records.
Anonymization techniques include suppression of fields, generalization to broader categories, pseudonymization,
masking, and perturbation. All of them trade utility for privacy.
The Netflix Prize illustrated how easy these techniques are to defeat. Netflix released ratings with usernames
removed, IDs randomized, and some values perturbed. Researchers cross-referenced the dataset with public IMDb
ratings and identified the vast majority of users from as few as six rated movies.
Sparsity kills anonymity. In high-dimensional data almost every person is unique, and uniqueness combined with
auxiliary information can lead to identification.
Released data cannot be unreleased.
Some formal models attempt to put guarantees underneath these intuitions, and each is broken by the next attack.
K-anonymity requires each record to be indistinguishable from at
least k-1 others on quasi-identifiers. But it is defeated by the homogeneity
attack, in which all matching records share the same sensitive
value.
L-diversity requires each group to contain at least l distinct
sensitive values. But it is defeated by distribution skew.
T-closeness requires each group's sensitive distribution to stay
close to the overall distribution. But it requires heavy generalization
that damages utility.
All three offer syntactic guarantees on the structure of the released
data. Successive constraints and tradeoffs motivate a more systematic
approach, like differential privacy.
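A toy check of k-anonymity over dictionary records (the columns are invented).
It also shows why the guarantee is syntactic: the rows below are 3-anonymous
on the quasi-identifiers yet leak the diagnosis through homogeneity.

    from collections import Counter

    def k_of(records, quasi_identifiers):
        # k is the size of the smallest group of records that are
        # indistinguishable on the quasi-identifier columns.
        groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
        return min(groups.values())

    rows = [
        {"zip": "021**", "age": "30-39", "diagnosis": "flu"},
        {"zip": "021**", "age": "30-39", "diagnosis": "flu"},
        {"zip": "021**", "age": "30-39", "diagnosis": "flu"},
    ]
    print(k_of(rows, ["zip", "age"]))   # 3, yet the sensitive value is exposed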
9. Systems for Machine Learning
9.1 Single Node
9.1.1 Landscape
Many ML techniques used today existed by the 1990s. The current era began when models, data, and hardware all
crossed thresholds of scale. All three had to come together.
The transformer architecture is worth singling out because attention is parallelizable in a way that recurrent
networks are not, which made it a good fit for modern GPUs.
Traditional ML still matters. Tree-based methods such as XGBoost dominate tabular problems and run comfortably
on CPUs in seconds. Multiple approaches should be weighed before committing to, say, neural networks.
9.1.2 Hardware
CPUs and GPUs have different strengths. CPUs handle branching and irregular workloads well and are the right
choice for many ML tasks. GPUs have thousands of cores that execute in lockstep, which suits the regular
matrix operations underlying deep learning.
GPU memory uses a high-bandwidth design that delivers an order of magnitude more bandwidth than standard DRAM
but at much lower capacity.
But software tools, libraries, and installed base matter just as much in a
competitive market. Nvidia's dominance in ML hardware comes as much from the
CUDA software ecosystem as from the chips themselves.
Memory is the binding constraint more often than compute. Model memory equals parameters times bytes per
parameter, and training adds roughly a factor of four for gradients and optimizer state. Activation memory scales
with batch size, sequence length, and depth, and is often the actual cause of out-of-memory errors.
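A back-of-envelope calculator that follows the rule of thumb above; the 2-byte
(fp16) parameters and the factor of four are rough assumptions, and activation
memory is deliberately excluded.

    def model_memory_gb(params_billions, bytes_per_param=2, training=False):
        weights = params_billions * 1e9 * bytes_per_param   # weights alone
        # Training adds gradients and optimizer state, roughly 4x weights.
        return weights * (4 if training else 1) / 1e9

    print(model_memory_gb(7))                  # ~14 GB: a 7B model in fp16
    print(model_memory_gb(7, training=True))   # ~56 GB before activations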
The toolbox for managing memory includes mixed precision (almost a free lunch), gradient checkpointing (compute
for memory), quantization (especially valuable for inference), and tuning batch size to the largest value that
fits.
Data loading can also become a bottleneck and waste GPU cycles. If any upstream stage of the data pipeline is
slower than the
GPU's consumption rate, the GPU sits idle. Low utilization is almost always a data-loading or batch-size problem
rather than a model problem.
9.1.3 Training vs. Inference
Training and inference have different bottlenecks. Training is compute-bound, since large batches keep the
matmul cores busy. Inference, when generating one token at a time, is memory-bound, since the entire model must be
loaded from memory, and the cores spend a lot of their time waiting.
The KV cache is what turns generation from a compute-bound problem into a
memory-bound one. It stores the keys and values from earlier tokens
so that generating the next token does not require recomputing them, but it grows with sequence length and
the number of concurrent users.
At scale, the KV cache can rival the model weights in memory footprint. Managing it well largely
determines how many users a server can serve at once.
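A rough sizing sketch; the shape constants are illustrative (roughly a 7B-class
decoder) rather than any specific model's published configuration.

    def kv_cache_gb(layers, heads, head_dim, seq_len, users, bytes_per_val=2):
        # Keys and values (the factor of 2) per layer, per token, per user.
        per_token = 2 * layers * heads * head_dim * bytes_per_val
        return per_token * seq_len * users / 1e9

    # 32 layers, 32 heads of dim 128, 4k context, 16 concurrent users:
    print(kv_cache_gb(32, 32, 128, 4096, 16))   # ~34 GB, rivaling the weights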
9.2 Distributed ML
Once a model or dataset outgrows a single GPU, work has to be spread
across multiple servers, and some standard distributed-systems challenges and
solutions come into play.
9.2.1 Training
Data parallelism is the simpler strategy. Each worker holds a complete model replica, the batch is split across
workers, and gradients are combined so that every replica applies the same update.
The simplest implementation uses a parameter server that holds canonical weights and aggregates gradients
centrally, but the parameter server becomes a bottleneck as the number of workers grows.
AllReduce is the decentralized alternative in which every worker ends up with the aggregated result. Naive
all-to-all communication scales no better than the parameter server, but Ring AllReduce passes parameter slices
around a logical ring in two phases and reduces total communication to a linear function of the number of workers.
It has become the de facto standard.
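A single-process simulation of the two phases, with numpy arrays standing in
for per-worker buffers; a real implementation runs across machines and overlaps
communication with compute.

    import numpy as np

    def ring_allreduce(worker_data):
        # Every worker ends up holding the elementwise sum of all inputs.
        n = len(worker_data)
        chunks = [np.array_split(w.astype(float), n) for w in worker_data]
        # Phase 1, reduce-scatter: after n-1 steps, worker i holds the
        # fully reduced chunk (i + 1) % n.
        for step in range(n - 1):
            for i in range(n):
                c = (i - step) % n
                chunks[(i + 1) % n][c] += chunks[i][c]
        # Phase 2, all-gather: circulate each finished chunk around the ring.
        for step in range(n - 1):
            for i in range(n):
                c = (i + 1 - step) % n
                chunks[(i + 1) % n][c] = chunks[i][c].copy()
        return [np.concatenate(c) for c in chunks]

    out = ring_allreduce([np.ones(8) * r for r in range(4)])
    assert all(np.array_equal(o, np.full(8, 6.0)) for o in out)   # 0+1+2+3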
Model parallelism handles the case where the model itself does not fit on a single GPU, by partitioning the
model across devices and passing intermediate activations between them. The naive version leaves most GPUs idle
most of the time.
Pipeline model parallelism splits each batch into smaller micro-batches and pipelines them through the stages,
so that all GPUs are doing useful work in steady state.
The largest training runs combine data and model parallelism, often layering three or four parallelism
strategies together. A model may be sharded across GPUs, with multiple copies
of the sharded model training on subsets of the data in parallel. Each axis of
parallelism has its own communication bottleneck.
9.2.2 Inference
Most new AI infrastructure being built today is inference-focused
rather than training-focused.
Inference used to be a much simpler problem than training, but reasoning
models and agents can now turn one logical query into many sub-inferences,
potentially spanning multiple models.
Modern serving systems must pack many models onto shared GPUs and share memory
across requests where possible. The target is to minimize cost per inference
and cost per token while meeting latency and accuracy requirements; sharing
memory across queries and minimizing waste, e.g. from fragmentation, is a
major theme.
9.3 Open Challenges
9.3.1 Testing and Debugging
Testing deep learning models in safety-critical settings is an unsolved problem. A held-out accuracy score does
not guarantee coverage of the corner cases that cause real failures, and labeled data is expensive.
Adversarial testing finds inputs whose imperceptible perturbations cause dramatic misclassification but is
criticized as unrealistic. Approaches like DeepXplore frame testing as an optimization problem, generating
realistic inputs that maximize neuron coverage and finding cases where multiple models disagree.
9.3.2 Versioning and Rollback
Version control and rollback are far behind their software counterparts. A model's behavior depends on its code,
training data, features, hyperparameters, and random seeds, so small changes anywhere can shift accuracy
meaningfully.
The problem is worse for foundation models, since fine-tuned derivatives all inherit any bug in the base model,
and patching requires re-fine-tuning every descendant. There is no widely accepted tool for tracking these
lineages.
9.3.3 Heterogeneous Deployment
Heterogeneous deployment to edge devices turns one model into a fleet of effectively different models.
Quantization differences, compression pipelines, and hardware variation cause both accuracy and latency to drift
across devices, on top of the additional constraint of power consumption.
9.3.4 Privacy
Models leak information about their training data. The earlier belief that aggregation preserves privacy turns
out to be wrong in every model class examined. Generative models autocomplete training data, recommender models
can be inverted from auxiliary information, and classifiers are vulnerable to membership inference attacks.
Differential privacy protects users at the source; the model never even
processes the protected sensitive information. Systematic, guaranteed noise
is added so that an observer cannot tell whether any particular
individual's data was included.
This is achieved by adding calibrated noise so that any single
individual's presence has bounded effect on the output. The cost is some
loss of accuracy, but the guarantee composes well across repeated queries
and does not keep getting broken by the next attack.
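A minimal Laplace-mechanism sketch for a counting query; epsilon and the toy
data are illustrative.

    import numpy as np

    def dp_count(records, predicate, epsilon):
        # A count changes by at most 1 when one person is added or removed
        # (sensitivity 1), so Laplace noise of scale 1/epsilon hides any
        # individual's presence up to the chosen privacy budget.
        true_count = sum(1 for r in records if predicate(r))
        return true_count + np.random.laplace(scale=1.0 / epsilon)

    ages = [34, 29, 41, 52, 38]
    print(dp_count(ages, lambda a: a > 35, epsilon=0.5))   # noisy count near 3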
10. RAG and Vector Databases
Language models know only what was in their training data. For
applications that depend on private or rapidly changing information,
retraining is impractical and the context window is too small to hold
everything relevant. Retrieval-augmented generation handles this by
retrieving the most relevant chunks from an external store at query time and
prepending them to the prompt.
The retrieval store is a vector
database. Documents are chunked and passed through an embedding model that
produces fixed-dimensional vectors whose geometry encodes semantic meaning,
so that related texts map to nearby vectors. Queries are embedded into the
same space, and the database returns the top-k chunks closest to the query.
Approximate nearest-neighbor search is the norm because exact search does
not scale.
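A sketch of exact top-k retrieval, with random vectors standing in for real
embeddings; production systems replace this linear scan with an approximate
index.

    import numpy as np

    rng = np.random.default_rng(0)
    doc_vecs = rng.normal(size=(1000, 384))   # stand-ins for chunk embeddings
    query_vec = rng.normal(size=384)          # stand-in for the embedded query

    def top_k(q, docs, k=3):
        # Cosine similarity is a dot product once everything is unit-length.
        docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
        q = q / np.linalg.norm(q)
        scores = docs @ q
        return np.argsort(scores)[-k:][::-1]   # indices of the k nearest chunks

    print(top_k(query_vec, doc_vecs))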
Vector indices come in two main families. Clustering
indices partition the space using something like k-means and search only the
clusters near the query.
Graph indices such as HNSW build a navigable graph and use
greedy traversal toward the query, providing state-of-the-art accuracy and
speed at the cost of higher memory and build complexity.
Realistic
indices contain billions of vectors and exceed single-server memory. On-disk
indices keep most of the data on disk and cache the hot parts, but disk
bandwidth becomes the new ceiling and the mismatch between disk block size
and graph node size wastes bandwidth on every read. Access is heavily skewed
in practice, which makes caching effective. Open problems include efficient
updates and deletes, smarter caching policies, and partitioning across
servers.