Rating: 8.0/10.
System Design Interview – An insider’s guide by Alex Xu
A book written to help readers prepare for system design interviews, offering several case studies of systems and how the author would design them. It also presents common patterns used in large distributed systems. In each case, while the final system can become quite complex, it evolves one piece at a time from an initial version that is typically just a single server.
A few recurring themes are evident throughout. One key principle is avoiding single points of failure: the malfunction of one component shouldn’t bring down the entire system, so most designs include data replication, multiple workers behind a message queue, a load balancer, and so on. Common patterns include consistent hashing to distribute data evenly across multiple servers; message queues to break a series of steps into stages that can be massively parallelized, with the number of workers at each stage scaled independently; and stateless servers, which are favored over stateful designs because they make scaling out more straightforward.
Chapter 1: How to design a system to handle millions of users. The simplest system is just a web server, but it won’t manage many users. The first improvement is to separate the web service from the database, allowing the two to be scaled up independently. Vertical scaling means making the server bigger, but there are fundamental hardware restrictions on how much you can do this. On the other hand, horizontal scaling is more desirable for large-scale applications because it is not limited by hardware.
A load balancer evenly distributes incoming traffic to the web service. You can have this at both the web server layer and the database layer. In the database, you typically have a master database that handles all writes, and slave databases that are read-only and replicate data from the master database.
Next is caching. A cache is temporary storage for the results of expensive operations, so when a similar request arrives, the result can be served from the cache instead of hitting the database or web server again. There are challenges in how to invalidate the cache and keep it consistent, especially across different regions. One type of cache is a Content Delivery Network (CDN), which is great for serving static assets and is geographically dispersed, so users in different regions get low latency.
Web architectures can be stateful or stateless. In stateful architectures, the service stores state data, which is not ideal for scaling across multiple servers. In a stateless architecture, none of the web services store data; only the database layer does, making it more scalable. A message queue is a helpful architectural component to communicate between multiple workers since the producer and consumers can be scaled up independently.
Sharding databases is beneficial when a single database can’t manage all the application data. The challenge is how to distribute users among the shards to ensure even distribution, especially when some celebrities might generate a lot of traffic. Generally, it’s not possible to perform join operations when sharded, so often the database needs to be denormalized.
Chapter 2: Back-of-envelope estimation is a technique to roughly estimate something, and you should have an idea of various latency numbers, like the time to perform various memory, disk, and network operations. Generally, cache < memory < disk < network for latency. Availability numbers are expressed in terms of how many “nines” of uptime, and you should think of this as how much time per day and per year your service is permitted to be down.
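As a quick illustration (my own back-of-envelope sketch, not from the book), the allowed downtime for each availability target can be computed directly:

```python
# Allowed downtime for a given availability target (illustrative numbers).
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY

for nines, availability in [(2, 0.99), (3, 0.999), (4, 0.9999), (5, 0.99999)]:
    downtime_per_day = SECONDS_PER_DAY * (1 - availability)
    downtime_per_year = SECONDS_PER_YEAR * (1 - availability)
    print(f"{nines} nines ({availability:.5f}): "
          f"{downtime_per_day:.1f} s/day, {downtime_per_year / 60:.1f} min/year")
```

For example, four nines (99.99%) works out to roughly 8.6 seconds of allowed downtime per day, or about 53 minutes per year.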
Chapter 3: A framework for how to approach a system design interview. You shouldn’t jump into a solution right away, but instead slow down and ask for clarifying requirements, communicating with the interviewer to co-design the system. Start with a high-level design before delving into individual architectural components, and you should be prepared to discuss edge cases like error handling, improving bottlenecks, or failure modes.
Chapter 4: Designing a Rate Limiter. Rate limiters are typically implemented as a separate component from the web server, often in a middleware layer called an API gateway. There are managed rate-limiting services available, so you don’t always need to build your own. Rate limiting rules are often keyed by a combination such as endpoint plus user ID or IP address, with each bucket rate limited separately.
There are several algorithms for rate limiting, each with its trade-offs. One is the token bucket, where each request consumes a token and tokens are slowly refilled over time. Another is the fixed window counter, which caps the number of requests within a fixed window (e.g., one hour) and then resets. The sliding window algorithm ensures that the number of requests within any sliding window doesn’t exceed a certain threshold. The state for rate limiting is typically stored in an in-memory database like Redis, since speed is crucial. When a client exceeds its limit, the server responds with a 429 (Too Many Requests) status code; it can also set response headers to alert clients when they’re approaching their limit.
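To make the token bucket concrete, here is a minimal in-process sketch of my own; a real design would keep these counters in Redis and key them per user and endpoint, and all names below are illustrative:

```python
import time

class TokenBucket:
    """Minimal in-process token bucket; production designs keep this state in Redis."""

    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity          # maximum number of tokens in the bucket
        self.refill_rate = refill_rate    # tokens added per second
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1              # spend one token for this request
            return True
        return False                      # out of tokens: respond with 429

buckets = {}  # one bucket per (user_id, endpoint) pair

def is_allowed(user_id: str, endpoint: str) -> bool:
    key = (user_id, endpoint)
    if key not in buckets:
        buckets[key] = TokenBucket(capacity=10, refill_rate=1.0)
    return buckets[key].allow()
```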
Chapter 5: Consistent hashing is useful for sharding data and requests evenly across servers. The naive hashing method would be to take the hash and use the modulus by the number of servers, but the problem is when a server is added or removed, almost all of the hashes need to be redistributed. Consistent hashing ensures that when you add or remove a server, only a small number of keys need to be remapped.
This is typically achieved using a hash ring, where the hash space is connected end-to-end, and each server occupies a specific position on the hash ring. To determine the server for a hash key, you go clockwise until you find the next one on the hash ring. When you add or remove a server, most keys remain unaffected. A problem is random distribution can introduce high variance, so the solution is to use virtual nodes, where each server is assigned multiple positions on the hash ring that all represent the same physical server, reducing the variance in key distribution.
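A small sketch of a hash ring with virtual nodes (my own illustration; MD5 and the virtual-node count are arbitrary choices here):

```python
import bisect
import hashlib

class HashRing:
    """Consistent hash ring with virtual nodes (illustrative sketch)."""

    def __init__(self, servers, vnodes=100):
        self.ring = []      # sorted list of (position, server) tuples
        for server in servers:
            for i in range(vnodes):
                self.ring.append((self._hash(f"{server}#{i}"), server))
        self.ring.sort()

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def get_server(self, key: str) -> str:
        position = self._hash(key)
        # Walk clockwise: the first virtual node at or after the key's position owns it.
        index = bisect.bisect(self.ring, (position,)) % len(self.ring)
        return self.ring[index][1]

ring = HashRing(["server-a", "server-b", "server-c"])
print(ring.get_server("user:42"))
```

When a server is added or removed, only the keys that fall between its virtual nodes and their predecessors move; everything else stays put.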
Chapter 6: Designing a key-value storage to be highly scalable. In distributed databases, you have two choices when the network fails: a CP system will block when the network is partitioned, whereas an AP system can return stale data. To ensure redundancy, data is replicated across multiple partitions, using a hash ring to determine the target server for replication.
Consistency is maintained using quorums, where a read or write must be acknowledged by a set number of replicas before the operation is considered successful. To handle inconsistent states, one method is the vector clock: each server maintains a counter that increments with every write it performs, forming a versioning vector; multiple servers can accept writes, later reconcile them, and detect conflicting writes.
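A minimal vector-clock sketch of my own (not the book’s code) showing how concurrent writes are detected:

```python
class VectorClock:
    """Minimal vector clock: one counter per server, used to order or detect conflicting writes."""

    def __init__(self, counters=None):
        self.counters = dict(counters or {})

    def increment(self, server_id: str):
        self.counters[server_id] = self.counters.get(server_id, 0) + 1

    def descends_from(self, other: "VectorClock") -> bool:
        # True if this clock has seen every event the other clock has seen.
        return all(self.counters.get(s, 0) >= c for s, c in other.counters.items())

    def conflicts_with(self, other: "VectorClock") -> bool:
        # Neither clock descends from the other: the writes are concurrent.
        return not self.descends_from(other) and not other.descends_from(self)

a = VectorClock({"s1": 2, "s2": 0})
b = VectorClock({"s1": 1, "s2": 1})
print(a.conflicts_with(b))  # True: concurrent writes that need reconciliation
```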
Another aspect is handling failures. In the gossip protocol, servers periodically emit heartbeats with current timestamps; if a server hasn’t sent a heartbeat for a certain duration, other servers mark it as down. When a server fails, its state needs to be reconciled and transferred to another server. A Merkle tree data structure is used to determine which subsets of data need synchronization without copying all of it.
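To illustrate the Merkle-tree idea, here is a rough sketch of my own that hashes key-range buckets and combines them into a root, so replicas only synchronize buckets whose hashes differ (the bucket count and hash functions are arbitrary):

```python
import hashlib

NUM_BUCKETS = 8  # arbitrary; real systems use many more key-range buckets

def bucket_hashes(replica: dict):
    """Group a replica's keys into buckets and hash each bucket's contents."""
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for key in sorted(replica):
        idx = int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_BUCKETS
        buckets[idx].append(f"{key}={replica[key]}")
    return [hashlib.sha256("|".join(b).encode()).hexdigest() for b in buckets]

def merkle_root(hashes):
    """Pairwise-combine bucket hashes up to a single root hash."""
    level = hashes
    while len(level) > 1:
        level = [hashlib.sha256((level[i] + level[i + 1]).encode()).hexdigest()
                 for i in range(0, len(level), 2)]
    return level[0]

replica_a = {"k1": "v1", "k2": "v2", "k3": "v3"}
replica_b = {"k1": "v1", "k2": "v2-stale", "k3": "v3"}

hashes_a, hashes_b = bucket_hashes(replica_a), bucket_hashes(replica_b)
if merkle_root(hashes_a) != merkle_root(hashes_b):
    # Comparing roots is cheap; only buckets whose hashes differ need a full sync.
    print("buckets to sync:", [i for i in range(NUM_BUCKETS) if hashes_a[i] != hashes_b[i]])
```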
For write operations, there is a commit log that records the data before it is written to the storage servers and the in-memory cache. For read operations, the system first checks the in-memory cache, then uses Bloom filters to determine which on-disk data files might contain the desired key.
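A toy Bloom filter sketch of my own (the size and number of hashes are arbitrary) showing the “maybe present / definitely absent” behavior that makes it useful on the read path:

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: may report false positives but never false negatives."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # use an int as a bit array

    def _positions(self, key: str):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key: str):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: str) -> bool:
        return all(self.bits & (1 << pos) for pos in self._positions(key))

data_file_filter = BloomFilter()
data_file_filter.add("user:42")
print(data_file_filter.might_contain("user:42"))   # True
print(data_file_filter.might_contain("user:999"))  # almost certainly False
```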
Chapter 7: Designing a unique ID generator that is distributed. The problem is to generate IDs that are incrementing, so later IDs are larger than earlier ones, and they must be unique. Randomly generated UUIDs don’t have this auto-incrementing feature. Using ticket servers introduces a single point of failure; if the ticket server goes down, the whole system goes down. Therefore, one approach is the snowflake method, where the ID consists of the timestamp combined with the machine ID. Each machine has its independent sequence number that increments with each new request, ensuring that the IDs generated by one server can’t conflict with another’s.
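A rough snowflake-style generator of my own; the bit layout follows the common 41/10/12 split, and the custom epoch below is just an illustrative constant:

```python
import threading
import time

# Illustrative layout: 41 bits timestamp (ms) | 10 bits machine ID | 12 bits sequence.
EPOCH_MS = 1288834974657  # arbitrary custom epoch for the example

class SnowflakeGenerator:
    def __init__(self, machine_id: int):
        assert 0 <= machine_id < 1024
        self.machine_id = machine_id
        self.sequence = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now_ms = int(time.time() * 1000)
            if now_ms == self.last_ms:
                # Same millisecond: bump the per-machine sequence number.
                self.sequence = (self.sequence + 1) % 4096
                if self.sequence == 0:          # sequence exhausted, wait for the next ms
                    while now_ms <= self.last_ms:
                        now_ms = int(time.time() * 1000)
            else:
                self.sequence = 0
            self.last_ms = now_ms
            return ((now_ms - EPOCH_MS) << 22) | (self.machine_id << 12) | self.sequence

gen = SnowflakeGenerator(machine_id=7)
print(gen.next_id())
```

Because the timestamp occupies the high bits, IDs generated later sort after earlier ones, and the machine ID plus per-machine sequence guarantees uniqueness across servers.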
Chapter 8: Designing a URL shortener. The objective is to track URLs and shorten them. The API has a POST endpoint to request a shortened URL and a GET endpoint, which returns a 301 response, so the browser redirects to the original long URL automatically.
One straightforward method is to hash the long URL. However, this hash is likely too long, so we might keep only a prefix of it; if that prefix is already taken, we try a longer one, but this can require multiple database calls. An alternative approach is to use an auto-incrementing counter and convert it to base 62. This produces a short, readable URL and can build on the distributed unique ID generation discussed in the previous chapter.
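For illustration, converting a numeric ID to base 62 looks something like this (the ID value and alphabet ordering below are just one possible choice):

```python
import string

BASE62_ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase  # 62 characters

def to_base62(n: int) -> str:
    """Convert an auto-incrementing numeric ID into a short base-62 string."""
    if n == 0:
        return BASE62_ALPHABET[0]
    digits = []
    while n > 0:
        n, remainder = divmod(n, 62)
        digits.append(BASE62_ALPHABET[remainder])
    return "".join(reversed(digits))

print(to_base62(2009215674938))  # "zn9edcu" — a 7-character short URL path
```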
Chapter 9: Designing a Web Crawler. This is useful for mining the web for data, archiving it, or creating a search engine with the goal of scraping all the data from the internet. The crawler has several parts that run in a loop. It starts with a set of seed URLs and maintains a URL frontier, which tracks which URLs need to be downloaded. In a continuous loop, the scraper downloads an HTML page from the frontier, checks if it has been seen before, extracts the links, and adds them to the frontier. There will be duplicate pages; exact matches can be detected by comparing hashes of the page content.
One major concern is politeness: a web crawler should not send too many requests to a website in a short span of time, as this would essentially be a DDoS attack. To enforce politeness, you might use a set of queues where all the pages from one website end up in one queue, downloading only one page per queue at a time. A prioritizer may also use various queues since some pages are more likely to be high-quality than others. It’s important to respect the robots.txt standard, which a website can use to define pages to exclude.
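A sketch of the politeness idea, assuming hypothetical per-host FIFO queues with a fixed minimum delay between fetches to the same host:

```python
import time
from collections import defaultdict, deque
from urllib.parse import urlparse

class PoliteFrontier:
    """URL frontier sketch: one queue per host plus a minimum delay between fetches to that host."""

    def __init__(self, delay_seconds: float = 1.0):
        self.queues = defaultdict(deque)   # host -> pending URLs
        self.next_allowed = {}             # host -> earliest time of the next fetch
        self.delay = delay_seconds

    def add(self, url: str):
        self.queues[urlparse(url).netloc].append(url)

    def next_url(self):
        now = time.monotonic()
        for host, queue in self.queues.items():
            if queue and self.next_allowed.get(host, 0) <= now:
                self.next_allowed[host] = now + self.delay
                return queue.popleft()
        return None  # every host is either empty or still in its politeness delay

frontier = PoliteFrontier()
frontier.add("https://example.com/a")
frontier.add("https://example.com/b")
print(frontier.next_url())  # https://example.com/a
print(frontier.next_url())  # None until the per-host delay has elapsed
```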
Chapter 10: Designing a Notification System. At a high level, there are various types of notifications, including iOS, Android, SMS, and email. The notification server should unify these into a common interface. Upon receiving a request, it calls the relevant external third-party server to deliver the notification.
Before sending, the notification server might first connect to a database to gather additional user information, like opt-out preferences, then in some cases it will not send the notification. Message queues are useful when there’s a high volume of notifications, so that an overload in one type doesn’t affect others if they’re in separate queues. Further enhancements can include rate limiting, retry logic, alerting, analytics, etc.
Chapter 11: Designing a News Feed System. Think of this as a system similar to Facebook’s news feed, where users can make posts that then appear in their friends’ news feeds. The core of this design involves a fan-out service. You can choose to fan out on read or write. With fan-out on write, every time a user posts, it’s immediately sent to all their friends’ queues, making reading fast since the news feed is precomputed when you access your page. However, for celebrities with many followers, it’s better to fan out on read and pull the posts when a user fetches their feed. This way, you don’t waste resources on inactive users. Ideally, a hybrid of the two works best.
The news feed cache consists of tuples with user IDs and post IDs, representing the items a user sees. This post ID then serves as a key to another database to fetch the post content. When a user posts on your feed, the initial web server handles it, performing rate limiting and content filtering, before directing it to the newsfeed service. This service then executes the fan-out on write, updating the necessary caches and databases.
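A toy fan-out sketch of my own, with in-memory dicts standing in for the feed cache and the social graph, and an arbitrary follower threshold for switching to fan-out on read:

```python
from collections import defaultdict

# Hypothetical stand-ins for the social graph and the news feed cache.
followers = {"alice": ["bob", "carol"], "celebrity": [f"fan{i}" for i in range(10_000)]}
feed_cache = defaultdict(list)            # user_id -> list of post IDs, newest first
FANOUT_THRESHOLD = 5_000                  # above this, fall back to fan-out on read

def publish_post(author: str, post_id: str) -> str:
    recipients = followers.get(author, [])
    if len(recipients) > FANOUT_THRESHOLD:
        return "fan-out on read"          # followers pull the post when loading their feed
    for user in recipients:
        feed_cache[user].insert(0, post_id)   # precompute each follower's feed
    return "fan-out on write"

print(publish_post("alice", "post:123"))      # fan-out on write
print(publish_post("celebrity", "post:456"))  # fan-out on read
print(feed_cache["bob"])                      # ['post:123']
```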
Chapter 12: Designing a Chat System. The main challenge is maintaining a connection between the server and client. Given that the HTTP protocol doesn’t easily allow the server to send messages to the client, the client keeps a long-lasting websocket connection to the server when online. Server memory is a constraint here since each connection consumes a notable amount of memory.
There are two data types in this system. The first is generic user data, storable in relational databases. The bulk of data, chat data, is better suited for specialized key-value stores, and usually only recent chats are accessed. Each message ID should be unique, at least within a chat group, so this ID generation can happen locally rather than needing a global ID generator.
When a message is sent, it’s appended to the message queue of all group members; this works best when there’s a cap on group chat size. You can think of it as an inbox for each recipient. Each device keeps track of its local maximum ID that it has seen, fetching subsequent IDs from the queue. To manage users’ online and offline states, use a presence server. This server stores the online and offline statuses in a dedicated key-value store. A heartbeat mechanism determines online status; if there’s no heartbeat for a while, the user is deemed offline. Changes in online status are then pushed to the message queue for all connected friends.
Chapter 13: Designing an Auto-complete Search System. This task involves creating a system similar to Google search where, as users type in a prefix, the system fetches the most common queries matching that prefix. Standard relational databases aren’t efficient for this, so we use a trie data structure. In this tree, each node represents a prefix leading to another tree with all words starting with that prefix; the leaf nodes contain words and their counts. This structure can be serialized into a KV database, where the key represents the node and the value is a dictionary of child nodes. Popular prefixes should be cached for faster access.
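A small trie sketch of my own (a real system would precompute and cache the top queries at each node rather than walking the subtree on every request):

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # next character -> TrieNode
        self.count = 0       # how many times the full query ending here was searched

class AutocompleteTrie:
    def __init__(self):
        self.root = TrieNode()

    def add(self, query: str, count: int = 1):
        node = self.root
        for ch in query:
            node = node.children.setdefault(ch, TrieNode())
        node.count += count

    def top_k(self, prefix: str, k: int = 5):
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        # Collect every completed query below this prefix and keep the k most frequent.
        results = []
        stack = [(node, prefix)]
        while stack:
            current, word = stack.pop()
            if current.count:
                results.append((word, current.count))
            for ch, child in current.children.items():
                stack.append((child, word + ch))
        return [w for w, _ in sorted(results, key=lambda x: -x[1])[:k]]

trie = AutocompleteTrie()
for query, count in [("twitter", 35), ("twitch", 29), ("twilight", 10)]:
    trie.add(query, count)
print(trie.top_k("twi"))  # ['twitter', 'twitch', 'twilight']
```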
To keep the system efficient, the trie doesn’t need real-time updates and can be regenerated weekly. Data is sharded by prefixes, and not just by individual letters but by letter groups, as not all letters have the same starting frequency.
Chapter 14: Designing YouTube. The goal is to enable video uploads and ensure efficient streaming on a large scale. During an upload, the video goes to an original storage bucket. The system then transcodes the video into various formats suitable for streaming at different resolutions, triggering services that generate video metadata. These streaming formats are optimized in several ways to allow streaming without downloading the entire video. This content is served via a CDN for efficiency. The transcoding process is represented by a Directed Acyclic Graph (DAG) since video, audio, and metadata generation happen separately. Schedulers manage the DAG steps, with each step having its own queue so that the amount of parallelism at each stage can be scaled independently.
Chapter 15: Designing Google Drive. The goal is to let users upload files to a shared drive and synchronize them across devices. Initially, this could be done using a file system, but services like S3 can replicate files across regions for better reliability. Large files are split into smaller blocks, so when part of a file changes, only the affected blocks need to be re-transmitted. The system has a block storage service paired with a metadata server to track these blocks. When a file is updated, a notification service alerts all other connected clients. Those clients then fetch the metadata for the changes and download the relevant blocks.
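A sketch of the block-level delta idea, using a fixed (and arbitrary) block size and comparing block hashes to find what changed:

```python
import hashlib

BLOCK_SIZE = 4 * 1024 * 1024  # 4 MB blocks, an illustrative choice

def block_hashes(data: bytes):
    """Split a file into fixed-size blocks and hash each one."""
    return [hashlib.sha256(data[i:i + BLOCK_SIZE]).hexdigest()
            for i in range(0, len(data), BLOCK_SIZE)]

def changed_blocks(old_hashes, new_hashes):
    """Only blocks whose hashes differ (or brand-new blocks) need to be re-uploaded."""
    return [i for i, h in enumerate(new_hashes)
            if i >= len(old_hashes) or old_hashes[i] != h]

old = block_hashes(b"a" * 10_000_000)
new = block_hashes(b"a" * 5_000_000 + b"b" + b"a" * 4_999_999)
print(changed_blocks(old, new))  # [1]: only the second block changed
```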