Baozi Training Blog (包子IT面试培训博客): System design interview: how to design a feeds system (e.g., Twitter, Instagram and Facebook news feed)

Friday, December 27, 2019

Please use this "READ MF!" framework for software engineer system interview purpose.

So far the best detailed explanation on designing twitter is from Raffi (used to be VP of Twitter) on QCon. Presentation, Slides [All credits go to QCon). Raffi is very smart and articulate, really solid guy!
Read heavy system, not write heavy, optimized for user timeline. To be clear, there are two timelines, one is user's own tweets (easy to do), the other is the main timeline which is an aggregation of all the tweets from the people that user follows.
Pre-calculate all the timelines. This is the interesting part of the design vs using a mysql and having index to query in realtime, which would not be scalable. When a tweet is posted, the tweets service would

Store this tweet in memory, and that later would be flushed to a main DB
Call the fanout deliver service to publish this tweet to all the users' timeline that followed this particular user. It could simply store a tweet ID (and later the content could be retrieved from the Tweets Cache) or hydrating the entire text content is also fine (note how we want to handle eidt or delete, twitter probably doesn't allow delete)
Call search service to index (Lucene). The search index is also hosted in memory on Redis. Note search here needs to fanout to all search clusters but due to the in memory hosting, it's acceptable.

Always remember disk access is at least 100k times slower than memory access, e.g, disk is 10ms vs 100ns on memory. https://gist.github.com/jboner/2841832
With the pre-caculate timeline design, there might be race conditions when celebrity (people with millions of followers) starts to talking to each other with replies. E.g., celebrity A tweets something, takes 30 sec to deliver to all the followers, celebrity B replies before deliver finishes, some followers follow both A and B might see B's reply first before A's original post. One cheat workaround is to sort by timestamp or tweet IDs, but they are also experimenting with only pre-calculate non-celebrity tweets, and when generating timeline, realtime fetching the celebrity tweets. It depends which way is better and also the user experience. This is a good stop point to talk to your interviewers in real world about tradeoffs.
Since twitter is heavily relying on cache, you might want to checkout how they optimized the caching with twemproxy.

Baozi Training Blog (包子IT面试培训博客)