Saturday, April 4, 2020

System design interview: how to design a video platform (e.g, Youtube, Netflix)

System design interview: how to design a video platform (e.g., Youtube, Netflix)

Methodology: READ MF!

[Originally from the Post: System design interview: how to design a chat system (e.g., Facebook Messenger, WeChat or WhatsApp)]

Remind ourselves with the "READ MF!" methodology.

Key designs and terms 

Let's first admit this is a hard problem. Youtube has hundreds of smart engineers keep improving it and it's almost impossible to design a good one in 60 mins. This solution is a pretty decent one from Grokking the system design interviews where it covers the backbone about the system.
You can also divided the system as usual into three major layers, presentation, service and data layer.

A few key terminologies. If you are not familiar, brush up a bit
  • CDN: a distributed cache for videos
  • Codec: encoding and decoding, a video would be encoded to different resolutions and different formats, e.g, 16:9 mp4, 4: 3 avi
  • Video processing is slow, normally done asynchronously via queue and workflows 
  • Video de-duplication, hash content (SHA, MD5), phase correlation, block matching. If you forget about Fourier Transformation, watch 3Blue1Brown
  • Metadata sharding, pros vs con. If shard videos by user uuid (hot partition for popular users). If shard videos by video uuid (hot partition for popular videos)
  • To solve hot hot partition / hot key: Consistent Hashing (original paper 1997: Consistent Hashing and Random Trees:Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web)  + Load balancing + Replicate data (Need to automatically discover the hotkey first, redis has built-in support). Alibaba Cloud Read
  • Comment/Reply design, Likes and View counts (all sorts of counters) would be explained in a separate post.

 Baozi Youtube Video


References (Credits to original authors)

Saturday, March 28, 2020

Leetcode solution 190: Reverse Bits

Problem Statement 

Reverse bits of a given 32 bits unsigned integer.

Example 1:
Input: 00000010100101000001111010011100
Output: 00111001011110000010100101000000
Explanation: The input binary string 00000010100101000001111010011100 represents the unsigned integer 43261596, so return 964176192 which its binary representation is 00111001011110000010100101000000.
Example 2:
Input: 11111111111111111111111111111101
Output: 10111111111111111111111111111111
Explanation: The input binary string 11111111111111111111111111111101 represents the unsigned integer 4294967293, so return 3221225471 which its binary representation is 10111111111111111111111111111111.

Note:
  • Note that in some languages such as Java, there is no unsigned integer type. In this case, both input and output will be given as signed integer type and should not affect your implementation, as the internal binary representation of the integer is the same whether it is signed or unsigned.
  • In Java, the compiler represents the signed integers using 2's complement notation. Therefore, in Example 2 above the input represents the signed integer -3 and the output represents the signed integer -1073741825.

Follow up:
If this function is called many times, how would you optimize it?
Problem link

Video Tutorial

You can find the detailed video tutorial here

Thought Process

Use this problem to bring awareness of bit manipulation in interviews, it's rare but sometimes do get asked by some interviewers.

We can directly use Java's Integer.reverse() method, not suitable for interview purpose.

Simple simulation, reading bits from the right most and adding them up while shifting left.

Regarding the follow up, caching is an obvious solution. There are two ways to cache
  • Cache all the integers,  2^32 = 4294967296, total memory needed is 2^32*4B = 16GB roughly, easily handled by one server 
  • Cache each byte, divide the 32 bits into 4 bytes. Memory footprint is only 2^8*1B = 256B, significantly smaller than 16GB
If you are curious on how Java SDK implemented this method, it uses the algorithm in Hackers Delight book section 7-1
 1195       /**
 1196        * Returns the value obtained by reversing the order of the bits in the
 1197        * two's complement binary representation of the specified {@code int}
 1198        * value.
 1199        *
 1200        * @return the value obtained by reversing order of the bits in the
 1201        *     specified {@code int} value.
 1202        * @since 1.5
 1203        */
 1204       public static int reverse(int i) {
 1205           // HD, Figure 7-1
 1206           i = (i & 0x55555555) << 1 | (i >>> 1) & 0x55555555;
 1207           i = (i & 0x33333333) << 2 | (i >>> 2) & 0x33333333;
 1208           i = (i & 0x0f0f0f0f) << 4 | (i >>> 4) & 0x0f0f0f0f;
 1209           i = (i << 24) | ((i & 0xff00) << 8) |
 1210               ((i >>> 8) & 0xff00) | (i >>> 24);
 1211           return i;
 1212       }

Solutions

Simple simulation with follow up question

Time Complexity: O(N), where N is number of bits in the input integer
Space Complexity: O(1), no extra space needed 

References

Saturday, March 21, 2020

System design interview: how to design comments and reply, likes button and total views on Youtube

System design interview: how to design comments and reply, likes button and total views on Youtube

Methodology: READ MF!

[Originally from the Post: System design interview: how to design a chat system (e.g., Facebook Messenger, WeChat or WhatsApp)]
Remind ourselves with the "READ MF!" methodology.

This is a follow up on the previous post: System design interview: how to design a video platform (e.g, Youtube, Netflix)

Requirements

First, let's quickly about requirements. Likes and views are relatively straightforward, users can torllerate a bit delay and inaccuracy. For the majority case, as long as user clicks the button, it pluses one, then it's fine. Sometimes it's not the case if videos have too many likes, e.g., if a Youtube video already has 10K likes, you plus 1, it still shows 10K, it's just the UI tricks you that it's toggled.

For comments reply, there are a few different styles. The major ones such as Youtube, Instagram and TikTok uses following style. It displays comments (directly reply to video) based on order of likes and timestamp (descending) and any reply to comments are only 1 on 1, meaning A@B How are you, then B@A I am fine thank you and you? There is no more indentation needed.








Reddit uses the "block building" reply-to style (中文俗称"盖楼"), where it shows which reply replies to which reply, and it needs to show the indentations about those replies.





Estimation

For viral videos, say normally it has around 10M views
For likes, assuming 20% people liked a video, 10M * 20% = 2M likes
For comments, 1% people would leave a comment (we are lazy) 10M * 10% = 100K comments

For the majority normal videos, it would probably has 1000 views and 100 likes top and maybe 10 comments, a relationship DB could solve it pretty well.

 

Key designs and terms 

Comments design

If you start building your product, just bootstrap it with a relational DB
Introduce a comments table shard by video UUID, add a reply_to_uuid to know which comment the reply is targeted to and leave it null for root comment. Build an index on the reply_to_uuid

Select * from comments where reply_to_uuid is null order by comment likes desc, timestamp desc


If you need to see the replies to those comments, just

Select * from comments where reply_to_uuid is the_target_comment_uuid order by comment likes desc, timestamp desc

Even if your product becomes Youtube scale, the comments would be around 100K for viral videos, the above solution would still works fine. Simply add more capacity to better shard your comments using consistent hashing, cache the comments would do the trick.

If you need to build the Reddit tree structure, just sort it in memory. If the problem can fit into memory, it becomes much easier.

The extreme case is your comments section becoming a chat, then we can do something like an append only in memory DB or redis cache keep appending the values to the queue with async backup to DB.

Views and Likes count design

Similarly, when you bootstrap the project, keep a counter in DB or in memory cache solves your problem when traffic is low. If within one machine, you don't even need locks just use compare and swap (CAS), atomic operations for counting, thread safe.

If your product starts to become popular, add more capacity using consistent hashing. Add in memory cache like Redis to count the values (memory access time 100us vs disk access time 10ms. 100Kx improvement). Could be further optimized using distributed counter, aggregating the results together when read.

If you product becomes YouTube scale, then use offline counting. Build a pipeline to promote the videos from cold to hot/viral once the view counts hit a certain threshold (say 1M). Use async messaging like Kafka to ingest from those logs and pump it to data warehouse, query it and update the values on a cron schedule. Of course on the UI side, you need to toggle the like button, plus 1 if needed (Sometimes you would see a 100K likes video, even if clicked the like, the count would not be increased)

 Baozi Youtube Video

References (Credits to original authors)

Saturday, March 14, 2020

包子聊天系列 - Perf Review到底怎么回事儿?如何拿到好的Rating?

包子聊天系列 - Perf Review到底怎么回事儿?如何拿到好的Rating?

大家没事儿的时候当做podcast听吧,对于new grad新兵快速往上升职还是有帮助的。我们不是大佬,说的也不一定都对,但是我们确实是老呀,So show some respect son!😃😃😃😃



Tuesday, March 10, 2020

美国菜鸟买枪看家护院指南

【作者】 包子小旋风老师  【包子铺里聊IT 微信公众号
 
新冠肺炎美国各州新增案例,不少小伙伴动了买枪的念头。一个朋友今天了把散弹枪,问他什么gauge,他说gauge是什么,问他枪管多长,他说没仔细看...虽然国买枪虽然和买白菜差不多,但枪的参数因为用途,差别很大,这篇文章写第一次买枪想用来家防菜鸟
买枪需要什么资格?
如果你没有绿卡,可以买,但比较麻烦,你必须要有打猎证才可以买枪。获取打猎证需要经过hunter education,可以选择线下上课2,3天,或者线上课程和线下半天课,然后通过多选题的考试。大概是100题要对90题以上。
如果你有绿卡,你只需要获取Firearm Safety Certificate,你可以上网搜搜题库,然后打电话去你附近的枪店,说你想要考这个证,有时可以随时答题,有时要预约,去店里做多选题,大概是30题主要对25以上就通过了,题目是基本安全须知。
什么枪适合家防Home Defense?
枪一般分为手枪Pistol,散弹枪Shotgun和来复枪Rifle。其中来复枪大多用来长距离精确射击,用于家防的枪,目标很近,所以基本不用来复枪。合适的手枪和散弹枪都可以用来家防。但绝不是所有手枪或散弹枪都适合家防。
我们来比较下枪和散弹枪。手枪的优势小巧灵活,在被近身或者空间狭小的情况下散弹难以发挥。散弹枪的优势是瞄准容易射击稳定,因为散弹枪和有三个接触点(肩膀和两个手)比只有一个接触点(两个手放在一起)的手枪稳定很多,后坐力更容易控制。散弹枪的子弹是很多的小钢珠,命中率也会比手枪的单一弹头高很多。小旋风散弹枪打20米外的运动的飞盘的命中率,远远大于用手枪设计10米外固定的靶子
到底选哪种枪,非常建议找玩枪的朋友带你去靶场试试看,很大可能你会喜欢一种远大于另一种,买适合自己的枪,不光图个顺手,更重要的是你会更愿意去练习射击,而只有非常熟悉的枪和丰富的射击经验,关键时刻才真正能起到保护自己的目的。
具体选哪一款枪呢?
对于第一把枪,大多数菜鸟会选择一把便宜的,400-500之间。推荐买最popular的大品牌的枪,大品牌的枪可靠性更高。枪的品牌非常多,较熟悉的散弹枪品牌有Remington, Winchester, Benelli, Beretta, Browning, Mossberg, Ruger, Marlin等等,这些牌子的入门级shotgun都可以入手。包子君自用的是史上最Popular的Remington 870。在很多电影里都出现它,新手必备。
手枪的话,常见的品牌除了上面这些散弹枪品牌,还有Glock, Sig, Smirth&Wesson, CZ USA等等。其中最Popular的是Glock 19,口径9mm的。
枪有什么重要的参数要留意?
选定了一款散弹枪后,需要注意三个参数:第一个是口径也叫gauge或者caliber,代表枪管的直径,常见的是12gauge和20guage,12的子弹比20的粗,火药更多后坐力更大,钢珠更多,枪身也更重建议175以上的男生用12gauge,身材小一点的男生和女生,都适合20gauge。如果你让你的女票或老婆去打12gauge,那么从今以后去靶场,她肯定不会跟你一起了...第二个枪管的长度barrel length。默认的散弹枪的长度大多28寸,家防的应用场景,需要枪管更短才灵活一般用18寸左右的。如果你想回头拿去打猎,那你需要换个28-30的枪管来增加有效距离第三个参数是子弹的长度和钢珠的大小,最简单的方法是告诉你是用来家防的,让他推荐子弹,别自己去选,多半会选错。
手枪的参数较少,主要看Caliber,9mm是最流行的,它后坐力包子君觉得蛮大的...
考好了证,选好了枪,怎么去买?
去你附近的枪店(连锁的Sportsman Warehouse, BassPro, Big 5, Dicks或者gun range),带上你的real ID的驾照或者护照,绿卡或Visa,和上面说的证件就可以去挑选枪支了,选好后,要做背景调查(费用37.5),填很多的表格之后,过10天之后去店里取。和任何购物一样,枪也时常会有打折,你可以在这些店的网站上刷deal和评论。
有了枪,你就安全了吗?
绝对不是,你需要练习,练习,练习!千万不要觉得自己有了枪,遇到坏人拿出来就能把他吓走。你一定要去靶场射击练习至少10次。试想,坏人闯入你家,你迅速的拿到了枪,情况紧急,你如果不能2秒内架好枪,完成瞄准,随时可以射击来形成威慑,坏人很有可能已经射击或者冲到你的面前夺枪。射击和所有运动一样,多练习就会提高,如果你没有个1000发射击经验,遇到紧急情况,肯定无法发挥作用,还不如没有枪,可以去专心考虑逃跑或者别的方法脱困。湾区附近打飞盘和手枪的地方包子君推荐Sunnyvale Rod & Gun Club,这里的工作人员都是Volunteer,对新手非常友好。
其实啦,枪是蛮好玩的,当你打碎飞在空中的飞盘,或者击中靶纸的10环,都会觉得很爽枪既是武器,也是很好玩的玩具。有了枪,你会去关注枪的制作工艺,制作精良射击优美的枪,也是件简约而美丽的艺术品。而打猎,更是一种特别的了解自然的方式,付出辛苦徒步找寻猎物,再把他们做成佳肴,都是很棒的体验。当你入手了第一把枪,说不定你就会发现一项新的爱好。

包子粉丝们如果有什么问题,请在下方留言

新冠肺炎来袭, 视频onsite指南

【作者】 包子霹雳火老师  【包子铺里聊IT 微信公众号

这两天在北美最火的话题应该就是出门要不要戴口罩,以及今天要不要去公司上班了,小编连续在家work from home了两天就感觉得一个人上班是多么的孤单寂寞冷,无比想念大家一起在公司上班有说有笑,谈天谈地的日子。

随着新冠肺炎目前在北美有愈演愈烈之势,再加上蜜汁自信的美国政府跟毫无能力的基础医疗,很多大公司已经开始鼓励自己的员工work from home同时也有一些公司开始取消onsite面试来减少病毒传播的风险。

同一时间很多onsite面试也被各个公司用OA或者视频面试来取代。那么对于正在准备面试的小伙伴来说,这次疫情带来的影响是很大的,除了面试被取消的风险之外,还有全新的视频onsite的经历,这些都会给前期的面试准备带来一些不确定的因素。

在此,包子君根据以往长期的视频模拟面试经验,总结一些个人觉得有用的信息,希望给在准备的面试和即将要做视频onsite的小伙伴一些帮助。

  •  视频onsite如何更好地做自我介绍
    • 很多小伙伴可能把自己的简历准备得很是熟练,尤其是介绍past experience的时候很多时候在onsite中都可以通过画图的方式来体现自己的projectimpactcomplexity. 但是当onsite转到视频之后这些准备好的图可能一下就无法展现出来了,这时候小伙伴就应该换一种思维去想面试官想要从你的自我介绍中得到什么?
    • 自我介绍一般都是面试官给面试者一个熟悉环境,放松的时间,所以从面试官本身来讲更多是期待面试者能够自然地总结自己过去的经历。这个时候就需要面试者能够有以下的一些能力
      • High-level 讲述自己组或者自己负责serviceresponsibilitybusiness impact
      • 指出在整个system中或者整个project中自我的contribution. 这里的contribution不仅仅是写的code, 还可以是很多自己driveprocess
      • Detail-oriented to key components, 最后面试者要能够详细的讲出自己contribution里面的technical details。这一步一般在视频onsite的时候会被省略掉,但是如果面试官问起来,面试者还需要能够用最简单明了的语言去描述。
    • 总上所述,面试者在准备自我介绍的过程中,可以试着把自我介绍介绍给非CS的小伙伴,看看别人是否能大概知道你在公共做或者学校project里有什么样的作用.

  • 视频onsite如何更好地做online coding
    • 在视频onsite过程中,对于面试这最熟悉的应该就是coding的部分了但是跟普通onsite不同,在视频过程中你的表情,说话语气,包括神态都会放大般的展现在面试官的面前。所以在此,小编要给大家提醒一下几点
    • 不要自言自语,尤其是说中文。因为在小编过往模拟面试的过程中,有些面试者还时不时得自言自语两句中文,这在视频面试中是一个很大的red flag.
    • 跟面试官沟通,沟通,沟通!!!重要的事情说三次。Onsite面试的时候有些面试者喜欢那道题之后自己思考,但是在视频面试中,面试者一定要学会think loud and keep talking. 要让面试官知道你的想法是什么,你的思路是否正确,或者你stuck在哪里。对于面试官来讲,面试的过程是一个选择自己潜在同事的过程,而不是一个考试的过程,所以面试官注重的一点就是能否跟面试者有有效的沟通。
    • Take feedback. 在一般onsite面试过程中,如果面试者的solution有问题,面试官都会给一些hint或者一些feedback通过各式各样的交流。但是在视频面试过程中,面试官可能就会通过举反例或者提问的方式向面试者提供一些feedback,这时候面试者就应该去jump out然后去想这些feedback是否对你的solution有用.
    • 最后,还是要鼓励大家熟悉一些自己coding language. Online coding有些公司是需要跑code的,所以对自己的language一定要熟悉,小编见过很多人喜欢用python只是因为简单好写,但是到了online coding的部分,如果你选了你不熟悉的语言,不知道怎么debug,那么从面试官的角度来讲就是你technical skills不过关了。

  • 视频onsite如何更好地做system design
    • 对于System design的问题视频onsite确实会有很大的挑战。首先面试者没有自己擅长的白板,这样很多架构图就无法展现出来。这时候对于面试者的挑战就变成了如何清晰地展现自己处理问题的思路和如何利用好bullet point来表达自己的观点了。
    • 通常来讲,面试者先要把system designrequirement清晰地记录下来,不明确的要去跟面试官沟通,把需求确定好。
    • 其次就是要把整个system分成不同的部分,这时候就要用bullet point把自己的思维一步步得展现给面试官。就当自己在写ppt的概要
    • 做完上述两步之后,面试者就可以跟面试官讨论然后着重挑其中的一个componentdetail design. 比如db schema, API signature 然后customer use scenario , error handling. 这些都可以用简单的statement写好
    • 最后面试者需要做出一些summary的总结,列举出pros and cons.向面试官展现出一个完整的设计思路

  • 视频onsite如何更好地做behavior question
    • 对于behavior 来说,应该是对于视频onsite来说最直观的了。重要的一点就是管理好自己的表情,表达方式,毕竟onsite面试官看到的是一个整体的人,而在视频onsite中看到的只是面试者的脸。所以任何小的问题都会被放大。

最后小编希望新冠肺炎早日结束,至少希望迷之操作的美国政府能想我们大中国一样拿出点有力地防护措施。也希望大家早日找到心仪的工作。

包子粉丝们如果有什么问题,请在下方留言