Wednesday, July 24, 2019

Leetcode solution 258: Add Digits

Problem Statement 

Given a non-negative integer num, repeatedly add all its digits until the result has only one digit.
Input: 38
Output: 2 
Explanation: The process is like: 3 + 8 = 11, 1 + 1 = 2. 
             Since 2 has only one digit, return it.
Follow up:
Could you do it without any loop/recursion in O(1) runtime?
Problem link


Video Tutorial

You can find the detailed video tutorial here

Thought Process

Just do a simple brute force simulation. Unless you know what a digital root is, beforehand solving it in O(1) cannot collect any useful signals from the candidate.

The triple bar equal is quite interesting. It means congruence. E.g., if you have two exactly same figures but one is rotated or clock points to 13 and 1, they are considered congruence (just like 13 and 1 mod by 12 are equal). In our example,  it's about the mod result, e.g.,

Side note:
I am sharing with everyone on this problem just to express how useless this problem is... Honestly if someone ask you to solve this problem in O(1) time in a real interview, rethink joining that company. If you are an interviewer, please do yourself a favor not asking your candidate to solve it in O(1). It's an okay question to ask as a warm up just to calm your candidate down for the brute force solution. 


Brute force

Keep summing the each digit until the final number is < 10

Time Complexity: O(N), N is the length of the digit
Space Complexity: O(1), No extra space is needed

Use the digital root formula

                                              return (num - 1) % 9 + 1
Time Complexity: O(1)
Space Complexity: O(1)


Sunday, July 21, 2019

分布式Streaming Data Processing - Samza


现在的主流的互联网应用越来越依赖streaming data来提供用户一些interesting statistics insights。以linkedin为例,最近90天有多少人看过你的linkedin profile。看过你profile的人都是什么job title,他们都在那些公司工作。如下图,你应该如何实现这个功能呢?

相信大家都听说过page view event,就是用户每次打开网站上的某个页面发出来的tracking event,各个大公司一般用这些event来做一些统计分析,business analysis。大家一般会利用一些吞吐量大的分布式消息系统来存储这些event,例如kafka。这是因为对于一些popular的网站,每天可能会有上亿或者10亿的page view event。我们可以利用对这个event的处理来实现我们之前提到的功能。

通常有两种方法可以实现以上的功能,一个是通过hadoop map reduce job,或者更抽象的hive pig query来实现这样的统计功能。但是这个方法有一个明显的劣势,就是处理速度慢,很难做到事实更新。对于我们以上的功能要求或许这个方法没有任何问题,因为我们只关注过去90天的统计信息而且不要求显示当天信息。但是今天我们要探讨另一个实现方法,利用多streaming data processing做到实时统计更新。其实有好多功能是需要事实更新的,例如search index update,twitter或者facebook一些hot topic/trent的发现。
  1. Stream Data Repartition

我们可以通过对streaming data的repartition来实现同一个用户的page view events都聚集到了同一个机器上去处理,这样我们可以做到每个用户的统计数据都是准确的。这个功能基本所有主流的streaming data处理框架都支持,例如,kafka + samza,aws kinesis,storm。

  1. Streaming Data Join

我们可以看到我们需要根据viewer的职位名称或者公司名称来做统计,但是我们的page view event只有viewer的id,没有职位或者公司这些信息,那我们改怎么实现呢?

一个非常简单的思路就是让我们的streaming processor去call profile的api来拿到职位或者公司名称的信息。这样子做有几个非常明显的劣势。1. 如果streaming processor停止工作半个小时或者更长时间,在重启streaming processor的时候由于积累了大量的未处理的events,streaming processor会flood我们之前说过的profile api。2. Streaming processor每次通过network来call另外一个api会增加额外的latency。3. 很难做到online和offline的isolation,因为这个统计功能还是属于offline或者nearline data processing,我们不希望因为这个功能影响了用户查询或者修改profile信息。比如第一个case发生的时候。

另一个思路就是可以加cache,来cache profile的查询request。但是这样子也有一个劣势,如果TTL设的很大,很难做到cache的数据是事实更新的,如果TTL设的特别短,cahe又基本不起什么作用,而且增加额外的network cost。

这里我们介绍一个samza引进的一个新功能,stream joining。我们可以join page view event和profile edit event,然后解决以上两个方案的劣势。我们的stream processor需要同时听两种events(PageViewEvent and ProfileEditEvent),然后对这两种event进行同样的partition both by viewer id,对于profile edit events,我们可以在stream processing机器上建立一个小的数据库来存储profile的实时数据,这样子我们可以对viewer进行快速查询来enrish page view event with viewer job title和company information。然后我们再将enriched的page view event重新partition by user id。然后进行统计。这样子我们就做的了profile数据的isolation,也解决了network call的latentcy cost。

Sunday, July 14, 2019

Leetcode solution 291: Word Pattern II

Problem Statement 

Given a pattern and a string str, find if str follows the same pattern.
Here follow means a full match, such that there is a bijection between a letter in pattern and a non-empty substring in str.
Example 1:
Input: pattern = "abab", str = "redblueredblue"
Output: true
Example 2:
Input: pattern = pattern = "aaaa", str = "asdasdasdasd"
Output: true
Example 3:
Input: pattern = "aabb", str = "xyzabcxzyabc"
Output: false
You may assume both pattern and str contains only lowercase letters.
Problem link


Video Tutorial

You can find the detailed video tutorial here

Thought Process

It is quite similar to Word Pattern and Isomorphic String problem, where we would keep a mapping from char to a string while also ensure there would be a one to one mapping, i.e., bijection mapping. The tricky part is it seems there are way many combinations of the mapping, how can we efficiently solve them?

Maybe we could list all the combinations? Maybe we could use DP since it is string related and only ask for true/false result?

How to list all combinations? Think about this way, let's say you have pattern = "aba" and str = "redbluered", since one char in pattern can map to any string length >= 1 in str, it is equivalent to divide up str into 3 parts (length of pattern) and check all cases. For instance, the cut of the words is like below:
  1. r | e | d b l u e r e d
  2. r | e d | b l u e r e d
  3. r | e d b | l u e r e d
  4. r | e d b l | u e r e d
  5. r | e d b l u | e r e d
  6. r | e d b l u e | r e d
  7. r | e d b l u e r | e d
  8. r | e d b l u e r e | d
  9. r e | d | b l u e r e d 
  10. r e | d b | l u e r e d 
  11. r e | d b l | u e r e d  
  12. r e | d b l u | e r e d 
  13. r e | d b l u e | r e d  
  14. r e | d b l u e r | e d 
  15. r e | d b l u e r e | d  
  16. r e d | b | l u e r e d  
  17. ..... 
 In general, if the length of pattern is M, the str length is N, the time complexity of this brute force method is O(N^M), more accurately, it should be


DP solution does not work since we cannot easily get a deduction formula :(


Brute force list all the combos

For each character in pattern, try to map any possible remaining strings in str from length 1 to the end. During this process, need to make sure the string mapping is bijection (no two chars in pattern map to the same string in str) and if a mapping has been seen before, continue use that mapping.

A DFS recursion would be the implementation. A few caveats in implementation
  • Remember to reset the map and set after recursion returned false
  • When there is a bijection mapping, should continue instead of directly break

Time Complexity: O(N^M), or C(N^M) to be exact. Pattern length is M, str length is N
Space Complexity: O(M), Pattern length is M, str length is N. We use a map and a set to store the lookup, but at one time, the map should not exceed the pattern size, so is the set


Saturday, July 6, 2019

Leetcode solution 205: Isomorphic Strings

Problem Statement 

Given two strings s and t, determine if they are isomorphic.
Two strings are isomorphic if the characters in s can be replaced to get t.
All occurrences of a character must be replaced with another character while preserving the order of characters. No two characters may map to the same character but a character may map to itself.
Example 1:
Input: s = "egg", t = "add"
Output: true
Example 2:
Input: s = "foo", t = "bar"
Output: false
Example 3:
Input: s = "paper", t = "title"
Output: true
You may assume both and have the same length.
Problem link


Video Tutorial

You can find the detailed video tutorial here

Thought Process

A relatively straightforward problem, very similar to Word Pattern. All we have to do is check the one to one mapping from string a to string b, also it needs to maintain a bijection mapping (meaning no two different characters in a should map to the same character in b)

Use bijection mapping

Check character one by one from a and b. If char in a hasn't been seen before, create a one to one mapping between this char in a and the char in b so later if this char in a is seen again, it has to map to b, else we return false. Moreover, need to make sure the char in b is never mapped by a different character.
An explanation int the video


Time Complexity: O(N), N is the length of string a or string b
Space Complexity: O(N), N is the length of string a or string b because the hashmap and set we use