Sunday, August 31, 2014

Labor Is Glorious! Double Letter Problem from TopCoder



Baozi IT Interview Training: helping you land your dream offer!

This is a fairly easy 250-point problem from TopCoder. It makes a good first warm-up question in an interview, since it checks whether the candidate understands basic data structures; a brute-force simulation works, but it is not a good answer.


Without further ado, here is the problem.


Problem Statement
   
You are given a String S. You can modify this string by repeating the following process:
1. Find the leftmost occurrence of two consecutive letters in S that are the same.
2. If you found a pair of identical letters in the first step, delete those two letters from S.

For example, if S="aabccb", you can proceed as follows:
- Find and erase "aa", producing the string "bccb".
- Find and erase "cc", producing the string "bb".
- Find and erase "bb", producing the empty string.

For S="axxyybac" you can do at most two steps, erasing "xx" first and "yy" next. Once you obtain the string "abac", you are done. Note that you cannot erase the two "a"s because they are not consecutive. You want to change S into an empty string by repeating the operation described above. Return "Possible" if you can do that, and "Impossible" otherwise.
Definition

Class: DoubleLetter
Method: ableToSolve
Parameters: String
Returns: String
Method signature: String ableToSolve(String S)
(be sure your method is public)

Limits

Time limit (s): 2.000
Memory limit (MB): 256

Constraints
- S will contain between 1 and 50 characters.
- Each character in S will be a lowercase English letter ('a'-'z').
Examples
0) "aabccb"
Returns: "Possible"

1) "aabccbb"
Returns: "Impossible"
The process will terminate with a single 'b'.

2) "abcddcba"
Returns: "Possible"
"abcddcba" -> "abccba" -> "abba" -> "aa" -> "".

3) "abab"
Returns: "Impossible"
No two successive letters are the same, so we can't do any operation.

4) "aaaaaaaaaa"
Returns: "Possible"

5) "aababbabbaba"
Returns: "Impossible"

6) "zzxzxxzxxzzx"
Returns: "Possible"

This problem statement is the exclusive and proprietary property of TopCoder, Inc. Any unauthorized use or reproduction of this information without the prior written consent of TopCoder, Inc. is strictly prohibited. (c)2003, TopCoder, Inc. All rights reserved.
As we said, if the candidate can put a stack to good use, the problem becomes very easy, and the solution is much better than brute-force simulation. So tailor your approach to each problem in an interview: choosing the right data structure gives you twice the result for half the effort.

The code is as follows.
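The original code snippet did not survive in this copy of the post; below is a sketch of the stack solution (only the class and method names come from the problem statement). Scan the string once: when the current letter equals the letter on top of the stack, the pair cancels; otherwise push the letter. The string reduces to the empty string exactly when the stack ends up empty.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class DoubleLetter {
    public String ableToSolve(String s) {
        Deque<Character> stack = new ArrayDeque<>();
        for (char c : s.toCharArray()) {
            if (!stack.isEmpty() && stack.peek() == c) {
                stack.pop();    // top letter equals the current one: the pair cancels
            } else {
                stack.push(c);  // otherwise remember this letter
            }
        }
        return stack.isEmpty() ? "Possible" : "Impossible";
    }

    public static void main(String[] args) {
        DoubleLetter d = new DoubleLetter();
        System.out.println(d.ableToSolve("aabccb"));   // Possible
        System.out.println(d.ableToSolve("aabccbb"));  // Impossible
    }
}
```

Each letter is pushed and popped at most once, so this runs in O(n), versus the quadratic cost of repeatedly scanning and deleting pairs in a brute-force simulation.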






Wednesday, August 27, 2014

[LinkedIn, Twitter and Hulu Onsite] How to design and implement a deep iterator?



Question:
Write a deep iterator to iterate through a list whose elements are either integers or other such lists. This question is frequently asked by LinkedIn, Twitter and Hulu.


For example, this collection contains Integers or other collections. L means a collection that itself contains either integers or another L:

1 -> 2 -> L -> 7 -> 8
          |
          +--> 3 -> 4 -> L -> 6
                         |
                         +--> 5

We would expect an iterator looping through it to print out 1, 2, 3, 4, 5, 6, 7, 8.



Analysis:
The input is a list of objects, each of which is either another list or an integer, so the input resembles a tree (or forest) data structure. We are asked to implement two iterator methods, hasNext and next (Iterator in Java has hasNext(), next() and remove()), which means iterating through the tree. With recursion this is easy, but here we need to convert the recursion into iteration, so we use a stack to keep track of the level we are at during the iteration. The code follows.
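One possible implementation is sketched below; the class name DeepIterator and the choice of modeling the input as a List whose elements are either Integer or further Lists are assumptions, not fixed by the question. A stack of iterators replaces the recursion: the top of the stack is the level we are currently walking.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

public class DeepIterator implements Iterator<Integer> {
    // One iterator per nesting level; the top of the stack is the current level.
    private final Deque<Iterator<?>> stack = new ArrayDeque<>();
    private Integer nextValue;   // the next Integer to hand out, or null if none left

    public DeepIterator(List<?> nested) {
        stack.push(nested.iterator());
        advance();
    }

    // Move forward until the next Integer is located or every level is exhausted.
    private void advance() {
        nextValue = null;
        while (!stack.isEmpty()) {
            Iterator<?> it = stack.peek();
            if (!it.hasNext()) {
                stack.pop();                              // this level is done
            } else {
                Object o = it.next();
                if (o instanceof Integer) {
                    nextValue = (Integer) o;
                    return;
                }
                stack.push(((List<?>) o).iterator());     // descend into a sublist
            }
        }
    }

    @Override
    public boolean hasNext() {
        return nextValue != null;
    }

    @Override
    public Integer next() {
        Integer result = nextValue;
        advance();
        return result;
    }

    public static void main(String[] args) {
        // [1, 2, [3, 4, [5], 6], 7, 8] -- the example pictured above
        List<?> data = Arrays.asList(1, 2, Arrays.asList(3, 4, Arrays.asList(5), 6), 7, 8);
        Iterator<Integer> it = new DeepIterator(data);
        while (it.hasNext()) {
            System.out.print(it.next() + " ");   // 1 2 3 4 5 6 7 8
        }
    }
}
```

The main method doubles as a quick test on the example collection above.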

You could then test it by iterating over the example collection above and checking the output order.

Sunday, August 24, 2014

No More Brain Twisting: A Summary of Brain Teasers in Programmer and Engineer Interviews (Part 2)




A brain teaser, as the name suggests, is a mind-bending puzzle, but these questions usually also require some mathematical grounding: you need to make sound logical deductions and express them relatively accurately and completely, so they are not easy. Brain teasers are also somewhat controversial: many in the industry believe they cannot really measure how smart a person is, since having seen a similar problem before strongly affects performance, and many companies have stopped using them as interview questions. In my own experience, though, plenty of companies still ask one brain teaser during a phone screen, usually as a small side dish after the coding question, to gauge the candidate's communication skills and problem-solving approach. Since it is a side dish, failing a brain teaser usually does not hurt the final result much, but doing well can certainly earn you extra points. So let's look at a few classic brain teasers.


1. A single-elimination tournament has 64 players. Players are paired up and play; the loser of each match is eliminated and plays no more, while the winner advances to the next round, and this continues until a champion is decided. How many matches are played in total? (If your interviewer is a sports fan, you may well get this one.)


Approach:
Brain teaser statements tend to be long, so if anything is unclear, confirm it or ask for it to be repeated. Once you understand the requirements, ask related questions. For example, here 64 is an exact power of 2: do we need to handle player counts that are not powers of 2? (The interviewer will often say you can consider that later, but should solve the given case first.)


The most direct solution is to count round by round: 64 players down to 32 takes 32 matches, 32 down to 16 takes 16, and so on, summing 32+16+8+4+2+1 = 63 matches. This works, but it is clumsy: with, say, 111 players, where leftover players get a bye for the round, round-by-round counting becomes awkward.


For this kind of problem, the most important step is usually to dig out its essence. We are counting matches, and the number of matches is directly tied to the number of players, but that relationship can be read more deeply: each match involves two players and eliminates exactly one. Turned around, eliminating one player requires exactly one match. To crown a champion among n players, n-1 players must be eliminated, so n-1 matches are needed. Problem solved.
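The elimination argument can be checked with a few lines of code; the method name below is mine, and counting round by round is exactly the "clumsy" approach, which always agrees with n-1 even when byes are involved:

```java
public class TournamentMatches {
    // Count matches the slow way: round after round. A player left over
    // in an odd-sized round gets a bye into the next round.
    static int bySimulation(int players) {
        int matches = 0;
        while (players > 1) {
            matches += players / 2;               // pairs playing this round
            players = players / 2 + players % 2;  // winners plus the bye, if any
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(bySimulation(64));   // 63, i.e. 32+16+8+4+2+1
        System.out.println(bySimulation(111));  // 110, still n - 1
    }
}
```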


Comments:


For problems like this, producing an answer is not hard; what is hard is quickly seeing through to the essence of the problem, then examining the core relationship from several angles and asking yourself "why" a few times. That often sparks the insight.


2. There are 1000 bottles of wine, exactly one of which is poisoned. The poison kills 10 hours after entering the body. You have a supply of lab mice to experiment on, but only 11 hours of time. What is the minimum number of mice needed to determine which bottle is poisoned?


Approach: Understanding the question is again the first priority; ask the interviewer whenever something is unclear, as this is part of the communication test. For example: "Can one mouse drink from multiple bottles?" "Yes." "Can multiple mice drink from one bottle?" "Yes." And so on.


What is the essence of the problem?


Think it over: in one sentence, because of the 11-versus-10-hour limit, the problem is really asking you to determine the poisoned bottle from the mice's life-or-death outcomes after a single round of drinking.


The most direct approach uses 1000 mice: after 10 hours, the bottle drunk by the dead mouse is the poisoned one. (Or 999 mice; if none dies, the one bottle no mouse drank is the poisoned one.) Still feasible, but clumsy: it does not exploit the fact that one mouse may drink from several bottles, so there is clearly a better solution.


Back to the essence we identified: in the end, we must read the answer off the mice's life-or-death states. Each mouse finishes in one of two states, alive or dead, which quick thinkers will immediately recognize as the two states 1 and 0. One mouse has 2 possible states (alive, dead), two mice have 4 (alive/alive, alive/dead, dead/alive, dead/dead), and by induction 10 mice have 2^10 = 1024 states, enough to cover 1000 bottles. Problem seemingly solved!


The answer is right, but how do we design a drinking assignment so that the final life-or-death pattern of the 10 mice maps to exactly one bottle?


Consider a 10-bit binary number:


1 0 0 0 0 0 0 1 1 1 => this corresponds to the outcome in which the mice in the "1" positions die and the mice in the other positions live; converted to decimal, this identifies bottle 519, a one-to-one mapping.


Going further, each mouse's survival should determine one bit (0/1) of this binary number. How? Take the mouse assigned to the highest bit as an example: let it drink every bottle whose number has a 1 in the highest bit (1*********), 512 bottles in total. If it ends up dead, that bit must be 1; otherwise it is 0. Apply the same rule to the mice for the other bits and the answer follows.
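The scheme is easy to simulate (all names below are mine): given the poisoned bottle's number, derive which mice die, then read the death pattern back as a binary number.

```java
public class PoisonedWine {
    // Mouse m drinks every bottle whose m-th binary bit is 1. Given the
    // poisoned bottle, return the bottle number deduced from the dead mice.
    static int identify(int poisonedBottle, int mice) {
        boolean[] dead = new boolean[mice];
        for (int m = 0; m < mice; m++) {
            // mouse m drank the poisoned bottle iff bit m of its number is 1
            dead[m] = ((poisonedBottle >> m) & 1) == 1;
        }
        int deduced = 0;
        for (int m = 0; m < mice; m++) {
            if (dead[m]) deduced |= 1 << m;   // dead mice spell out the bits
        }
        return deduced;
    }

    public static void main(String[] args) {
        // 10 mice give 2^10 = 1024 patterns, enough for bottles 0..999.
        System.out.println(identify(519, 10));  // 519
    }
}
```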


Comments:


Sensitivity to numbers is the key to this problem: when you see a binary state like life-or-death together with the number 1000, an IT person should immediately recall that 2^10 = 1024, enough to cover 1000 states. When working backwards to the assignment scheme, sketch a concrete example on paper, then generalize, and the problem is solved.

Thursday, August 21, 2014

How to design a basic web crawler?





This is a frequently asked design question in interviews, and not only at Google or Bing: many other companies will ask it too, since they may have an internal search engine. The question tests your basic understanding of computer science fundamentals.

Keep in mind that a production web crawler can be very sophisticated and normally takes several teams weeks or months to develop. The interviewer will not expect you to cover all the details, but you should be able to mention the key design considerations.

How to abstract the internet?
You should quickly realize that the internet can be abstracted as a directed graph, with each page as a node and each hyperlink as an edge.

How to crawl?
BFS is normally used. However, DFS is also useful in some situations: for example, if your crawler has already established a connection with a website, it might just DFS all the URLs within that site to save some handshaking overhead.
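The BFS just described can be sketched as follows. Fetcher, extractLinks and crawl are placeholder names I made up, and real HTTP fetching plus HTML parsing are stubbed out behind the interface; the point is the frontier queue plus the visited set.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

public class SimpleCrawler {
    // Stand-in for the real fetch-and-parse step.
    interface Fetcher {
        List<String> extractLinks(String url);
    }

    // BFS over the web graph: pages are nodes, hyperlinks are edges.
    static Set<String> crawl(String seed, Fetcher fetcher, int maxPages) {
        Set<String> visited = new HashSet<>();
        Queue<String> frontier = new ArrayDeque<>();
        visited.add(seed);
        frontier.add(seed);
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            for (String link : fetcher.extractLinks(url)) {
                if (visited.add(link)) {   // add() returns false if seen before
                    frontier.add(link);
                }
            }
        }
        return visited;
    }
}
```

In a real system the visited set becomes shared state across many fetcher machines, which is where the consistent hashing mentioned below comes in.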

Decide what you want to crawl?
The internet is huge, so your graph is huge. It is almost impossible to crawl the entire internet, since it keeps growing every second. Roughly speaking, Google has around 60 trillion pages indexed while Bing has around 30 trillion.

Some strategies used in practice include:

a) Keep a few websites on your crawler's most-frequent list, such as authoritative news websites.

b) Run lots of fetchers across many host classes. Use machine learning to predict which websites are most likely to update frequently, and put those at the head of the fetchers' priority queue.

How to track what your fetchers have crawled?
You don't want your fetchers to crawl some websites over and over while other websites never get crawled at all. There are many ways to achieve this. For example, your scheduler should generate non-duplicate jobs for the fetchers, or your fetchers could keep track of the last visited time for each URL. Note that at this scale, consistent hashing is needed to shard that state.

Respect the standard?
If a site's robots.txt disallows crawling, the crawler should skip the disallowed content. If this happens a lot for one fetcher, that fetcher ends up doing little useful work per run; one optimization is to let it pick some other sites to crawl, reducing the overhead of spawning fetchers.

Sometimes URLs on a website are not easy to extract. For example, some URLs are generated by JavaScript; you need some way to execute that JavaScript and recover the URLs.



Saturday, August 16, 2014

A Glance at Inheritance and Polymorphism in Java



Java is an OOP language, and among the basic OOP features, inheritance and polymorphism are frequently at the core of what interviewers probe. In many interviews candidates are only asked for the basic concepts, which are easy to answer, but as programmers we need to understand these relationships deeply rather than stopping at the literal definitions. Today Baozi explains to Java enthusiasts how inheritance and polymorphism behave in Java.

Please look at the following program first:
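The original listing did not survive in this copy of the post, so here is a reconstruction matching the description and the output that follows; it is a sketch, and details such as the print format are my guesses (the format below relies on Class.toString() producing strings like "class TestClass").

```java
class TestClass {
    int a, b;

    TestClass() {
        a = 5;
        b = 6;
        print();
    }

    void print() {
        // this.getClass() stringifies to e.g. "class TestClass"
        System.out.println(this.getClass() + ": a:" + a + ", b:" + b);
    }
}

class TestClass2 extends TestClass {
    TestClass2() {
        super();   // runs TestClass's constructor first
        a = 1;
        b = 2;
        print();
    }

    @Override
    void print() {
        System.out.println(this.getClass() + ": a:" + a + ", b:" + b);
    }
}

public class InheritanceDemo {
    public static void main(String[] args) {
        new TestClass();
        new TestClass2();
    }
}
```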

The program above is simple. We first define a base class called TestClass, with two int fields, a and b, and a print() method that prints the current class together with a and b. In TestClass's constructor we assign a and b and then call print(). Next we define another class, TestClass2, which extends TestClass and overrides print(). In TestClass2's constructor we first call the parent constructor via super(), then reassign a and b and call print().


Here is the output of this program:
 class TestClass: a:5, b:6
 class TestClass2: a:5, b:6
 class TestClass2: a:1, b:2

Observant readers will have noticed that class TestClass2 appears twice in the output. You might ask: when TestClass2 calls super(), it is clearly the code in TestClass's constructor that calls print(), so why does the output say class TestClass2? Because the object was created as a TestClass2: no matter which superclass function is executing, this.getClass() always resolves at runtime to the actual subclass. That is why the result shows TestClass2.


Now let's look at another example:
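Again reconstructed to match the output shown below; the first two classes are repeated so the listing stands on its own.

```java
class TestClass {
    int a, b;
    TestClass() { a = 5; b = 6; print(); }
    void print() { System.out.println(this.getClass() + ": a:" + a + ", b:" + b); }
}

class TestClass2 extends TestClass {
    TestClass2() { super(); a = 1; b = 2; print(); }
    @Override
    void print() { System.out.println(this.getClass() + ": a:" + a + ", b:" + b); }
}

// No explicit super() here: the compiler inserts one as the constructor's
// first statement, so TestClass's and TestClass2's constructors still run.
class TestClass3 extends TestClass2 {
    TestClass3() { print(); }
}

public class InheritanceDemo3 {
    public static void main(String[] args) {
        new TestClass();
        new TestClass2();
        new TestClass3();
    }
}
```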

We define a new class, TestClass3, which extends TestClass2. TestClass3's constructor only calls print(). The output is:
 class TestClass: a:5, b:6
 class TestClass2: a:5, b:6
 class TestClass2: a:1, b:2
 class TestClass3: a:5, b:6
 class TestClass3: a:1, b:2
 class TestClass3: a:1, b:2 

Many people are surprised by the output above: we never call super() or anything like it in TestClass3, so why does TestClass3 produce class TestClass3: a:5, b:6 / class TestClass3: a:1, b:2 / class TestClass3: a:1, b:2? The reason is that when Java instantiates a subclass, the superclass constructors also execute. Even if we do not write super() explicitly in the subclass constructor, the Java compiler automatically inserts super() as the first statement of the constructor. That is why TestClass3 calls super() implicitly.


Finally, let's look at one more interesting example:

Here we remove the assignments to a and b from TestClass's constructor; print() still prints the current values of a and b. The output becomes:

class TestClass: a:0, b:0
class TestClass2: a:0, b:0
class TestClass2: a:1, b:2
class TestClass3: a:0, b:0
class TestClass3: a:1, b:2
class TestClass3: a:1, b:2

At this point, readers who haven't worked in Java may wonder why we can print variables that were never initialized. This comes down to a property of Java itself: instance fields always receive default values before any constructor code runs. An int field defaults to 0 (a boolean to false, an object reference to null), so the seemingly unassigned a and b actually start out holding 0. That is Baozi's summary of inheritance and polymorphism in Java programming. If you have any questions, feel free to email our public mailbox.

Wednesday, August 13, 2014

A Summary of String Edit Distance Problems




[Example 1]
Find the words in a dictionary whose edit distance [1] to the current input string is exactly 1. (Edit distance usually means the minimum edit distance: the minimum number of add, delete, or replace steps needed to turn one word into another.)
[Approach]
The simplest method is to compute the edit distance between the input string and every word in the dictionary and return the words where it equals 1. A good edit distance algorithm takes O(n^2) time, where n is the length of the two strings, so with m words in the dictionary the total is O(m*n^2), which is very slow.


The usual tool for matching a string against a set of strings is to build a trie over the set. We cannot apply it directly here, because we need edit distance exactly 1. In fact, edit distance 1 simply means one character of a trie string is allowed to mismatch the input: either the dictionary word has one extra letter, or the input string has one extra letter, or the two words differ in exactly one letter. So for this problem we still build a trie for the dict and still match the input against it, but permit (at most) one mismatched character during matching; walking the trie this way finds all words at edit distance 1 without comparing the input against every dictionary word one by one.
In the implementation we borrow the notion of a wildcard: once the two strings differ in one letter, the wildcard is used up, and if another mismatching letter appears we can stop comparing the current pair of words.
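Stripped of the trie, the wildcard bookkeeping for two plain strings looks like this sketch (class and method names are mine); the trie traversal spends the same one-mismatch budget while walking edges instead of comparing whole words.

```java
public class EditDistanceOne {
    // True when a and b are at most one edit apart (insert, delete or
    // replace). The single allowed mismatch is the "wildcard".
    static boolean withinOneEdit(String a, String b) {
        if (Math.abs(a.length() - b.length()) > 1) return false;
        int i = 0, j = 0;
        boolean usedWildcard = false;
        while (i < a.length() && j < b.length()) {
            if (a.charAt(i) == b.charAt(j)) { i++; j++; continue; }
            if (usedWildcard) return false;   // second mismatch: stop early
            usedWildcard = true;
            if (a.length() > b.length()) {
                i++;                          // a carries an extra letter
            } else if (a.length() < b.length()) {
                j++;                          // b carries an extra letter
            } else {
                i++; j++;                     // one letter replaced
            }
        }
        // A leftover trailing letter also costs the wildcard.
        return !usedWildcard || (i == a.length() && j == b.length());
    }

    public static void main(String[] args) {
        String[] dict = {"cat", "cart", "cot", "dog"};
        for (String w : dict) {
            if (!w.equals("cat") && withinOneEdit("cat", w)) {
                System.out.println(w);   // cart, cot
            }
        }
    }
}
```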


[Example 2]
Find the words in a dictionary whose edit distance to the current input string is less than k; this is typically used for automatic spelling correction in documents.
[Approach]
This looks a lot like Example 1, but the wildcard trick by itself no longer fits, because edit distance normally means the minimum distance (otherwise it would be meaningless). Applied naively, wildcards might perform redundant operations on a character (say, insert an 'a' and then delete it).
Three ways to search for minimum edit distance in a dictionary:


1. Naive approach
The obvious way of doing this is to compute the edit distance from the query term to each dictionary term, before selecting the string(s) of minimum edit distance as spelling suggestion. This exhaustive search is inordinately expensive.
Source: Christopher D. Manning, Prabhakar Raghavan & Hinrich Schütze: Introduction to Information Retrieval.


The performance can be significantly improved by terminating the edit distance calculation as soon as a threshold of 2 or 3 has been reached.


2. Peter Norvig
Generate all possible terms with an edit distance <=2 (deletes + transposes + replaces + inserts) from the query term and search them in the dictionary.
For a word of length n, an alphabet size a, an edit distance d=1, there will be n deletions, n-1 transpositions, a*n alterations, and a*(n+1) insertions, for a total of 2n+2an+a-1 terms at search time.
Source: Peter Norvig: How to Write a Spelling Corrector.
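To make the counting concrete, the sketch below (names mine) generates every distance-1 candidate, keeping duplicates so that the total matches the formula; note it uses a lowercase alphabet of a=26 rather than the a=36 in the text.

```java
import java.util.ArrayList;
import java.util.List;

public class Edits1 {
    static final String ALPHABET = "abcdefghijklmnopqrstuvwxyz";

    // All strings one edit away from w: n deletes, n-1 transposes,
    // a*n replaces, a*(n+1) inserts (duplicates kept on purpose).
    static List<String> edits1(String w) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < w.length(); i++)
            out.add(w.substring(0, i) + w.substring(i + 1));
        for (int i = 0; i + 1 < w.length(); i++)
            out.add(w.substring(0, i) + w.charAt(i + 1) + w.charAt(i) + w.substring(i + 2));
        for (int i = 0; i < w.length(); i++)
            for (char c : ALPHABET.toCharArray())
                out.add(w.substring(0, i) + c + w.substring(i + 1));
        for (int i = 0; i <= w.length(); i++)
            for (char c : ALPHABET.toCharArray())
                out.add(w.substring(0, i) + c + w.substring(i));
        return out;
    }

    public static void main(String[] args) {
        // n=9, a=26: 2n + 2an + a - 1 = 18 + 468 + 25 = 511
        System.out.println(edits1("something").size());  // 511
    }
}
```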


This is much better than the naive approach, but still expensive at search time (114,324 terms for n=9, a=36, d=2) and language dependent, because the alphabet is used to generate the terms; the alphabet differs between languages and is huge in Chinese (a=70,000 Unicode Han characters).


3. Symmetric Delete Spelling Correction (FAROO)
Generate terms with an edit distance <=2 (deletes only) from each dictionary term and add them together with the original term to the dictionary. This has to be done only once during a pre-calculation step.
Generate terms with an edit distance <=2 (deletes only) from the input term and search them in the dictionary.
For a word of length n, an alphabet size of a, an edit distance of 1, there will be just n deletions, for a total of n terms at search time.


This is three orders of magnitude less expensive (36 terms for n=9 and d=2) and language independent (the alphabet is not required to generate deletes).
The cost of this approach is the pre-calculation time and storage space of x deletes for every original dictionary entry, which is acceptable in most cases.
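The pre-calculation step can be sketched as follows (names mine). For instance, delete(sun,1) yields {un, sn, su}, and delete(sin,1) also contains sn, which is exactly the collision Remark 1 below talks about.

```java
import java.util.HashSet;
import java.util.Set;

public class SymmetricDeletes {
    // All distinct strings reachable from w by deleting up to maxDistance
    // characters -- the FAROO pre-calculation applied to a single term.
    static Set<String> deletes(String w, int maxDistance) {
        Set<String> out = new HashSet<>();
        collect(w, maxDistance, out);
        return out;
    }

    private static void collect(String w, int remaining, Set<String> out) {
        if (remaining == 0 || w.isEmpty()) return;
        for (int i = 0; i < w.length(); i++) {
            String shorter = w.substring(0, i) + w.substring(i + 1);
            if (out.add(shorter)) {            // unseen: recurse one level deeper
                collect(shorter, remaining - 1, out);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(deletes("sun", 1));                 // e.g. [un, su, sn] (order varies)
        System.out.println(deletes("sin", 1).contains("sn"));  // true
    }
}
```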


The number x of deletes for a single dictionary entry depends on the maximum edit distance: x=n for edit distance=1, x=n*(n-1)/2 for edit distance=2, x=n!/d!/(n-d)! for edit distance=d (combinatorics: k out of n combinations without repetitions, and k=n-d),
E.g. for a maximum edit distance of 2 and an average word length of 5 and 100,000 dictionary entries we need to additionally store 1,500,000 deletes.


Remark 1: During the precalculation, different words in the dictionary might lead to the same delete term: delete(sun,1)==delete(sin,1)==sn.
While we generate only one new dictionary entry (sn), internally we need to store both original terms as spelling correction suggestions (sun, sin).


Remark 2: There are four different comparison pair types:


dictionary entry==input entry,
delete(dictionary entry,p1)==input entry
dictionary entry==delete(input entry,p2)
delete(dictionary entry,p1)==delete(input entry,p2)
The last comparison type is required for replaces and transposes only. But we need to check whether the suggested dictionary term is really a replace or an adjacent transpose of the input term to prevent false positives of higher edit distance (bank==bnak and bank==bink, but bank!=kanb and bank!=xban and bank!=baxn).


Remark 3: Instead of a dedicated spelling dictionary we are using the search engine index itself. This has several benefits:


It is dynamically updated. Every newly indexed word, whose frequency is over a certain threshold, is automatically used for spelling correction as well.
As we need to search the index anyway the spelling correction comes at almost no extra cost.
When indexing misspelled terms (i.e. terms not marked as correct in the index) we do a spelling correction on the fly and index the page for the correct term as well.
Remark 4: We have implemented query suggestions/completion in a similar fashion. This is a good way to prevent spelling errors in the first place. Every newly indexed word, whose frequency is over a certain threshold, is stored as a suggestion to all of its prefixes (they are created in the index if they do not yet exist). As we anyway provide an instant search feature the lookup for suggestions comes also at almost no extra cost. Multiple terms are sorted by the number of results stored in the index.


Reasoning
In our algorithm we are exploiting the fact that the edit distance between two terms is symmetrical:


We can generate all terms with an edit distance <=2 from the query term (trying to reverse the query term error) and check them against all dictionary terms.
We can generate all terms with an edit distance <=2 from each dictionary term (trying to create the query term error) and check the query term against them.
We can combine both and meet in the middle, by transforming the correct dictionary terms to erroneous strings, and transforming the erroneous input term to the correct strings.
Because adding a char on the dictionary side is equivalent to removing a char from the input string and vice versa, we can on both sides restrict our transformation to deletes only.
We are using variant 3, because the delete-only-transformation is language independent and three orders of magnitude less expensive.


Computational Complexity
Our algorithm is constant time ( O(1) time ), i.e. independent of the dictionary size (but depending on the average term length and maximum edit distance), whereas e.g. BK-Trees have a search time of O(log dictionary_size).


[Example 3]
Find all pairs of words in a dictionary whose edit distance is exactly k.
[Approach]
We can reuse the idea from Example 2: for every word, record all variants obtainable by deleting up to k/2 characters, then look for variants that different words have in common (meeting in the middle).


[Reference]
Edit distance:

Saturday, August 9, 2014

How to calculate "hard" runtime complexity of recursive solutions?



In a technical interview, stating the runtime complexity of a solution correctly is a very important point. Since understanding algorithmic time complexity counts as CS fundamentals, a correct answer usually earns no bonus points, but a wrong answer is often fatal, so don't take this lightly.


Note: this post discusses only time complexity, not space complexity. :)


For basic complexity analysis, Big O notation is a must; see the Wikipedia entry for details: http://en.wikipedia.org/wiki/Big_O_notation . Simply put, Big O describes how an algorithm's running time grows as its input grows. For example, saying that an algorithm is O(n) means that when its input grows linearly, its running time also grows linearly.


Simple time complexity calculations


Most non-recursive algorithms are easy to analyze. In one sentence: count the layers of nested loops, where each loop's iteration count depends on the input size n. For instance, for the problem "find the maximum of n ints", the solution loops once over all the input (the n numbers) and returns the result. One layer of looping means the algorithm is O(n).
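As a one-loop sketch of that example (names mine):

```java
public class MaxOfN {
    // A single pass over the n inputs, so the running time is O(n).
    static int max(int[] nums) {
        int best = nums[0];
        for (int x : nums) {
            if (x > best) best = x;
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(max(new int[]{3, 1, 4, 1, 5, 9, 2, 6}));  // 9
    }
}
```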


Trickier time complexity calculations


When an algorithm contains both loops and recursion, working out the time complexity may take some thought. That is the focus of this post. :)


Let's use a recent high-frequency interview problem as the example:


Problem:


Given a string s and a dictionary of words dict, add spaces in s to construct a sentence where each word is a valid dictionary word. Return one such possible sentence.


s = "catsanddog", dict = ["cat", "cats", "and", "dog"], return “cats and dog";


(A simplified version of Leetcode Word Break II)


Clarification


Q: What should we return if s can't be split into words in dict? A: Return null.
Q: What if there are multiple ways of splitting s? A: Returning any one valid split is fine.
Q: What if s is null or an empty string? A: Return null.
And so on.


Code in first pass:
 public String wordBreak(String s, Set<String> dict) {
   if (s == null || s.isEmpty()) {
     return s;
   }
   List<String> solution = new ArrayList<String>();
   boolean ret = wordBreakHelper(s, dict, solution);
   if (!ret) {
     return null;
   }
  
   //join with the JDK's String.join instead of Apache commons-lang
   return String.join(" ", solution);
 }
 
 private boolean wordBreakHelper(String s, Set<String> dict, List<String> solution) {
   if (dict.contains(s)) {
     solution.add(s);
     return true;
   }
   for(int index = 1; index < s.length(); ++index) {
       String left = s.substring(0, index);
       if (dict.contains(left)) {
         solution.add(left);
         boolean ret = wordBreakHelper(s.substring(index), dict, solution);
         if (ret) {
           return ret;
         } else {
           solution.remove(solution.size() - 1);
         }
       }
    }
    return false;
 }


This solution is straightforward: for each string, split it into a left and a right part, look the left part up in the dict, and recurse on the right part (think about why there is no need to recurse on the left part).


So what is the time complexity of this algorithm?


It is not easy to read off at a glance, because there is recursion inside the loop and a loop inside each recursive call, so counting nested loops does not work. Let's walk through a concrete example:


Suppose the string s is abcde. The algorithm unfolds as follows:
When index = 1, the string splits into the left part a and the right part bcde. If a is not in the dictionary, this split fails and the case is pruned. If a happens to be in the dictionary, the algorithm has two choices: either put a into the solution and recurse on bcde, or decline to treat a as a complete word, keep it as the prefix of a longer word, and try the next index.


When index = 2, index = 1 has produced two cases: "a bcde" and "abcde".


In the "a bcde" case, the algorithm recurses on bcde; just as at index = 1, if b is in the dictionary, two cases emerge: "a b cde" and "a bcde".
In the "abcde" case, if ab is in the dictionary, two cases emerge: "ab cde" and "abcde".


By now the pattern should be clear: at each index, the decision whether to end a word at that letter doubles the number of possibilities, level upon level. The complexity of this algorithm is therefore O(2^n).


Code in second pass:


The version above is easy to follow but unavoidably repeats a lot of work. For instance, the substring cde is examined by at least the two splits "a b cde" and "ab cde". If we keep an extra set that records earlier results, the complexity drops sharply.


The modified code:


 private boolean wordBreakHelper(String s, Set<String> dict, Set<String> unsplitableSet, List<String> solution) {
   if (dict.contains(s)) {
     solution.add(s);
     return true;
   }
   if (unsplitableSet.contains(s)) {
     return false;
   }
   for(int index = 1; index < s.length(); ++index) {
       String left = s.substring(0, index);
       if (dict.contains(left)) {
         solution.add(left);
         boolean ret = wordBreakHelper(s.substring(index), dict, unsplitableSet, solution);
         if (ret) {
           return ret;
         } else {
           solution.remove(solution.size() - 1);
         }
       }
    }
    unsplitableSet.add(s);
    return false;
 }


Given that the algorithm above avoids recomputing identical substrings, what is its time complexity?


Since all repeated computations have been eliminated, we may as well enumerate every possible computation attempt, again using abcde:


Attempts starting at a: a, ab, abc, abcd, abcde
Attempts starting at b: b, bc, bcd, bcde
Attempts starting at c: c, cd, cde
...and so on for d and e.

Clearly the total number of attempts is O(n^2), where n is the length of the string, and each attempt is computed only once (which is exactly what unsplitableSet accomplishes), so the time complexity above is O(n^2).