Wednesday, August 13, 2014

A Summary of String Edit Distance Problems




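[Problem 1]
Find the words in a dictionary whose edit distance [1] to the input string is 1. (Edit distance normally means the minimum edit distance: the smallest number of add, delete, and replace steps needed to turn one word into another.)
[Approach]
The simplest method is to compute the edit distance between the input string and every word in the dictionary, and return the words whose distance is 1.
A good edit distance algorithm runs in O(n^2) time, where n is the length of the two strings.
With m words in the dictionary, the total time is O(m*n^2), which is very slow.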
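To make the cost concrete, here is a minimal sketch of the naive scan, assuming the standard dynamic-programming edit distance with unit-cost add, delete, and replace. The function names `edit_distance` and `words_at_distance_one` are made up for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance (add, delete, replace) by dynamic programming."""
    # prev[j] is the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # add cb
                            prev[j - 1] + cost))  # replace, or match for free
        prev = curr
    return prev[-1]


def words_at_distance_one(query, dictionary):
    """Naive scan: one O(n^2) comparison per dictionary word."""
    return [w for w in dictionary if edit_distance(query, w) == 1]


print(words_at_distance_one("cat", ["cat", "cart", "at", "bat", "dog"]))
# ['cart', 'at', 'bat']
```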


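The usual way to match a string against a string set is to build a trie for the string set. That does not work directly here, because we want edit distance 1. In practice, edit distance 1 means that the string in the trie is allowed exactly one character that does not match the input. The mismatch can be an extra letter in the dictionary word, an extra letter in the input string, or one letter that differs between the two words. So for this problem we still build a trie for the dict and still match the input string against it, but we allow (only) one mismatched character while matching. Comparing the input string against the words stored in the trie this way finds every word at edit distance 1.
In the implementation we borrow the idea of a wildcard: once two strings differ in one letter, the wildcard is used up, and if another mismatched letter shows up there is no need to keep comparing the current two words.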
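The trie walk with a single wildcard might look like the sketch below. It assumes a trie built from nested dicts; `build_trie`, `search_distance_one`, and the `_END` marker are hypothetical names, not taken from any library.

```python
_END = object()  # marker key: a dictionary word ends at this node


def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node[_END] = w
    return root


def search_distance_one(root, s):
    """Return the dictionary words at edit distance exactly 1 from s."""
    found = set()

    def walk(node, i, budget):
        # Only accept a word once the single wildcard has been spent;
        # an unspent wildcard here would mean distance 0.
        if i == len(s) and budget == 0 and _END in node:
            found.add(node[_END])
        if i < len(s) and s[i] in node:
            walk(node[s[i]], i + 1, budget)  # letters match, wildcard untouched
        if budget == 0:
            return  # wildcard already used: no further mismatch allowed
        for ch, child in node.items():
            if ch is _END:
                continue
            walk(child, i, 0)  # dictionary word has an extra letter
            if i < len(s) and ch != s[i]:
                walk(child, i + 1, 0)  # one letter differs
        if i < len(s):
            walk(node, i + 1, 0)  # input string has an extra letter

    walk(root, 0, 1)
    return found


trie = build_trie(["cat", "cart", "at", "bat", "dog", "cats", "ca"])
print(sorted(search_distance_one(trie, "cat")))
# ['at', 'bat', 'ca', 'cart', 'cats']
```

Because the wildcard is spent on an actual difference (a different letter or a length change), every word collected differs from the input by exactly one edit.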


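[Problem 2]
Find the words in a dictionary whose edit distance to the input string is less than k. This is typically used for automatic spelling correction in documents.
[Approach]
This looks a lot like Problem 1, but the wildcard approach does not carry over, because edit distance means the minimum distance (otherwise it would be meaningless). With several wildcards, the search can operate on the same character more than once (for example, add an a and then delete it), so the number of wildcards spent is not necessarily the minimum distance.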
Three ways to search for minimum edit distance in a dictionary:


1. Naive approach
The obvious way of doing this is to compute the edit distance from the query term to each dictionary term, before selecting the string(s) of minimum edit distance as spelling suggestion. This exhaustive search is inordinately expensive.
Source: Christopher D. Manning, Prabhakar Raghavan & Hinrich Schütze: Introduction to Information Retrieval.


The performance can be significantly improved by terminating the edit distance calculation as soon as a threshold of 2 or 3 has been reached.
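A minimal sketch of that early termination, assuming the usual row-by-row dynamic program. The name `bounded_edit_distance` and the convention of returning `None` above the threshold are assumptions for illustration. The cutoff is safe because the final distance can never be smaller than the smallest value in any row.

```python
def bounded_edit_distance(a, b, max_dist):
    """Edit distance between a and b, or None if it exceeds max_dist."""
    # The length difference alone is a lower bound on the distance.
    if abs(len(a) - len(b)) > max_dist:
        return None
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        # Every path to the last cell crosses this row, and values never
        # decrease along a path, so we can stop as soon as the row is too large.
        if min(curr) > max_dist:
            return None
        prev = curr
    return prev[-1] if prev[-1] <= max_dist else None


print(bounded_edit_distance("kitten", "sitting", 3))  # 3
print(bounded_edit_distance("kitten", "sitting", 2))  # None
```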


2. Peter Norvig
Generate all possible terms with an edit distance <=2 (deletes + transposes + replaces + inserts) from the query term and search them in the dictionary.
For a word of length n, an alphabet size a, an edit distance d=1, there will be n deletions, n-1 transpositions, a*n alterations, and a*(n+1) insertions, for a total of 2n+2an+a-1 terms at search time.
Source: Peter Norvig: How to Write a Spelling Corrector.


This is much better than the naive approach, but still expensive at search time (114,324 terms for n=9, a=36, d=2) and language dependent (because the alphabet is used to generate the terms, and it differs between languages and is huge in Chinese: a=70,000 Unicode Han characters).
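The term generation for d=1 can be sketched as follows, in the spirit of Norvig's article. The alphabet of lowercase letters plus digits (a=36) and the helper name `suggestions` are assumptions made here. Duplicates are kept in the list so that its length matches the formula 2n+2an+a-1.

```python
import string

ALPHABET = string.ascii_lowercase + string.digits  # assumed alphabet, a = 36


def edits1(word, alphabet=ALPHABET):
    """All terms one delete, transpose, replace, or insert away (with duplicates)."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:]
                for left, right in splits if right for c in alphabet]
    inserts = [left + c + right for left, right in splits for c in alphabet]
    return deletes + transposes + replaces + inserts


def suggestions(word, dictionary, alphabet=ALPHABET):
    """Generated terms that actually occur in the dictionary."""
    return set(edits1(word, alphabet)) & set(dictionary)


print(len(edits1("something")))  # 701 = 2n + 2an + a - 1 for n=9, a=36
print(suggestions("somthing", {"something", "nothing"}))  # {'something'}
```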


3. Symmetric Delete Spelling Correction (FAROO)
Generate terms with an edit distance <=2 (deletes only) from each dictionary term and add them together with the original term to the dictionary. This has to be done only once during a pre-calculation step.
Generate terms with an edit distance <=2 (deletes only) from the input term and search them in the dictionary.
For a word of length n, an alphabet size of a, an edit distance of 1, there will be just n deletions, for a total of n terms at search time.


This is three orders of magnitude less expensive (36 two-character deletes, plus 9 single-character deletes, for n=9 and d=2) and language independent (the alphabet is not required to generate deletes).
The cost of this approach is the pre-calculation time and storage space of x deletes for every original dictionary entry, which is acceptable in most cases.


The number x of deletes for a single dictionary entry depends on the maximum edit distance: x=n for edit distance=1, x=n*(n-1)/2 for edit distance=2, and in general x=n!/(d!*(n-d)!) for edit distance=d (combinatorics: k out of n combinations without repetitions, with k=n-d). These counts are per distance, so a maximum edit distance of 2 needs n+n*(n-1)/2 deletes per entry.
E.g. for a maximum edit distance of 2, an average word length of 5, and 100,000 dictionary entries, we need to additionally store 1,500,000 deletes (5+10=15 per entry).


Remark 1: During the precalculation, different words in the dictionary might lead to the same delete term: delete(sun,1)==delete(sin,1)==sn.
While we generate only one new dictionary entry (sn), inside it we need to store both original terms as spelling correction suggestions (sun,sin).
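The precalculation step, including the shared entries from Remark 1, might be sketched like this. It assumes a plain in-memory dict as the dictionary; `deletes` and `build_index` are hypothetical helper names.

```python
from collections import defaultdict
from itertools import combinations


def deletes(term, max_dist):
    """All strings obtained by deleting 1..max_dist characters from term."""
    n = len(term)
    out = set()
    for d in range(1, min(max_dist, n) + 1):
        # Choose which n-d characters to keep, preserving their order.
        for keep in combinations(range(n), n - d):
            out.add("".join(term[i] for i in keep))
    return out


def build_index(dictionary, max_dist=2):
    """Map every term and every delete of it to the original term(s)."""
    index = defaultdict(set)
    for term in dictionary:
        index[term].add(term)
        for variant in deletes(term, max_dist):
            index[variant].add(term)  # one entry, several originals
    return index


index = build_index(["sun", "sin"], max_dist=1)
print(sorted(index["sn"]))         # ['sin', 'sun']
print(len(deletes("abcde", 2)))    # 15 = 5 + 10 for n=5, max distance 2
```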


Remark 2: There are four different comparison pair types:


dictionary entry==input entry
delete(dictionary entry,p1)==input entry
dictionary entry==delete(input entry,p2)
delete(dictionary entry,p1)==delete(input entry,p2)


The last comparison type is required for replaces and transposes only. But we need to check whether the suggested dictionary term is really a replace or an adjacent transpose of the input term, to prevent false positives of higher edit distance (bank==bnak and bank==bink, but bank!=kanb and bank!=xban and bank!=baxn).
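A sketch of the search side, with the false-positive check done by a second pass. The verification here uses the optimal string alignment distance (add, delete, replace, adjacent transpose), which is one way to implement the check and is an assumption of this sketch; all function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def deletes(term, max_dist):
    n = len(term)
    return {"".join(term[i] for i in keep)
            for d in range(1, min(max_dist, n) + 1)
            for keep in combinations(range(n), n - d)}


def build_index(dictionary, max_dist):
    index = defaultdict(set)
    for term in dictionary:
        for key in {term} | deletes(term, max_dist):
            index[key].add(term)
    return index


def osa_distance(a, b):
    """Edit distance with add, delete, replace, and adjacent transpose."""
    rows = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        rows[i][0] = i
    for j in range(len(b) + 1):
        rows[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            rows[i][j] = min(rows[i - 1][j] + 1, rows[i][j - 1] + 1,
                             rows[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                rows[i][j] = min(rows[i][j], rows[i - 2][j - 2] + 1)
    return rows[len(a)][len(b)]


def lookup(query, index, max_dist):
    """Collect candidates via all four comparison types, then verify them."""
    candidates = set()
    for key in {query} | deletes(query, max_dist):
        candidates |= index.get(key, set())
    return {t for t in candidates if osa_distance(query, t) <= max_dist}


index = build_index(["bank"], max_dist=1)
print(lookup("bnak", index, 1))  # {'bank'}  adjacent transpose
print(lookup("bink", index, 1))  # {'bank'}  replace
print(lookup("xban", index, 1))  # set()     shares "ban" but is 2 edits away
```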


Remark 3: Instead of a dedicated spelling dictionary we are using the search engine index itself. This has several benefits:


It is dynamically updated. Every newly indexed word, whose frequency is over a certain threshold, is automatically used for spelling correction as well.
As we need to search the index anyway the spelling correction comes at almost no extra cost.
When indexing misspelled terms (i.e. terms not marked as correct in the index), we do a spelling correction on the fly and index the page for the correct term as well.


Remark 4: We have implemented query suggestions/completion in a similar fashion. This is a good way to prevent spelling errors in the first place. Every newly indexed word whose frequency is over a certain threshold is stored as a suggestion for all of its prefixes (they are created in the index if they do not yet exist). Since we provide an instant search feature anyway, the lookup for suggestions also comes at almost no extra cost. Multiple terms are sorted by the number of results stored in the index.
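As a rough illustration of the prefix idea in Remark 4, here is a sketch that stores each sufficiently frequent term under all of its prefixes. The data layout, the `threshold` and `top` parameters, and the name `build_suggestions` are assumptions for this example and are not details given by the source.

```python
from collections import defaultdict


def build_suggestions(term_counts, threshold=0, top=5):
    """Map each prefix to its most frequent terms (term_counts: term -> result count)."""
    by_prefix = defaultdict(list)
    for term, count in term_counts.items():
        if count <= threshold:
            continue  # only terms over the threshold become suggestions
        for end in range(1, len(term) + 1):
            by_prefix[term[:end]].append(term)
    for terms in by_prefix.values():
        # Most results first; ties broken alphabetically for a stable order.
        terms.sort(key=lambda t: (-term_counts[t], t))
        del terms[top:]
    return by_prefix


counts = {"search": 50, "seat": 20, "sea": 80, "rare": 1}
suggest = build_suggestions(counts, threshold=5)
print(suggest["sea"])  # ['sea', 'search', 'seat']
```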


Reasoning
In our algorithm we are exploiting the fact that the edit distance between two terms is symmetrical:


We can generate all terms with an edit distance <=2 from the query term (trying to reverse the query term error) and check them against all dictionary terms,
We can generate all terms with an edit distance <=2 from each dictionary term (trying to create the query term error) and check the query term against them.
We can combine both and meet in the middle, by transforming the correct dictionary terms to erroneous strings, and transforming the erroneous input term to the correct strings.
Because adding a char to the dictionary term is equivalent to removing a char from the input string and vice versa, we can on both sides restrict our transformation to deletes only.
We are using variant 3, because the delete-only transformation is language independent and three orders of magnitude less expensive.


Computational Complexity
Our algorithm is constant time (O(1)), i.e. independent of the dictionary size (but dependent on the average term length and maximum edit distance), whereas BK-trees, for example, have a search time of O(log dictionary_size).


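[Problem 3]
Find all pairs of words in a dictionary whose edit distance == k.
[Approach]
This problem can reuse the idea from Problem 2: for every word, record all the strings obtained by deleting up to k characters (a replace costs one delete on each side, so k/2 deletes per word is not enough in general), then look for words that share one of these variants. As in Remark 2, a shared variant only gives a candidate pair, so the edit distance of each candidate still has to be checked against k.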
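A sketch of this pairing step, under two assumptions stated plainly: each word contributes its variants with 0 to k deletes (a replace needs one delete on each side, so fewer can miss pairs), and every candidate pair is verified with an exact edit distance. The function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def delete_variants(word, k):
    """The word itself plus every string with up to k characters deleted."""
    n = len(word)
    return {"".join(word[i] for i in keep)
            for d in range(0, min(k, n) + 1)
            for keep in combinations(range(n), n - d)}


def pairs_at_distance(words, k):
    buckets = defaultdict(set)
    for w in set(words):
        for v in delete_variants(w, k):
            buckets[v].add(w)
    # Words sharing a variant are only candidates; verify the real distance.
    candidates = set()
    for group in buckets.values():
        candidates.update(combinations(sorted(group), 2))
    return sorted(p for p in candidates if edit_distance(*p) == k)


words = ["sun", "sin", "son", "sung", "moon"]
print(pairs_at_distance(words, 1))
# [('sin', 'son'), ('sin', 'sun'), ('son', 'sun'), ('sun', 'sung')]
```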


[Reference]
Edit distance:
