Thursday, August 21, 2014

How to design a basic web crawler?

包子IT面试培训 助你拿到理想的offer!

This is a frequently asked design question in interviews, not necessarily for Google or Bing. Many other companies will still ask you this question since they maybe have a search engine internally. This question will test your basic understanding of computer science fundamentals.

Keep in mind a production web crawler could be very sophisticated and normally takes a few teams weeks/months to develop. The interviewer would not expect you to cover all the detail, but you should be able to mention some key design perspectives.

How to abstract the internet?
You should quickly realize the internet could be abstracted as a directed graph, with each page as a node and hyperlinks as an edge.

How to crawl?
BFS is normally used. However, DFS is also used in some situation, such as if your crawler has already established a connection with the website, it might just DFS all the URLs within this website to save some handshaking overhead.

Decide what you want to crawl?
Internet is huge thus your graph is huge. It is almost impossible to crawl the entire internet since it is keep growing every sec. Google probably has 60 trillion while Bing has 30 trillion, roughly speaking.

Some strategy used in practice includes

a) Having a few websites in your crawler’s most frequent list to crawl, such as some authoritative news website, etc

b) You should have lots of fetchers living on many host classes. Use machine learning to predict which websites are most likely to have frequent update, put those into the fetchers’ priority queue.

Hot to track what your fetchers have crawled?
You don’t want your fetchers to crawl some website over and over again while other websites don’t get crawled at all. There are many ways to achieve this. For example, your scheduler should generate non-duplicate jobs for fetchers. Or, your fetchers could keep track off the last visited time for each URL. Note, due to the scalability, a consistent hashmap is needed.

Respect the standard?
If a crawler sees a robots.txt on a site, it should skip crawling it. However, if this happens a lot for a fetcher, it is doing less work on the run. Some optimization could be let the fetcher pick some other sites to crawl to reduce the overhead of spawning a fetcher.

Sometime, URL on a website is not easy to extract. For example, some URLs are generated by javascript. You need to have some ways to execute those javascript and get the URL.


No comments:

Post a Comment