Web Crawlers - Crawling Policies


Continuing from my last blog entry on web crawlers, let me now give a more detailed explanation of how web crawlers traverse the Web. Web crawlers use a combination of policies to determine their crawling behavior: a selection policy, a revisit policy, a politeness policy and a parallelization policy. I shall discuss each of these below.

As only a fraction of the Web can be downloaded, a web crawler must use a selection policy to determine which resources are worth downloading; this is far more useful than downloading a random portion of the Web. An example of a selection policy is the PageRank policy (used by Google), where the importance of a page is determined by the links to and from that page. Other selection policies restrict the crawl based on the content of the page or on the resource's MIME type.
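
To make the PageRank idea concrete, here is a minimal sketch in Python (not Google's actual implementation) that iteratively scores a tiny hand-made link graph; the graph, damping factor and iteration count are all assumptions chosen for illustration. A crawler following this selection policy would download the higher-scoring pages first.

```python
# Toy PageRank sketch on a small, made-up link graph (illustrative only).
# Keys are pages, values are the pages they link to.
links = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html"],
}

damping = 0.85                                        # commonly used damping factor
ranks = {page: 1.0 / len(links) for page in links}    # start with equal scores

for _ in range(20):                                   # a few iterations suffice here
    new_ranks = {page: (1 - damping) / len(links) for page in links}
    for page, outlinks in links.items():
        share = ranks[page] / len(outlinks)           # spread this page's score
        for target in outlinks:
            new_ranks[target] += damping * share      # over the pages it links to
    ranks = new_ranks

# Higher-ranked pages would be fetched first by the selection policy.
for page, rank in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(rank, 3))
```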

Web crawlers use a revisit policy to decide when to re-download resources, with the goal of minimizing the cost of holding outdated local copies. This is important because resources on the Web are continually created, updated and deleted, often within the time it takes a web crawler to finish a single crawl of the Web, and it is undesirable for the search engine to return an outdated copy of a resource. This cost is commonly measured by freshness and age: freshness captures whether or not the local copy still matches the current version of the resource, while age measures how long the local copy has been out of date.
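
To make the two measures concrete, here is a small sketch with made-up timestamps and function names; freshness is a yes/no question (is our copy still current?), while age counts how long the copy has been stale.

```python
# Sketch of the freshness and age measures for a locally stored page.
# The timestamps and function names are illustrative, not from any
# particular crawler.

def freshness(local_copy_time, last_remote_change_time):
    """1 if our copy still matches the live page, 0 otherwise."""
    return 1 if local_copy_time >= last_remote_change_time else 0

def age(now, local_copy_time, last_remote_change_time):
    """How long the live page has been newer than our copy (0 if it isn't)."""
    if local_copy_time >= last_remote_change_time:
        return 0
    return now - last_remote_change_time

# Example: we crawled at t=100, the page changed at t=130, and it is now t=150.
print(freshness(100, 130))   # 0 -> our copy is stale
print(age(150, 100, 130))    # 20 -> it has been outdated for 20 time units
```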

The politeness policy is used so that the performance of a site is not heavily affected while the web crawler downloads a portion of it. The server can become overloaded, as it has to handle the requests of the site's visitors as well as those of the web crawler. Solutions proposed to alleviate the load include introducing an interval between successive requests, so that the crawler does not flood the server, and the robots exclusion protocol (robots.txt), through which administrators indicate which portions of the site are not to be accessed by crawlers.
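
As a rough sketch of both ideas, the snippet below uses Python's standard urllib.robotparser to respect robots.txt and a fixed pause between requests to the same host; the user-agent name, delay value and URLs are made up for the example.

```python
import time
import urllib.robotparser

USER_AGENT = "ExampleCrawler"   # hypothetical crawler name
CRAWL_DELAY = 5                 # seconds between requests to one host (assumed value)

# Respect the robots exclusion protocol: only fetch what the site allows.
robots = urllib.robotparser.RobotFileParser()
robots.set_url("http://example.com/robots.txt")
robots.read()

for url in ["http://example.com/", "http://example.com/private/page.html"]:
    if not robots.can_fetch(USER_AGENT, url):
        print("skipping (disallowed by robots.txt):", url)
        continue
    print("downloading:", url)  # the actual HTTP request would go here
    time.sleep(CRAWL_DELAY)     # pause so the server is not flooded with requests
```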

Parallelization policies are used to coordinate multiple web crawlers (or crawler processes) running over the same Web space. The goal is to maximize the download rate while preventing the crawlers from downloading the same page more than once.
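
One common way to achieve this is to partition the URL space, for example by hashing each URL's host so that a given host is always handled by the same crawler. The sketch below assumes a fixed number of crawler processes and made-up URLs.

```python
import hashlib
from urllib.parse import urlparse

NUM_CRAWLERS = 4    # assumed number of crawler processes

def assign_crawler(url):
    """Map a URL to one crawler by hashing its host, so that all pages of a
    host go to the same crawler and no page is fetched by two crawlers."""
    host = urlparse(url).netloc
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_CRAWLERS

urls = [
    "http://example.com/index.html",
    "http://example.com/about.html",
    "http://example.org/news.html",
]
for url in urls:
    print(url, "-> crawler", assign_crawler(url))
```

Partitioning by host also works well with the politeness policy, since each server is only ever contacted by a single crawler process.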
