Web Crawlers

June Huang's picture

Looking up information on the Internet has become a daily task for many of us. Thanks to the invention of search engines, it is not laborious to do. Search engines are convenient to use as they produce immediate results from countless sources. From Web pages to images and videos, we are able to search through almost everything anyone can ever find in the Web. To be able to return results, search engines first make use of a computer program called a web crawler that explores the resources on the Web. Web crawlers look at the pages’ contents and store information about the page so that when the user requests something, the search engine can find related resources and return them to the user. In this article I shall give a brief introduction of how search engines manage and find what we are interested in.

To begin with, the web crawler is given a list of URLs. The crawler visits a page and identifies keywords and links. It then determines which pieces of information are worth adding or updating. The web crawler will then download a portion of that page and index some metadata, for example the page’s URL, for future searches. The newly found links are then added to the list of URLs for the crawler to continue exploring.

Web crawlers have to select which pages it should visit to obtain information because there are infinitely many pages on the Internet and pages can be constantly added, modified or deleted. Policies are used to determine whether a page is worth visiting as it is impractical to visit every single page in the Web and possibly visiting it multiple times to check for updates. An example of a policy is Google’s PageRank policy that weighs the importance of a page depending on the links to the page and the PageRank of those pages. The number of pages that link to a specific page represents the page’s importance and therefore contributes to its PageRank. The higher the PageRank the more the page is worth indexing. Distributed web crawling is also used to share the URLs for exploring and page download so as to optimise the crawl through the Web.

References:
[1] Web crawler. (2010, December 22). In Wikipedia, The Free Encyclopedia. Retrieved 11:21, December 29, 2010, from http://en.wikipedia.org/w/index.php?title=Web_crawler&oldid=403711331
[2] PageRank. (2011, January 2). In Wikipedia, The Free Encyclopedia. Retrieved 11:15, January 6, 2011, from http://en.wikipedia.org/w/index.php?title=PageRank&oldid=405547279