| Search Engine Spiders |
| What are Spiders: |
|
| A spider is a program that a search engine uses to index web pages. Spiders are programs or automated scripts that crawl website pages and store their data in the search engine's database. Agents, bots, and robots are generally used as synonyms for spiders. The process of fetching web pages from the World Wide Web is called web crawling or spidering. Some spiders keep a copy of each visited page so that the search engine can process it further or return search results faster. |
| Checking Links |
|
| On the World Wide Web, hyperlinks are a very important factor because they let a visitor jump from one page to another. Spiders, or agents, follow these hyperlinks and also crawl the data on the linked pages. A web crawler starts its search with a list of URLs to visit (example: www.marketraise.com). Whenever it crawls a URL, it also identifies the hyperlinks on that page and adds them to its list of URLs to visit. The initial list of URLs that a spider visits is called the “seeds” (example: www.marketraise.com), and the hyperlinks that it adds after crawling a URL are called the “crawl frontier” (example: www.marketraise.com/services.php). |
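The seed and crawl-frontier loop described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the `PAGES` dictionary is a hypothetical in-memory stand-in for the web (a real spider would fetch pages over HTTP), the `about.php` and `contact.php` pages and all link contents are invented for the example, and `LinkExtractor` and `crawl` are illustrative names rather than part of any search engine.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "web": URL -> HTML. Page contents are invented.
PAGES = {
    "http://www.marketraise.com/":
        '<a href="/services.php">Services</a> <a href="/about.php">About</a>',
    "http://www.marketraise.com/services.php":
        '<a href="/">Home</a> <a href="/contact.php">Contact</a>',
    "http://www.marketraise.com/about.php":
        '<a href="/services.php">Services</a>',
    "http://www.marketraise.com/contact.php": "",
}


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, pages):
    """Visit pages breadth-first from the seeds and return the visit order."""
    frontier = deque(seeds)  # URLs waiting to be visited (the crawl frontier)
    seen = set(seeds)        # URLs already queued, so nothing is visited twice
    visited = []
    while frontier:
        url = frontier.popleft()
        html = pages.get(url)
        if html is None:     # page does not exist in this toy web
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative hyperlinks
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return visited


order = crawl(["http://www.marketraise.com/"], PAGES)
for url in order:
    print(url)
```

Starting from the single seed, the spider discovers `services.php` and `about.php` first, and reaches `contact.php` only because `services.php` links to it.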
| Beneficial to know about bad spiders |
|
| When talking about spiders, it is useful to know about bad spiders, because not all of them are good. Some agents or spiders come from software such as Teleport Pro, and we do not know who operates them, but they are not good for your site. Teleport Pro is an application that lets someone make a full mirror of your site, so take these spiders seriously. If anyone does this to your site, do not sit back and let it happen. To stop it, you only have to write two lines in your robots.txt: |
User-agent: NameOfAgent
Disallow: / |
| Do not rely on a blank robots.txt to keep spiders out: an empty file places no restrictions, so every spider may crawl your site. For more details, visit www.marketraise.com |
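To check how a two-line rule like the one above is interpreted, Python's standard `urllib.robotparser` module can parse the rule text directly. The sketch below assumes the placeholder agent name `NameOfAgent` from the example; `SomeOtherBot` and the helper `is_allowed` are hypothetical names introduced only for illustration.

```python
from urllib.robotparser import RobotFileParser

# The two-line rule from the text; "NameOfAgent" is a placeholder name.
ROBOTS_TXT = """\
User-agent: NameOfAgent
Disallow: /
"""


def is_allowed(robots_txt, agent, url):
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


url = "http://www.marketraise.com/services.php"
blocked_agent = is_allowed(ROBOTS_TXT, "NameOfAgent", url)
other_agent = is_allowed(ROBOTS_TXT, "SomeOtherBot", url)
print(blocked_agent, other_agent)
```

The named agent is refused while any other agent is still allowed. Keep in mind that robots.txt only works for spiders that choose to read and obey it.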
| Selection Policy |
|
| According to a study by Lawrence and Giles (Lawrence and Giles, 2000), search engines index only 16% of the web. Lawrence worked at NEC Research, where he was responsible for the creation of a search engine, and is currently an employee at Google. Giles is Professor of Computer Science and Engineering, Professor of Supply Chain and Information Systems, and Director of the Intelligent Systems Research Laboratory. A crawler downloads only a fraction of all web pages, so it matters that the downloaded fraction contains the relevant pages. |
| The importance of a page that a spider downloads depends on its popularity in terms of links or visits, and even on its URL. Creating a good selection policy is very difficult: it must work with limited information, because the complete set of web pages is unknown during the crawling process. |
| Najork and Wiener (Najork and Wiener, 2001) ran a crawl of 328 million pages using breadth-first (BFS) ordering. In this experiment they found that pages with a high PageRank get crawled early. Their explanation is that pages with a high PageRank have many links from numerous hosts, and those links are found early. |
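As a contrast to plain breadth-first ordering, a selection policy can be sketched as a priority frontier that always visits the page with the most in-links discovered so far. This is only an illustrative heuristic under stated assumptions: it is not PageRank and not the method of either study cited above, the `LINKS` graph is invented, and `crawl_by_inlinks` and `budget` are hypothetical names. It shows how a policy can rank pages using only the limited information gathered during the crawl.

```python
import heapq
from collections import defaultdict

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "seed": ["a", "b", "c"],
    "a": ["popular"],
    "b": ["popular"],
    "c": ["popular", "rare"],
    "popular": [],
    "rare": [],
}


def crawl_by_inlinks(seed, links, budget):
    """Crawl at most `budget` pages, preferring pages with more known in-links."""
    inlinks = defaultdict(int)  # in-links discovered so far for each page
    done = set()                # pages already visited
    visited = []
    counter = 0                 # tie-breaker: earlier discoveries come first
    heap = [(0, counter, seed)]  # entries are (-inlinks, counter, url)
    while heap and len(visited) < budget:
        _, _, url = heapq.heappop(heap)
        if url in done:         # outdated entry for a page already visited
            continue
        done.add(url)
        visited.append(url)
        for target in links.get(url, []):
            if target in done:
                continue
            inlinks[target] += 1
            counter += 1
            # Push a fresh entry; older entries for the same page are
            # skipped later by the `done` check above.
            heapq.heappush(heap, (-inlinks[target], counter, target))
    return visited


order = crawl_by_inlinks("seed", LINKS, budget=5)
print(order)
```

In this toy graph, the page named `popular` is visited before `c` because two in-links to it are already known, whereas a breadth-first crawl would visit `c` first. With a budget of five pages, `rare` is never downloaded.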