| Topic R |
| Robots.txt |
It is the standard protocol (the Robots Exclusion Protocol) that tells crawlers which areas of a web site they must not crawl. In other words, it keeps spiders from accessing every area of the web site.
It is a simple text file, named robots.txt, which is placed in your root directory to tell robots which areas of your web site they are allowed to visit and index. |
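Because the file lives in the root directory, its address can be worked out from any page on the site. The snippet below is a minimal sketch of that idea; the function name robots_url and the example.com address are illustrative assumptions, not part of the protocol.

```python
from urllib.parse import urljoin


def robots_url(page_url: str) -> str:
    """Return the robots.txt address for the site that serves page_url."""
    # A path starting with "/" resolves against the site root,
    # so the page's own directory and file name are dropped.
    return urljoin(page_url, "/robots.txt")


# Hypothetical page address, used only for illustration.
print(robots_url("https://www.example.com/products/list.html"))
```

Whatever page the crawler starts from, the result points at the same single file at the top of the site.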
| Robots help search engines document and categorize web sites. A crawler is a special program run by the search engine. At regular intervals, crawlers revisit pages to check for changes and for new information added by site owners. |
But before crawling, they check two things:
(1) the number of pages
(2) robots.txt |
First, the number of pages tells them how many pages the web site contains. Say, for example, a site has 22 pages.
|
| Second, robots.txt tells crawlers which pages may be visited and indexed by the search engines. Now say, for example, the site owner has permitted only 6 pages to be visited. Here robots.txt will instruct the spiders to crawl only those six pages. |
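The check a crawler makes before visiting pages can be sketched with Python's standard urllib.robotparser module. This is only an illustration under stated assumptions: the rules, the page list, the helper name crawlable_pages, and the crawler name ExampleBot are all made up for the example.

```python
from urllib.robotparser import RobotFileParser


def crawlable_pages(rules_text, pages, agent="ExampleBot"):
    """Return only the pages that the given robots.txt rules let `agent` visit."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(rules_text.splitlines())
    return [page for page in pages if parser.can_fetch(agent, page)]


# Hypothetical rules and pages, used only for illustration.
RULES = "User-agent: *\nDisallow: /drafts/\n"
PAGES = ["/index.html", "/contact.html", "/drafts/a.html", "/drafts/b.html"]

print(crawlable_pages(RULES, PAGES))
```

The site has four pages here, but a well-behaved spider would crawl only the two that are not disallowed.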
Rules with examples
The “*” wildcard on the User-agent line matches every robot, and an empty Disallow line lets them visit all files.
User-agent: *
Disallow:
Disallowing “/” keeps all robots out of the entire site.
User-agent: *
Disallow: / |
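The difference between these two rule sets can be checked with Python's standard urllib.robotparser module. This is a minimal sketch; the helper name is_allowed and the crawler name ExampleBot are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(rules_text: str, agent: str, path: str) -> bool:
    """Parse robots.txt text and report whether `agent` may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser.can_fetch(agent, path)


ALLOW_ALL = "User-agent: *\nDisallow:\n"    # empty Disallow: nothing is blocked
BLOCK_ALL = "User-agent: *\nDisallow: /\n"  # "/" matches every path on the site

print(is_allowed(ALLOW_ALL, "ExampleBot", "/index.html"))  # True
print(is_allowed(BLOCK_ALL, "ExampleBot", "/index.html"))  # False
```

A single "/" after Disallow is the whole difference between an open site and a closed one, so it is worth double-checking before uploading the file.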
The next example keeps crawlers out of four directories.
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /tmp/
Disallow: /private/ |
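Each Disallow line is matched as a path prefix, so a directory rule covers everything beneath that directory and nothing else. The sketch below checks this with Python's standard urllib.robotparser module; the crawler name ExampleBot and the sample paths are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /tmp/
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# Hypothetical paths: two inside blocked directories, two outside them.
for path in ["/images/logo.png", "/private/a/b.html", "/index.html", "/imagesets/1.html"]:
    print(path, parser.can_fetch("ExampleBot", path))
```

Note that "/imagesets/1.html" stays crawlable: it does not begin with "/images/", which is why the trailing slash on each directory matters.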
The next example keeps crawlers away from one specific file.
User-agent: *
Disallow: /directory/file.html |
An image can also be removed from Google’s image search by using the following syntax.
User-agent: Googlebot-Image
Disallow: /images/cats.jpg |
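A rule group that names one robot applies only to that robot. The sketch below illustrates this with Python's standard urllib.robotparser module; it shows how that module reads the rules, not how Google itself processes them, and the second crawler name ExampleBot and the dogs.jpg path are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: Googlebot-Image
Disallow: /images/cats.jpg
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# The named robot is kept away from the one listed image only.
print(parser.can_fetch("Googlebot-Image", "/images/cats.jpg"))  # False
print(parser.can_fetch("Googlebot-Image", "/images/dogs.jpg"))  # True
# A robot with no matching group (and no "*" group) is not restricted.
print(parser.can_fetch("ExampleBot", "/images/cats.jpg"))       # True
```

To keep every robot away from the image, the same Disallow line would have to appear under a "User-agent: *" group as well.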
| Replacing robots.txt with the robots meta tag |
To prevent robots from processing a single page, the robots meta tag could be used in that page instead. For example,
<meta name="robots" content="noindex,nofollow" /> |
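How a crawler might read this tag can be sketched with Python's standard html.parser module. The class name RobotsMetaParser is an assumption made for the example, and real crawlers may handle the tag differently.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from any <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        # The content attribute is a comma-separated list of directives.
        content = attributes.get("content") or ""
        self.directives.update(
            part.strip().lower() for part in content.split(",") if part.strip()
        )


page = '<html><head><meta name="robots" content="noindex,nofollow" /></head></html>'
meta_parser = RobotsMetaParser()
meta_parser.feed(page)
print(sorted(meta_parser.directives))  # ['nofollow', 'noindex']
```

Here "noindex" asks the robot not to index the page and "nofollow" asks it not to follow the links on it. Unlike robots.txt, the tag is read only after the page itself has been fetched.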