Contrary to popular belief, the spiders sent out by the major search engines do not have to crawl everything on a site. You can keep a spider away from a page by instructing it, through a robots meta tag or a robots.txt file, not to come near the page.
Webmasters can instruct spiders not to crawl certain files or directories through the standard robots.txt file in the root directory of the domain. Additionally, a page can be explicitly excluded from a search engine's index by using a robots meta tag. If for some reason you do not want a search engine spider to crawl a page, you have the means to stop it.
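To make the meta tag option concrete, here is a minimal Python sketch of how a crawler might check a page for a robots meta tag. The class name `RobotsMetaParser`, the helper `is_indexable`, and the sample page are hypothetical names made up for illustration. Real search engines have their own parsers and may honor more directives than the `noindex` value checked here.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of any <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # Only the generic "robots" name is handled in this sketch.
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def is_indexable(html):
    """Return False if the page asks robots not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" not in parser.directives


# Hypothetical page that opts out of indexing.
PAGE = (
    '<html><head><meta name="robots" content="noindex, nofollow"></head>'
    "<body>Private</body></html>"
)

print(is_indexable(PAGE))
```

A page without the tag is treated as indexable, which matches the default behavior of spiders: they index whatever they are not told to skip.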
When a search engine visits a site, the robots.txt file located in the root folder is the first file it fetches. The file is then parsed, and only pages that are not disallowed are crawled. However, this is not always foolproof. Spiders have a habit of leaving a page and then coming back to look at it a second time later, and a crawler may keep a cached copy of robots.txt between visits. If the file has changed in the meantime, the crawler may on occasion crawl pages the webmaster does not want crawled.
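The caching problem can be sketched with Python's standard `urllib.robotparser` module. The two robots.txt versions, the `/drafts/` path, and the `ExampleBot` user agent below are assumptions for illustration only. The point is that a crawler still working from an old copy of the file reaches a different decision than one that has fetched the current copy.

```python
from urllib.robotparser import RobotFileParser


def parse_rules(robots_txt):
    """Parse robots.txt text into a rule set without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Hypothetical robots.txt as it looked when the crawler cached it.
OLD_ROBOTS = """\
User-agent: *
Disallow: /admin/
"""

# Hypothetical robots.txt after the webmaster added a new rule.
NEW_ROBOTS = """\
User-agent: *
Disallow: /admin/
Disallow: /drafts/
"""

cached_rules = parse_rules(OLD_ROBOTS)
fresh_rules = parse_rules(NEW_ROBOTS)

url = "https://example.com/drafts/post.html"
stale_decision = cached_rules.can_fetch("ExampleBot", url)  # uses the old copy
fresh_decision = fresh_rules.can_fetch("ExampleBot", url)   # uses the new copy

print("cached copy allows crawl:", stale_decision)
print("fresh copy allows crawl:", fresh_decision)
```

Until the crawler refreshes its copy, the newly disallowed page can still be crawled, so a rule added to robots.txt should not be expected to take effect immediately.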
Pages that most webmasters prefer not to have crawled include login-specific pages such as shopping carts and user-specific content such as results from internal searches. Depending on the content, you might also want to exclude a guest book that you expect to fill with spam or a feedback page that is not very flattering to you. It is also a good idea to instruct the spiders not to crawl a page with a lot of animation or Flash on it, as a spider can mistake such a page for a malfunctioning site.
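As a rough illustration, a robots.txt covering the kinds of pages listed above might look like the text embedded in the sketch below. Every path (`/cart/`, `/search/`, `/guestbook/`, `/intro/`) and the `ExampleBot` user agent are hypothetical, so substitute whatever your own site uses. The Python around it only checks which URLs a well-behaved spider would still be allowed to crawl.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules; the paths are placeholders, not a recommendation
# for any particular site layout.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cart/
Disallow: /search/
Disallow: /guestbook/
Disallow: /intro/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

urls = [
    "https://example.com/",
    "https://example.com/products/widget.html",
    "https://example.com/cart/checkout",
    "https://example.com/search/shoes",
    "https://example.com/guestbook/page2.html",
    "https://example.com/intro/animation.html",
]

# Keep only the URLs that the rules above do not disallow.
crawlable = [u for u in urls if parser.can_fetch("ExampleBot", u)]

for u in crawlable:
    print(u)
```

Keep in mind that robots.txt is only a request. It is honored by the major search engines, but it does not protect a page from visitors or from spiders that ignore the file.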