More reasons to use Robots.txt
- Jul. 31, 2003
Some disturbing news today shows just how effective search engines are at crawling and indexing a site. The report says that hackers are using search engines like Google to find pages containing passwords and other personal information for a website and then using this information to hack the website and take it over.
We previously told you how to use your log files to detect hack attempts. This goes one step further: by keeping sensitive pages and folders away from the prying eyes of spiders, you reduce the chances of those pages turning up in a search. By using a simple file called "robots.txt" you can add another layer of protection to your website. This file tells well-behaved search engine spiders which folders are accessible and which are off limits.
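As a rough illustration of how a compliant spider reads these rules, here is a minimal sketch using Python's standard urllib.robotparser module. The folder names and the example.com domain are placeholders we made up for the example, not recommendations for any particular site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; the folder names are illustrative only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
"""


def build_parser(robots_text):
    """Parse robots.txt text roughly the way a compliant spider would."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# example.com is a placeholder domain.
for url in (
    "http://example.com/index.html",
    "http://example.com/admin/login.html",
    "http://example.com/private/notes.txt",
):
    verdict = "allowed" if parser.can_fetch("*", url) else "off limits"
    print(url, "->", verdict)
```

Only spiders that choose to honor the file behave this way, so the sketch shows what a cooperative crawler would do rather than anything the server enforces.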
Be warned, however, that an improperly coded robots.txt file can do more harm than good. Since search engine spiders generally request this file before crawling a site, you could keep them from indexing it by inadvertently disallowing portions of the site that you need to have indexed in the search engines. For more information visit this site to understand what the robots.txt is and how to use it properly.
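One way to catch this kind of mistake before it costs you rankings is to test a draft robots.txt against the pages you need indexed. The sketch below, again using Python's standard urllib.robotparser, assumes a hypothetical list of must-index URLs and two made-up rule sets: one that accidentally blocks the whole site and one that blocks a single folder.

```python
from urllib.robotparser import RobotFileParser


def blocked_pages(robots_text, urls, agent="*"):
    """Return the URLs that a compliant spider would refuse to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return [url for url in urls if not parser.can_fetch(agent, url)]


# Hypothetical pages that must stay in the search engines.
MUST_BE_INDEXED = [
    "http://example.com/",
    "http://example.com/products.html",
    "http://example.com/contact.html",
]

# "Disallow: /" matches every path, so it shuts spiders out of the whole site.
TOO_BROAD = "User-agent: *\nDisallow: /\n"

# A narrower rule that only covers one (illustrative) folder.
TARGETED = "User-agent: *\nDisallow: /admin/\n"

print("Too broad blocks:", blocked_pages(TOO_BROAD, MUST_BE_INDEXED))
print("Targeted blocks:", blocked_pages(TARGETED, MUST_BE_INDEXED))
```

If the first list is not empty for your own draft file, the rules are disallowing something you want indexed and should be tightened before the file goes live.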
When a site does not have a robots.txt file, or has one that is improperly coded, a search engine spider is given no guidance on what is or isn't acceptable. Without the file, most spiders assume that any link they find is crawlable and indexable. While this isn't a problem for most sites, there are times when it can cause trouble. Let me give you an example of such an exploit.
One commonly exploited file is the bash history file, ".bash_history". Bash stands for Bourne Again Shell, a command shell for Unix and Linux that lets you execute commands on these operating systems. The problem with this file is (as you probably guessed) the history it stores as commands are executed. If a spider can find the bash history file, it will index it and likely cache it (as Google does). By searching for "bash history", a hacker could find one of these files and view the cached version (whether or not the page has since been removed), which could contain recently used commands, user IDs and passwords.
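Excluding such a file from spiders helps, but it is also worth checking whether it is sitting inside your web folders in the first place. Here is a minimal sketch that walks a directory tree and reports files with risky names. The bash history file is normally named ".bash_history"; the rest of the list, the function name and the demo folder are our own illustrative assumptions.

```python
import os
import tempfile

# Illustrative, non-exhaustive list of filenames that should not be served.
RISKY_NAMES = {".bash_history", ".htpasswd"}


def find_risky_files(web_root):
    """Walk web_root and return sorted paths of files with risky names."""
    hits = []
    for folder, _subdirs, files in os.walk(web_root):
        for name in files:
            if name in RISKY_NAMES:
                hits.append(os.path.join(folder, name))
    return sorted(hits)


# Demo against a throwaway folder standing in for a real web root.
with tempfile.TemporaryDirectory() as demo_root:
    os.makedirs(os.path.join(demo_root, "home"))
    for relative in ("index.html", os.path.join("home", ".bash_history")):
        with open(os.path.join(demo_root, relative), "w") as handle:
            handle.write("placeholder\n")

    for path in find_risky_files(demo_root):
        print("Should not be web accessible:", path)
```

On a real server you would point find_risky_files at your own document root and move or delete anything it reports.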
This is only one of many common exploits that a hacker could use a search engine to find. Another is the hidden link: a small (1x1 pixel) image that only you, the website owner, know about, which links to a secure area of the website. If you haven't excluded the path to that area, a spider will follow the link and index the page, leaving a cached version of the page viewable by anyone.
So while we recommend caution when implementing a robots.txt file, it may be worth your while to research what it is and how to use it properly to deter these types of hacks. And always remember to view your log files to see if indeed any hack attempts have been made.
Rob Sullivan
Production Supervisor
Searchengineposition.com
Search Engine Positioning specialists