Published on May 23rd, 2019
Why should I have a robots.txt file? Isn’t it great when search engines visit my site frequently and index my content?
What? You Don’t Want All of Your Online Content to Be Indexed?
Perhaps you want to avoid the risk of a duplicate content penalty. Your site might also contain sensitive data that you do not want the world to see.
In either case, you would prefer that search engines do not index those pages.
Robots.txt is a simple text (not HTML) file placed on your web server that tells web crawlers such as Googlebot whether or not they may access a file.
By default, search engines are greedy. They want to index as much high-quality information as they can, and they will assume that they can crawl everything unless you tell them otherwise. If you specify rules for all bots (*) and rules for a specific bot (like Googlebot), then that bot follows the rules written specifically for it and ignores the global/default rules.
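The precedence of a bot-specific group over the default group can be checked with a small sketch. It uses Python's standard-library `urllib.robotparser`; the rules, the paths, and the `ExampleBot` name are made up for illustration, and real crawlers may differ in details.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one default group and one Googlebot-specific group.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /drafts/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Googlebot follows only its own group, so the default /private/ rule
# does not apply to it.
googlebot_private = parser.can_fetch("Googlebot", "https://www.example.com/private/report.html")
googlebot_drafts = parser.can_fetch("Googlebot", "https://www.example.com/drafts/post.html")

# Any other crawler (here the made-up "ExampleBot") falls back to the * group.
other_private = parser.can_fetch("ExampleBot", "https://www.example.com/private/report.html")
other_drafts = parser.can_fetch("ExampleBot", "https://www.example.com/drafts/post.html")

print(googlebot_private, googlebot_drafts, other_private, other_drafts)
# prints: True False False True
```

If Googlebot should also stay out of /private/, that rule has to be repeated inside the Googlebot group.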
In practice, robots.txt files indicate whether certain user agents (web-crawling software) can or cannot crawl parts of a website. These crawl instructions are specified by “disallowing” or “allowing” the behavior of certain (or all) user agents.
A robots.txt file lives at the root of your site. So, for site www.example.com, the robots.txt file lives at www.example.com/robots.txt. robots.txt is a plain text file that follows the Robots Exclusion Standard. A robots.txt file consists of one or more rules. Each rule blocks (or allows) access for a given crawler to a specified file path in that website. However, before you create or edit robots.txt, you should know the limits of this URL blocking method. At times, you might want to consider other mechanisms to ensure your URLs are not findable on the web.
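To illustrate how a rule blocks or allows a file path, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The paths and the `ExampleBot` name are hypothetical. Python's parser applies the first rule that matches, so the Allow line is listed before the broader Disallow line; other crawlers may resolve overlapping rules differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: block /downloads/ but allow one subfolder inside it.
ROBOTS_TXT = """\
User-agent: *
Allow: /downloads/public/
Disallow: /downloads/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(path, agent="ExampleBot"):
    """Return True if the given crawler may fetch the path on www.example.com."""
    return parser.can_fetch(agent, "https://www.example.com" + path)


paths = ["/downloads/public/guide.pdf", "/downloads/internal.zip", "/about.html"]
results = {path: allowed(path) for path in paths}

for path, ok in results.items():
    print(path, "allowed" if ok else "blocked")
```

A path that matches no rule at all, such as /about.html here, is treated as allowed.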
Websites always store the robots.txt file at the root of the website. Search engine spiders look for robots.txt at the root of a domain. They will not look anywhere else on the website, so you cannot specify a different location for it. The file name is also always written in lower case: robots.txt.
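Because the location is fixed, the robots.txt address for any page can be derived from the page's own URL. The helper below is a small sketch; `robots_url` is a name chosen for this example, not part of any library.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Build the robots.txt URL for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://www.example.com/blog/2019/05/post.html?ref=home"))
# prints: https://www.example.com/robots.txt
```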
Robots.txt tells search engine spiders which parts of the website to index and which parts to ignore. In that way, it directs their movement through the site.
It works like a recommendation: robots.txt recommends which parts of the site to index.
2. Creating a Robots.txt File
How do you create a robots.txt file? First, you need to identify what should be included in it. A robots.txt file is like a list of instructions. One part of the file is the User-agent line. It tells the robots or spiders reading the file which of them should pay attention to the instructions that follow. The User-agent value is often “*”, which means “all robots”.
The rules themselves follow the User-agent line. Remember that there should not be any blank lines within a set of instructions. The instructions in robots.txt often follow these formats:
- Disallow: /folder/
- Disallow: /file.htm
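The advice about blank lines can be tested against at least one real parser. The sketch below uses Python's standard-library `urllib.robotparser`, which treats a blank line as the end of a group; other crawlers may be more forgiving. The `ExampleBot` name and the page URL are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def can_fetch_folder(robots_txt):
    """Return True if a crawler may fetch a page inside /folder/."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("ExampleBot", "https://www.example.com/folder/page.htm")


compact = "User-agent: *\nDisallow: /folder/\n"
# Same rules, but with a blank line between User-agent and Disallow.
with_blank_line = "User-agent: *\n\nDisallow: /folder/\n"

print(can_fetch_folder(compact))          # prints: False
print(can_fetch_folder(with_blank_line))  # prints: True
```

In the second case this parser silently drops the Disallow rule, so the folder is treated as crawlable.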
Each line must contain only one instruction. Anything you put after “#” is ignored completely, because the spiders treat it as a comment. It is therefore advisable to write each comment on a separate line by itself.
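One way to keep to these formatting rules is to generate the file rather than type it by hand. The following is a minimal sketch: `build_robots_txt` and its input format are invented for this example, and it only covers User-agent and Disallow lines.

```python
def build_robots_txt(groups):
    """Build robots.txt text from (comment, user_agent, disallowed_paths) tuples."""
    lines = []
    for comment, user_agent, disallowed_paths in groups:
        if lines:
            lines.append("")  # a blank line only between groups, never inside one
        lines.append(f"# {comment}")  # each comment sits on its own line
        lines.append(f"User-agent: {user_agent}")
        for path in disallowed_paths:
            lines.append(f"Disallow: {path}")  # one instruction per line
    return "\n".join(lines) + "\n"


text = build_robots_txt([
    ("Rules for every crawler", "*", ["/folder/", "/file.htm"]),
])
print(text)
```

The result is a four-line file: the comment, the User-agent line, and one Disallow line per path.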
When creating your site’s robots.txt file, be extra careful with your syntax and commands. If the robots fail to recognize a command in your robots.txt, they may misinterpret it as an instruction to stay away from your website. Incorrect syntax may also prevent your entire site from being indexed.
What you need to do is create the robots.txt file and check it two or three times before you upload it. This helps ensure that your site is indexed properly, and it minimizes the chance of errors in commands and syntax.
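Part of that checking can be automated. The sketch below is a simple, hypothetical linter, not an official validator: the `lint_robots_txt` name and the list of known directives are assumptions made for this example, and a clean result does not guarantee that every crawler will read the file the same way.

```python
# Directive names this sketch accepts; an assumption, not an exhaustive list.
KNOWN_DIRECTIVES = {"user-agent", "disallow", "allow", "sitemap", "crawl-delay"}


def lint_robots_txt(text):
    """Return a list of (line_number, message) for suspicious lines."""
    problems = []
    seen_user_agent = False
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if ":" not in line:
            problems.append((number, "missing ':' separator"))
            continue
        name, value = (part.strip() for part in line.split(":", 1))
        key = name.lower()
        if key not in KNOWN_DIRECTIVES:
            problems.append((number, f"unknown directive '{name}'"))
        elif key == "user-agent":
            seen_user_agent = True
        elif key in {"allow", "disallow"}:
            if not seen_user_agent:
                problems.append((number, "rule appears before any User-agent line"))
            if value and not value.startswith(("/", "*")):
                problems.append((number, "path should start with '/'"))
    return problems


draft = """\
User-agent: *
Dissallow: /folder/
Disallow: file.htm
"""

for number, message in lint_robots_txt(draft):
    print(f"line {number}: {message}")
```

For this draft the linter reports the misspelled directive on line 2 and the path without a leading slash on line 3.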