The robots.txt file tells search engine bots not to crawl certain pages, directories, and sections of a website. Its creation and optimization are important in SEO.
Most major search engines, such as Google, Bing, and Yahoo, recognize and respect the protocol formulated in robots.txt files.
SEO log analysis helps to understand how search engine robots interact with the site and to optimize Google's crawl budget by avoiding indexing unnecessary parts of the site.
The robots.txt file is part of the Robots Exclusion Protocol (REP), a group of web standards that governs how robots:
- explore the web
- access and index content
- serve this content to users
The REP also includes guidelines such as meta robots, as well as instructions regarding how search engines treat links (such as "follow" or "nofollow").
User-agent: [user-agent name]
Disallow: [URL string not to be crawled]
These two lines are considered a complete robots.txt file. It can contain multiple sets of directives.
Each set of directives is separated by a newline. The file can be created with a simple text editor.
Search engine spiders, when they arrive at a website, look for the robots.txt file before crawling the site. The file instructs crawlers on how to analyze the site in question. If it contains no instructions or is absent, the robot will crawl the site without restrictions.
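This pre-crawl check can be sketched with Python's standard-library robots.txt parser; the rules and URLs below are hypothetical examples, not from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules; a real crawler would fetch them
# from the site's /robots.txt before crawling any page.
rules = """
User-agent: *
Disallow: /admin/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# The parser answers the same question a spider asks before each fetch.
print(parser.can_fetch("*", "https://www.example.com/admin/login"))  # False
print(parser.can_fetch("*", "https://www.example.com/blog/post"))    # True
```

As the article notes, if no rule matches (or no file exists), the default answer is that crawling is allowed.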
The robots.txt file is not crucial for many websites, especially smaller ones, but creating and using it can serve many purposes, some of which matter for security, privacy, and SEO.
Before a robot such as Googlebot crawls a webpage, it first checks whether a robots.txt file exists and, if so, generally follows the instructions in that file.
Some useful functions of robots.txt:
- Prevent the indexing of a sensitive page or directory (admin, login pages, e-commerce basket, etc.).
- Prevent server overload.
- Block access to entire sections of your site (although password protection remains more prudent).
- Prevent your site's internal search results pages from being crawled, indexed, or displayed in search results.
- Prevent duplicate content from appearing in SERPs.
- Specify the location of sitemaps for robots and facilitate indexing.
- Maximize crawl budget by blocking unimportant pages, so Googlebot can spend more of your crawl budget on the pages that really matter.
- Prevent search engines from indexing certain files on your website (images, PDFs, etc).
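Several of these uses can be combined in a single file. A hypothetical robots.txt putting some of them together might look like this (domain and paths are placeholders; note that the `*` and `$` pattern syntax in the PDF rule is supported by Google but is not part of the original standard):

```
User-agent: *
# Keep sensitive and low-value areas out of the crawl
Disallow: /admin/
Disallow: /cart/
Disallow: /search-results/
# Block a file type (here, PDFs); Google-specific wildcard syntax
Disallow: /*.pdf$

# Point robots at the sitemap to facilitate indexing
Sitemap: https://www.sitename.com/sitemap.xml
```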
Note that although Google generally does not index web pages blocked in the robots.txt file, there is no way to guarantee exclusion from search results using this file (prefer the noindex directive or other more reliable methods).
Also, do not block old pages that return 301 redirects or 404 errors; robots need to crawl them to take the changes into account.
Be aware that the file can include directives for as many user agents as you want. You can use the asterisk (*) wildcard character to assign directives to all user agents.
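For instance, a file can mix a wildcard group with bot-specific groups; the rules below are a hypothetical illustration:

```
# Applies to all bots without a more specific group
User-agent: *
Disallow: /private/

# Applies only to Googlebot, which then ignores the * group
User-agent: Googlebot
Disallow: /private/
Disallow: /drafts/
```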
There are hundreds of user agents; Googlebot is the main one identifying Google's spiders.
Here is an example using the sitemap directive:
User-agent: *
Allow: /
Sitemap: https://www.nomdusite/sitemap_index.xml
Note that it is not necessary to repeat the sitemap directive for each user agent, so it is best to include sitemap directives at the beginning or end of your robots.txt file.
You can include as many sitemaps as you want.
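For example, a site that splits its sitemap could declare each one once, outside any user-agent group (URLs are placeholders):

```
User-agent: *
Disallow:

Sitemap: https://www.sitename.com/sitemap_pages.xml
Sitemap: https://www.sitename.com/sitemap_posts.xml
```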
Here are some examples of robots.txt files configured for a site, www.sitename.com.
Robots.txt URL: www.sitename.com/robots.txt
User-agent: *
Disallow: /
Using this syntax tells all spiders not to crawl the pages of the site, including the home page.
User-agent: *
Disallow:
Using this syntax tells bots to crawl all pages on the site, including the home page.
User-agent: Googlebot
Disallow: /subfolder/
This tells only the Google bot (user-agent name Googlebot) not to crawl pages whose URL contains the string www.sitename.com/subfolder/.
User-agent: Bingbot
Disallow: /subfolder/page.html
This tells only Bing's crawler to avoid crawling the specific page at www.sitename.com/subfolder/page.html.
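As a sketch, the two per-bot examples above can be checked with Python's built-in parser, using the article's placeholder domain and paths:

```python
from urllib.robotparser import RobotFileParser

# The two bot-specific rule groups from the examples above.
rules = """
User-agent: Googlebot
Disallow: /subfolder/

User-agent: Bingbot
Disallow: /subfolder/page.html
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# Googlebot is barred from the whole subfolder...
print(rp.can_fetch("Googlebot", "https://www.sitename.com/subfolder/page.html"))  # False
# ...while Bingbot is barred only from that single page.
print(rp.can_fetch("Bingbot", "https://www.sitename.com/subfolder/other.html"))   # True
```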
In a typical scenario, your robots.txt file should have the following content.
User-agent: *
Disallow:
Sitemap: https://www.nomdusite.com/sitemap.xml
If you already have one on your website, it will be accessible at a similar URL.
You can use Google Search Console to verify your sitemap or tools like XML Sitemap Validator.
For example, to control crawl behavior on sitename.com, it must be accessible at sitename.com/robots.txt.
If you want to control bot crawling on a subdomain such as blog.sitename.com, the file should be accessible at blog.sitename.com/robots.txt.
Here are some tips and best practices to better manage crawl restrictions and the SEO of your website.
- Do not block CSS or JS folders. During the crawling and indexing process, Google may render a website as a real user would. If your pages need JS and CSS to work properly, they should not be blocked.
- Links on pages blocked by robots.txt will not be followed. Use a different blocking mechanism if links need to be followed.
- Do not use it to prevent sensitive data from being indexed or accessed. If you want to keep a page or directory out of search results, use a different method, such as password protection or the noindex meta directive.
- Test it out and make sure you're not blocking any part of your website that you want to appear in search engines.
- On a WordPress site, it is not necessary to block access to your wp-admin and wp-includes folders. WordPress does a great job with the robots meta tag.
- There is no need to specify different rules for each search engine; it can be confusing and hard to keep up to date. It is better to use User-agent: * and provide one set of rules for all bots.
- If you modify the file and want Google to pick up the changes faster, you can submit the URL of the modified file to Google.