Robots.txt is a text file that webmasters create to tell robots (usually search engine crawlers) how to crawl the pages of their website.
Creating and optimizing it is important for SEO.
What is the use of the Robots.txt file?
The robots.txt file tells search engine bots not to crawl certain pages, directories, and sections of a website.
Most major search engines, such as Google, Bing, and Yahoo, recognize and respect the protocol set out in robots.txt files.
SEO log analysis helps you understand how search engine robots interact with the site and optimize Google's crawl budget by keeping unnecessary parts of the site from being crawled and indexed.
The robots.txt file is part of the Robots Exclusion Protocol (REP), a group of web standards that governs how robots:
- explore the web,
- access and index content,
- serve this content to users.
The REP also includes guidelines such as meta robots, as well as instructions regarding how search engines treat links (such as "follow" or "nofollow").
Basic format:
User-agent: [user-agent name]
Disallow: [URL string that should not be crawled]
Together, these two lines are considered a complete robots.txt file, which can be created with a simple text editor.
A file can contain multiple groups of directives, each group separated by a blank line.
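As a small sketch, a file with two groups of directives might look like this (the user agent and paths are purely illustrative):
User-agent: Googlebot
Disallow: /example-archive/

User-agent: *
Disallow: /example-tmp/
Each group applies to the user agent(s) named at the top of that group.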
How does the robots.txt file work?
Search engine bots like Googlebot crawl the web for content and index it so that it can be served in search results according to its relevance.
Once they arrive at a website, search engine spiders look for the robots.txt file before crawling the site. The file instructs crawlers on how they should analyze the site in question. If it contains no instructions or is absent, the robots will crawl the site without restrictions.
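For example, if the fetched file contained the two hypothetical lines below, a compliant bot would skip everything under /example-private/ and crawl the rest of the site without restriction:
User-agent: *
Disallow: /example-private/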
Why is the Robots.txt file important in SEO?
The robots.txt file is not crucial for many websites, especially smaller ones, but creating and using it can serve many purposes, some of which can be of real importance for security, privacy, and SEO.
Before a robot such as Googlebot crawls a web page, it first checks whether a robots.txt file exists, and if it does, it will generally follow and obey the instructions in this file.
Some useful functions of robots.txt (illustrated in the sample file after this list):
- Prevent the indexing of a sensitive page or directory (admin, login pages, e-commerce basket, etc.).
- Prevent server overload.
- Block access to entire sections of your site (although password protection remains more prudent).
- Prevent your site's internal search results pages from being crawled, indexed, or displayed in search results.
- Prevent duplicate content from appearing in SERPs.
- Specify the location of sitemaps for robots and facilitate indexing.
- Maximize crawl budget: by blocking unimportant pages, Googlebot can spend more of your crawl budget on the really important pages.
- Prevent search engines from indexing certain files on your website (images, PDFs, etc.).
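As a sketch combining several of these uses (all paths and the sitemap URL are hypothetical and must be adapted to your own site; note that the $ end-of-URL wildcard is supported by Google and Bing but not by every crawler):
User-agent: *
Disallow: /admin/
Disallow: /cart/
Disallow: /search/
Disallow: /*.pdf$
Sitemap: https://www.sitename.com/sitemap.xml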
Note that although Google generally does not index web pages blocked in the robots.txt file, there is no way to guarantee exclusion from search results using this file alone (prefer noindex or other more reliable methods).
It is also counterproductive to block old pages that return 301 redirects or 404 errors: leave them crawlable so that robots can take the changes into account.
What are Google User-agents?
Each search engine identifies itself with a different user agent, and you can define custom instructions for each of them in your robots.txt file.
A robots.txt file can include directives for as many user agents as you want, and you can use the asterisk (*) wildcard character to apply directives to all user agents.
There are hundreds of user agents; here are the main ones used to identify Google's spiders:
- Googlebot (web search)
- Googlebot-Image (images)
- Googlebot-News (news)
- Googlebot-Video (video)
- AdsBot-Google (ads landing page quality)
- Mediapartners-Google (AdSense)
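For example, assuming a hypothetical folder of images, you could restrict only Google's image crawler while leaving the rest of the site open to all other bots:
User-agent: Googlebot-Image
Disallow: /example-photos/

User-agent: *
Disallow: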
How to use the Robots.txt and the sitemap?
You can use robots.txt to specify the location of your sitemap for search engines.
Here is a short example using the Sitemap directive:
User-agent: *
Allow: /
Sitemap: https://www.sitename.com/sitemap_index.xml
Note that the Sitemap directive is not tied to a specific user agent and does not need to be repeated for each one, so it is best to include it at the beginning or end of your robots.txt file.
You can include as many sitemaps as you want.
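For instance, several sitemaps can simply be listed one after the other (hypothetical sitemap URLs):
Sitemap: https://www.sitename.com/sitemap_pages.xml
Sitemap: https://www.sitename.com/sitemap_posts.xml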
What are the main directives of robots.txt?
Here are the main directives for allowing or disallowing Google's crawlers.
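The most common directives are User-agent, Disallow, Allow and Sitemap (some engines also honor Crawl-delay, which Google ignores). A short sketch combining them, with illustrative paths:
User-agent: *
Disallow: /example-private/
Allow: /example-private/public-page.html
Sitemap: https://www.sitename.com/sitemap.xml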
Sample robots.txt file
Here are some examples of robots.txt files configured for the site www.sitename.com.
Robots.txt URL: www.sitename.com/robots.txt
Block all bots from any content
User-agent: *
Disallow: /
Using this syntax tells all spiders not to crawl the pages of the site, including the home page.
Allow all spiders to access all content
User-agent: *
Disallow:
Using this syntax tells bots to crawl all pages on the site, including the home page.
Blocking a specific crawler from a specific folder
User-agent: Googlebot
Disallow: /subfolder/
This tells only Google's crawler (user agent Googlebot) not to crawl pages whose URL contains the string www.sitename.com/subfolder/.
Blocking a specific crawler from a specific web page
User-agent: Bingbot
Disallow: /subfolder/page.html
This tells only Bing's crawler to avoid crawling the specific page at www.sitename.com/subfolder/page.html.
Creating a typical robots.txt file
In a typical scenario, your robots.txt file should have the following content:
User-agent: *
Allow: /
Sitemap: https://www.sitename.com/sitemap.xml
How to check the presence of robots.txt on a website?
If your website already has one, it will be accessible at:
https://www.sitename.ext/robots.txt
How to check your robots.txt file for errors?
You can use Google Search Console to check your robots.txt file and sitemap, or tools like XML Sitemap Validator:
https://www.xml-sitemaps.com/validate-xml-sitemap.html
https://support.google.com/webmasters/answer/7451001?hl=en
Where to place your file on your site?
It is necessary to place your robots.txt files in the root directories of the domains or subdomains to which they apply.
For example, to control crawl behavior on sitename.com, it must be accessible at sitename.com/robots.txt.
If you want to control crawling on a subdomain such as blog.sitename.com, the file should be accessible at blog.sitename.com/robots.txt.
What are the best practices for Robots.txt for SEO?
Here are some tips and good practices to follow to better manage crawl restrictions and the SEO of your website.
- Do not block CSS or JS folders. During the crawling and indexing process, Google renders a website the way a real user sees it. If your pages need JS and CSS to work properly, they should not be blocked.
- Links on pages blocked by robots.txt will not be followed. Use a different blocking mechanism if the links need to be followed.
- Do not use it to prevent sensitive data from being indexed or accessed. If you want to keep a page or directory out of search results, use a different method, such as password protection or the noindex meta directive.
- Test it out and make sure you're not blocking any part of your website that you want to appear in search engines.
- On a WordPress site, it is not necessary to block access to your wp-admin and wp-includes folders; WordPress already does a great job with the robots meta tag.
- There is no need to specify different rules for each search engine; it can be confusing and hard to keep up to date. It is better to use User-agent: * and provide one set of rules for all bots.
- If you modify it and want the update to be taken into account faster, you can submit the URL of the modified file to Google.