Indexing refers to the process by which search engines crawl the internet to discover web pages and store this information in an organized database called an index.
Google discovers new web pages by crawling the web and then adds them to its index. To do this, it uses an indexing robot called Googlebot.
To better understand what natural referencing (SEO) is, it is important for webmasters to know how search engines work and what happens between the moment content is put online and the moment it is displayed in Google results.
Here are some questions very often asked by our SEO clients:
- What is indexing on Google?
- How long does it take to index a site on Google?
- My site is indexed on Google, so why do I have no traffic?
- What is Googlebot?
How do search engines work?
Search engines work by crawling hundreds of billions of pages using their own web crawlers. These web crawlers are commonly referred to as bots or spiders. A search engine navigates the web by downloading pages and following their links to discover new ones.
They have three main functions:
- Crawling: the first stage of the work. The engine browses the Internet in search of content, reading the code and content of each URL it finds (site pages, images, videos, PDFs, etc.).
- Indexing: the engine stores and organizes the content found during crawling on its servers. Once a page is in Google's index, it can be displayed for relevant queries made by Internet users.
- Ranking: the last step. The engine presents in the search results the content that best answers the user's query, ranked in order of relevance according to a series of specific rules and algorithms.
What is Google's goal?
Google's goal is to provide its users with the best possible results in terms of relevance and speed. Hundreds of billions of pages are stored on its servers. Thanks to its algorithms, updated several hundred times a year, Google tries to offer the most relevant results according to the search intentions of Internet users.
To offer the best results, it sets aside duplicate content, content deemed uninteresting, and sites that abuse techniques to manipulate search results (spam).
How does Google work in particular?
Web exploration by crawlers and Googlebot
Google's crawlers, or spiders, known as "Googlebot", roam the web, scanning web pages (billions of documents) and following their hyperlinks in order to store this data in one or more indexes.
This process continues until the crawler has found, analyzed and indexed as much of the web's visible content as possible.
The best way for Google to find your site, and to return to it, is through links from other sites that point to yours (backlinks), which its crawlers detect and follow.
The engines see and analyze each web page independently. A website is simply a collection of web pages linked together, using hyperlinks.
The internet and its network of sites are built on links and on following them.
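The link-following behavior described above can be sketched in a few lines. This is only a minimal illustration, not how Googlebot is actually implemented: the `crawl` function, the `LinkExtractor` class and the in-memory `SITE` dictionary (which stands in for real HTTP requests) are all hypothetical names chosen for the example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch):
    """Breadth-first crawl: fetch a page, extract its links, queue unseen ones.

    `fetch` is any callable returning the HTML of a URL, or None if unavailable.
    Returns the URLs in the order they were discovered.
    """
    seen = [start_url]
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.append(absolute)
                queue.append(absolute)
    return seen


# Hypothetical three-page site, plus one orphan page that nothing links to.
SITE = {
    "https://example.com/": '<a href="/about">About</a> <a href="/blog">Blog</a>',
    "https://example.com/about": '<a href="/">Home</a>',
    "https://example.com/blog": '<a href="/about#team">Team</a>',
    "https://example.com/orphan": "<p>No inbound links</p>",
}

discovered = crawl("https://example.com/", SITE.get)
print(discovered)
```

In this sketch the orphan page is never discovered, because no crawled page links to it. That is the practical reason why internal links and backlinks matter for discovery.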
Content indexing in the Google index and its data centers
Once a web page has been crawled, Google analyzes its code and stores it in huge data centers (the Google indexes), ensuring that the data can be presented quickly to Internet users.
Google assigns a unique identifier to each web page and indexes its content to precisely identify the elements that compose it.
This huge database contains all the content that Google has discovered and that it deems relevant enough to offer to Internet users.
Google maintains an additional index, used to store suspected spam sites, sites with duplicate content, and those that are difficult to scan (size issues or structure errors).
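To give an idea of what "indexing content" can mean in practice, here is a toy inverted index, a classic structure that maps each word to the pages containing it. It is an assumption made for illustration: the article does not describe Google's actual storage format, and the `tokenize`, `build_index` and `lookup` functions and the sample `PAGES` are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to the set of page identifiers that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in tokenize(text):
            index[word].add(page_id)
    return index


def lookup(index, query):
    """Return the ids of pages containing every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Each page gets a unique identifier, as described above.
PAGES = {
    1: "Googlebot crawls the web",
    2: "The index stores crawled pages",
    3: "Googlebot adds pages to the index",
}
index = build_index(PAGES)
print(lookup(index, "Googlebot index"))
```

Because the lookup goes from words to pages, answering a query does not require rereading every stored document, which is what makes fast retrieval possible.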
Ranking in Google results
The algorithms aim to present a relevant set of high-quality search results that answer the user's query or question as quickly as possible.
When a user enters a query into a search engine, all pages deemed relevant are identified in the index, and an algorithm orders them into a ranked set of results.
The algorithms used to rank the most relevant results are different for each engine. A page that ranks in a specific place for a search query on Google may not rank the same for the same query on Bing.
To assign relevance and importance, search engines use complex algorithms designed to consider hundreds of signals that determine the relevance and popularity of a web page.
- Relevance: how well the content of a page corresponds to the user's search intention (the intention is what searchers seek to accomplish with their search, which is not an easy thing for the engines, or for SEOs, to understand).
- Popularity: the popularity and authority of a domain are determined by many factors, including the quality and quantity of existing inbound links.
In addition to the query, engines use other relevant data to return results:
- Location: Some queries depend on location and geolocation.
- Detected language: They return content in the user's language.
- Previous search history: They return different results for a query depending on the user's browsing history.
- Device: A different set of results may be returned depending on the device (PC, mobile, tablet) from which the query was made.
To deliver results to the user, search engines must perform several critical steps:
- Interpretation of the intent of the user's request.
- Identification of the pages in the index associated with the query.
- Display of the result and ranking in order of relevance and popularity.
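The three steps above can be illustrated with a deliberately simplistic scoring function. Real engines weigh hundreds of signals. This sketch assumes just two, word overlap for relevance and a dampened inbound-link count for popularity, and the way they are combined is arbitrary. All names and sample pages are hypothetical.

```python
import math


def relevance(query, text):
    """Fraction of query words that appear in the page text."""
    query_words = set(query.lower().split())
    page_words = set(text.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & page_words) / len(query_words)


def popularity(inbound_links):
    """Dampened link count, so 1000 links is not 1000 times better than 1."""
    return math.log1p(inbound_links)


def rank(query, pages):
    """Return the URLs of matching pages, best score first."""
    scored = []
    for url, (text, inbound_links) in pages.items():
        rel = relevance(query, text)
        if rel == 0:
            continue  # not relevant at all: leave it out of the results
        scored.append((rel * (1 + popularity(inbound_links)), url))
    return [url for _score, url in sorted(scored, reverse=True)]


# Hypothetical pages: (text, number of inbound links).
PAGES = {
    "/guide": ("how google indexing works", 40),
    "/news": ("google indexing news", 3),
    "/recipes": ("chocolate cake recipe", 500),
}

print(rank("google indexing", PAGES))
```

Note that the very popular recipe page never appears for this query: in this sketch popularity only reorders pages that are relevant in the first place.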
The crawl budget
Google has to crawl billions of new and updated pages. So as not to waste resources, it assigns each site a crawl budget, which determines the number of pages it will crawl each day. By setting crawl priorities, optimizing the crawl budget and preventing Googlebot from crawling unnecessary pages, you concentrate the engine's resources on the most important content of a website.
SEO-oriented log analysis helps you better understand how Googlebot behaves, and which errors it encounters, when it crawls the site on the server.
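As a rough idea of what such a log analysis involves, the sketch below extracts Googlebot requests from access-log lines and counts the status codes returned. It assumes the common "combined" log format used by Apache and Nginx. The sample lines are made up for the example, and your server's format may differ.

```python
import re
from collections import Counter

# Matches the request path, status code and user agent of a "combined" log line.
LOG_PATTERN = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_hits(lines):
    """Yield (path, status) for each request whose user agent mentions Googlebot.

    The user agent string can be spoofed, so this is only a first filter.
    """
    for line in lines:
        match = LOG_PATTERN.search(line)
        if match and "Googlebot" in match.group("agent"):
            yield match.group("path"), int(match.group("status"))


GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

# Illustrative log lines, not taken from a real server.
LOG_LINES = [
    f'66.249.66.1 - - [10/Jan/2024:10:00:00 +0000] "GET / HTTP/1.1" 200 5120 "-" "{GOOGLEBOT_UA}"',
    f'66.249.66.1 - - [10/Jan/2024:10:00:05 +0000] "GET /old-page HTTP/1.1" 404 312 "-" "{GOOGLEBOT_UA}"',
    '203.0.113.7 - - [10/Jan/2024:10:00:09 +0000] "GET /blog HTTP/1.1" 200 8300 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
]

hits = list(googlebot_hits(LOG_LINES))
status_counts = Counter(status for _path, status in hits)
error_paths = [path for path, status in hits if status >= 400]

print(status_counts)
print(error_paths)
```

The list of paths that returned errors to the crawler is usually the first thing to look at, since every such request spends crawl budget without adding anything to the index.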
Why might a page not be indexed by Google?
There are a number of circumstances in which a URL or parts of the site will not be indexed:
- The robots.txt file tells the engine what should or should not be crawled by its crawlers.
- Noindex tags ask the engine not to index the page.
- The page declares a different page as its canonical URL.
- The content is not considered high quality by the robots: it is duplicated or plagiarized, or too thin.
- The page returned an error when the robot visited (for example a 404 "page not found" error or a 5xx server error).
- The page is orphaned: no link points to it, so it cannot be found.
- The server is unreachable.
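The first two causes in this list are easy to check yourself. The sketch below uses Python's standard `urllib.robotparser` to test a URL against robots.txt rules, and a small HTML parser to detect a robots meta tag containing `noindex`. The robots.txt content, the URLs and the helper names `is_crawl_allowed` and `has_noindex` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


def is_crawl_allowed(robots_txt, url, user_agent="Googlebot"):
    """Return True if the robots.txt rules allow `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class RobotsMetaParser(HTMLParser):
    """Sets `noindex` to True if a <meta name="robots"> tag contains "noindex"."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            if "noindex" in (attributes.get("content") or "").lower():
                self.noindex = True


def has_noindex(html):
    """Return True if the page asks engines not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.noindex


# Hypothetical robots.txt that blocks one directory for all crawlers.
ROBOTS_TXT = "User-agent: *\nDisallow: /private/\n"

print(is_crawl_allowed(ROBOTS_TXT, "https://example.com/blog/"))
print(is_crawl_allowed(ROBOTS_TXT, "https://example.com/private/report.html"))
print(has_noindex('<head><meta name="robots" content="noindex, follow"></head>'))
```

This sketch only covers the meta tag form of noindex. The same directive can also be sent in an HTTP response header, which this code does not inspect.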
How to get a website indexed by Google?
Google can index a new page in different ways, depending on the method used to discover it.
There are many ways to make a new page known to Google:
- Googlebot discovers it on your site via internal links.
- The page is submitted via a sitemap.
- An indexing request is made via the Search Console webmaster tool.
- The page receives a link from another site.
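For the sitemap route, the file itself is simple XML. Below is a minimal sketch that builds one with Python's standard library, following the sitemaps.org schema. The `build_sitemap` helper and the example URLs are hypothetical, and a real sitemap may also carry optional fields such as `<lastmod>`.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return sitemap XML (as a string) listing each URL in a <loc> element."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


sitemap = build_sitemap([
    "https://example.com/",
    "https://example.com/new-page",
])
print(sitemap)
```

The resulting file is typically saved as `sitemap.xml` at the root of the site and then declared in Search Console or in robots.txt.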
How long does it take for a site to be indexed on Google?
Indexing times can vary greatly depending on the popularity of your site, the method used to submit the new page to the search engine, its position on your site (number of clicks from the home page), and the priority Google assigns to it.
The delay can range from 30 minutes to several days. However, do not confuse the indexing delay with the ranking delay, which is much longer, depends on your SEO efforts, and is not guaranteed.
Google's sandbox effect, myth or reality?
Never confirmed by Google, the Google Sandbox effect is a rumored filter that is supposed to act on new websites. If a website is placed in the Google Sandbox, its ranking starts to suffer.
Its most important keywords and key phrases start to drop in the rankings. Even a site with lots of inbound links, good positions on Google, or great content may be affected by the Sandbox effect.
The main purpose of the sandbox effect would be to prevent spammy sites from appearing, and from reappearing when the process is repeated.
How to check if a website is indexed in Google?
Go to Google and run a search restricted to your site.
The number displayed (here 86) indicates approximately the number of pages of the site indexed by Google.
If you want to check the status of a particular URL, search Google for that exact URL.
No results will appear if the page is not indexed.
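The exact searches are not shown in the text above. A common way to run both checks is Google's `site:` operator, and the sketch below assumes that is what was meant. It builds the search URL for a whole domain and for a single page. The `site_query_url` helper and the example addresses are hypothetical.

```python
from urllib.parse import urlencode


def site_query_url(target):
    """Build a Google search URL for `site:<target>` (a domain or a full URL)."""
    return "https://www.google.com/search?" + urlencode({"q": f"site:{target}"})


# Whole site: the result count approximates the number of indexed pages.
print(site_query_url("example.com"))

# Single page: no result suggests the page is not indexed.
print(site_query_url("example.com/blog/my-post"))
```

The counts returned this way are only estimates. The Search Console gives a more reliable view of what is indexed for a site you own.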
How do I remove a page from Google's index?
A Search Console feature lets you request the removal of an obsolete page.
Simply open this feature and enter the URL that has been deleted from your server so that it is de-indexed.
The operation generally takes between 24 and 48 hours.
If you would like an audit of your site and advice on how to improve the crawl, do not hesitate to contact us.