Blocking pages from being indexed by search engines is a process in which the site owner intentionally restricts search robots' access to certain pages. This means that blocked pages will not appear in search results, even if they physically exist on the site.
This practice matters for many web resources because it lets you control which pages are visible to users in search results and which are not. Blocking unnecessary pages helps focus the attention of search engines and users on genuinely important content, avoid duplication, and optimize website indexing.
There are several main ways to close pages from indexing:
Using the rel="canonical" attribute
Closing through the robots.txt file
Using the robots meta tag with the noindex and nofollow parameters
and other methods that we will discuss in more detail below.
Proper use of these tools allows you to control how search engines perceive and rank your site, which ultimately affects its visibility and organic traffic. That's why understanding the principles of how to close pages from indexing is an important skill for anyone involved in SEO and web development.
Why do you need to block pages from being indexed by search engines?
There are several good reasons why webmasters and SEO specialists resort to closing certain pages of a website from being indexed by search engines:
Removing unwanted pages from the search engine index. If your site has pages that offer no value to users and can lower search engines' assessment of the site's quality, it is better to block them from indexing. These can be, for example, duplicate pages, temporary promotional pages left over after a promotion ends, or pages with thin content.
Managing the number of indexed pages. A large number of low-quality or irrelevant pages in the index can dilute the overall relevance of the site. By blocking such pages, you signal search robots to focus on the really important sections of the site. For example, on the website of an online store with hundreds of thousands of products, it makes sense to exclude pages of discontinued products from indexing.
Saving the crawling budget. Each website has a certain limit of pages that a search robot can crawl in one visit. Spending this limit on crawling unnecessary pages is unproductive. Closing them from indexing allows you to direct the crawling budget to priority sections. For example, by hiding service pages with duplicate filter settings, you free up the crawler's resources to crawl the main pages.
Hiding technical and service pages. Many websites have pages that are intended for internal use, such as test pages, admin pages, and online shopping carts. The appearance of such pages in the search results is undesirable, so they are blocked from indexing.
Protecting content from copying. Blocking pages containing unique and valuable content from being indexed makes it harder for competitors to copy it. They simply won't be able to find these pages through search engines. But it is important to understand that this method cannot fully protect the content.
Hiding private information. If your site contains pages with personal data of users, internal company documentation, or other information that should not be publicly available, they must be hidden from indexing.
Using the rel="canonical" attribute
The rel="canonical" attribute is a special HTML link attribute that points search engines to the main (canonical) version of a page. It is used when a website has several pages with similar content to avoid duplicate content issues.
Here's how rel="canonical" works to exclude pages from indexing:
You don't need to add anything to the main page that you want to be indexed.
On all secondary pages with similar content, you need to place a link to the main page with the rel="canonical" attribute. For example: <link rel="canonical" href="https://site.com/main-page" />.
If search engines find such a link, they will realize that this page is not the main page and either not index it or exclude it from the search results in favor of the canonical page.
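As an illustration, the <head> of a secondary page might look like this (the URL is hypothetical):

```html
<!-- Secondary page, e.g. a filtered or parameterized copy of the main page -->
<head>
  <title>Main page - sorted view</title>
  <!-- Points search engines to the canonical version of this content -->
  <link rel="canonical" href="https://site.com/main-page" />
</head>
```

The canonical URL should be absolute and point to an existing, indexable page; a broken or redirecting canonical target can cause the problems described below.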
Benefits of using rel="canonical"
Helps to avoid duplicate content and related problems (pages competing with each other in search results, lower rankings).
Consolidates link equity and other ranking signals on the canonical page.
Easy to implement - just add one line of code to the required pages.
It does not prohibit the indexing of the page completely, but only points to the main version. If for some reason the main page is unavailable, the search engine can index and display an alternative one.
Disadvantages of using rel="canonical"
Search engines perceive rel="canonical" as a signal, not a directive. That is, they can index secondary pages despite the presence of this attribute if they deem it necessary.
If the pages have significant differences in content, using rel="canonical" can lead to a loss of traffic. Search engines will consider these pages to be the same and show only the main page, even if the user's query is more relevant to one of the secondary pages.
Errors in specifying the address of the canonical page (for example, if you specify a non-existent URL) can lead to the page being excluded from the index or other problems.
Therefore, rel="canonical" should be used in cases where the pages really duplicate each other, for example, on sites with session IDs in the URL or to merge pages with and without WWW. But to close unique pages, it is better to use other methods.
Closing pages through the robots.txt file
A robots.txt file is a special text file in the root directory of a website that contains instructions for search engine crawlers. It can be used to block individual pages or entire sections of a website from being indexed.
Here's how it works:
In the robots.txt file, special directives are used to specify the URLs of pages or folders that you want to exclude from indexing.
Search engine crawlers access the robots.txt file and read these instructions before they start crawling the site.
If a robot finds a directive in the file that prohibits access to a particular page or section, it will not crawl or index it.
Syntax and examples of using robots.txt
Here is the basic syntax of the robots.txt file:
User-agent: [robot name or *]
Disallow: [page or folder URL]
The User-agent directive specifies to which robot the following instructions apply. The value * means all robots.
The Disallow directive specifies which URLs are prohibited from crawling:
User-agent: *
Disallow: /private/
Disallow: /temp-page.html
These instructions deny all robots access to the /private/ folder and the /temp-page.html page.
You can specify more complex URL patterns, for example, using the special character *:
User-agent: *
Disallow: /*?*sort=
This will prohibit crawling of all pages with the sort= parameter in the URL, for example /category/shoes?sort=price.
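Putting these directives together, a complete robots.txt might look like this (all paths are illustrative):

```txt
# Applies to all crawlers
User-agent: *
Disallow: /private/          # block the whole folder
Disallow: /temp-page.html    # block a single page
Disallow: /*?*sort=          # block any URL with a sort= parameter

# Optional: tell crawlers where the sitemap lives
Sitemap: https://site.com/sitemap.xml
```

The file must be served as plain text at the site root, e.g. https://site.com/robots.txt, or crawlers will not find it.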
Advantages of closing through robots.txt
Easy to implement - just create a text file and add a few lines.
Works for all major search engines.
Allows you to close entire sections of the site (folders) from indexing with a single directive.
Does not require any changes in the page code.
Disadvantages of closing through robots.txt
The robots.txt file is public, so everyone can see which pages you are trying to hide. It is not suitable for confidential information.
Some unscrupulous robots may ignore the file's instructions and crawl the blocked pages anyway.
If a page has already been indexed before adding it to robots.txt, it will not disappear from the index instantly. Search engines gradually update their databases by removing pages that are no longer accessible.
If there are links to closed pages on other websites or social networks, they can still pass weight to these pages, even if they are not indexed.
Therefore, robots.txt is best suited for technical pages (for example, filter or sorting pages in online stores) that do not have any independent value for users. For important content, it is better to use other methods of closure.
Robots meta tag with noindex and nofollow parameters
The <meta name="robots" content="..."> meta tag is placed in the <head> section of an HTML page and contains instructions for search engine crawlers. It can be used to close a particular page from indexing and prohibit links from being followed from it.
The most important values for closing pages are those of the content attribute:
noindex - disables indexing of the current page. The robot can crawl this page, but will not add it to the index.
nofollow - prohibits following links from the current page. The page can still be indexed, but the weight will not be transferred through links from it.
These values can be combined:
<meta name="robots" content="noindex, nofollow"> - will prohibit both indexing the page and following its links.
<meta name="robots" content="noindex, follow"> - prohibits indexing but allows the robot to follow links. Useful when the page itself is not important but contains links to valuable content.
There are other possible values (for example, noarchive, nosnippet), but they are not directly related to blocking indexing.
Instead of the general robots tag, you can use more specific tags for individual search engines:
<meta name="googlebot" content="..."> - instructions for Google only.
This allows you to fine-tune the visibility of a page for different search engines.
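For example, a page could carry a general instruction for all robots plus a stricter, Google-specific one; Googlebot follows the more specific tag (the combination here is illustrative):

```html
<head>
  <!-- Default instruction for all crawlers: do not index, but follow links -->
  <meta name="robots" content="noindex, follow">
  <!-- Google-specific instruction: do not index and do not follow links -->
  <meta name="googlebot" content="noindex, nofollow">
</head>
```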
Advantages of closing pages with meta tags
Gives you finer control than robots.txt - you can close individual pages, not just sections.
Allows you to control not only indexing, but also the passage of link weight (via nofollow).
Instructions in meta tags are not public, unlike robots.txt.
You can give different instructions to different search engines through specific meta tags.
Disadvantages of using meta tags
You need to add tags to each page separately, which can be time-consuming for large sites. This can be partially automated at the CMS or page template level.
As in the case of the robots.txt file, if a page has already been indexed before adding meta tags, it will remain in the index until the next crawl by a search robot.
The noindex and nofollow tags are perceived by search engines as signals, not unconditional directives. In rare cases, a page can be indexed even if it has a noindex tag.
In general, meta tags are a powerful and flexible tool for managing indexing, but you need to use them carefully. If you accidentally put noindex on an important page, you can lose a lot of traffic.
Additional ways to hide pages from indexing
In addition to using the rel="canonical" attribute, the robots.txt file, and robots meta tags, there are several less common but noteworthy methods of blocking pages from indexing.
X-Robots-Tag parameter in HTTP response headers
The X-Robots-Tag is a special HTTP response header that contains instructions for search engine crawlers similar to the robots meta tag.
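For instance, a server response for a PDF file (which has no HTML <head> to hold a meta tag) might carry the header like this sketch:

```http
HTTP/1.1 200 OK
Content-Type: application/pdf
X-Robots-Tag: noindex, nofollow
```

This is the header's main practical advantage: it works for non-HTML resources such as PDFs and images, where the robots meta tag cannot be used.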
Advantages:
Allows you to set instructions for robots at the server level, without changing the HTML code of pages.
You can set different values for different pages or sections of the site using server configuration templates.
Disadvantages:
Requires access to server configuration and higher technical skills than working with HTML.
It is not supported by all search engines.
Preventing indexing through the .htaccess file
The .htaccess file allows you to manage the settings of the Apache web server for each directory. It can be used to prevent individual pages or sections of a website from being indexed.
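A minimal .htaccess sketch, assuming Apache with mod_headers enabled, could attach the X-Robots-Tag header to everything served from a directory (the directive values are illustrative):

```apacheconf
# Send a noindex instruction for every file in this directory
<IfModule mod_headers.c>
  Header set X-Robots-Tag "noindex, nofollow"
</IfModule>
```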
Advantages:
Does not require changing the HTML code of pages.
Allows you to flexibly customize indexing rules for different sections of the site.
Works at the server level, so it's harder to ignore than robots.txt.
Disadvantages:
Only works on Apache servers. Other web servers (Nginx, IIS) use different configuration mechanisms.
An incorrectly written rule in .htaccess can make the entire site inaccessible.
Blocking pages from indexing using Noindex in the Sitemap
A sitemap is a special XML file that contains a list of all the pages on a website that need to be indexed. By using the special noindex attribute for the URL in the Sitemap, you can tell search engine crawlers which pages do not need to be indexed.
Advantages:
Allows you to manage indexing at the Sitemap level, without changing individual pages.
You can generate a Sitemap automatically by marking unnecessary pages with the noindex attribute.
Disadvantages:
You need to regularly update your Sitemap and keep it up to date.
The noindex attribute in the Sitemap is supported only by Yandex, Google ignores it.
Require authorization to access pages
If a page can only be accessed after a user has logged in, search crawlers will not be able to index its content because they don't have an account.
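One common setup is HTTP Basic authentication; on Apache it can be configured in .htaccess (the realm name and file path here are hypothetical):

```apacheconf
# Require a login before serving anything from this directory
AuthType Basic
AuthName "Members only"
AuthUserFile /etc/apache2/.htpasswd
Require valid-user
```

Crawlers receive a 401 Unauthorized response and therefore cannot read or index the content.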
Advantages:
Reliably hides content from unauthorized users and search robots.
Does not require additional server-side configuration or the addition of special tags.
Disadvantages:
Suitable only for content that really should be available only to authorized users.
If there are links to closed pages on public parts of the site, search engines may try to follow them and encounter an authorization error, which is undesirable.
Thus, the choice of method for blocking pages from indexing depends on the specifics of a particular site, its technical capabilities, and the developers' skills. Sometimes it makes sense to combine several methods for reliability.