
Recently Google has been complaining about certain pages, saying:

Indexed, though blocked by robots.txt

I am confounded by this error. Yes, the page is blocked by robots.txt, and it always has been. Nothing new has happened, and I don't want it crawled or indexed. Why is Google indexing the page when I am explicitly telling it not to? I realize I can add a meta tag like <meta name="robots" content="noindex">, but why should this be necessary?
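For reference, the block is nothing exotic, just an ordinary Disallow rule in robots.txt along these lines (the path here is a placeholder, not the real one):

    User-agent: *
    Disallow: /example-private-page/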

You Old Fool

3 Answers


Google isn't crawling your page, but it is indexing the URL. It isn't indexing the content of the page, just the URL itself, possibly along with anchor text of links that point to it. Google says:

A robotted page can still be indexed if linked to from other sites: While Google won't crawl or index the content blocked by robots.txt, we might still find and index a disallowed URL if it is linked from other places on the web. As a result, the URL address and, potentially, other publicly available information such as anchor text in links to the page can still appear in Google search results. To properly prevent your URL from appearing in Google Search results, you should password-protect the files on your server or use the noindex meta tag or response header (or remove the page entirely).

The reason for this is that some important sites don't allow any crawling. One such site is (or was) the California DMV. It is important that users be able to search for the California DMV even if Google can't crawl the site. Google's Matt Cutts posted about this issue in 2006.

When Google indexes a page that is blocked by robots.txt, the search result usually shows just the page title and URL, with no description snippet.

If you don't want the page indexed at all, you have to let Google crawl it and use the <meta name="robots" content="noindex"> tag. Keep in mind that if the page is blocked by robots.txt, Google will never be able to see that tag and the URL will still be indexed.
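As a rough sketch (exact placement depends on your pages and server), the noindex can be delivered either in the page's HTML:

    <meta name="robots" content="noindex">

or as an equivalent HTTP response header, which also works for non-HTML files such as PDFs:

    X-Robots-Tag: noindex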

The other "experimental" option would be to use Noindex: rather than Disallow: in robots.txt. See How does “Noindex:” in robots.txt work? The only downside to this is that Google says they may stop supporting it at any point. Other search engines won't know what to do with that directive, so you would have to put it in a Google specific section of robots.txt. In 2019, Google announced that it no longer supports a noindex: directive in robots.txt.

Stephen Ostermiller

As per my analysis, you want to implement noindex and disallow for particular pages, categories, or tags.

Noindex: When you implement noindex for a page, that page is not shown in the search results, but robots can still crawl it.

Disallow: When you implement disallow for a file, page, or directory, those pages are not crawled by robots, but they can still appear in search results. If that is the case, first set noindex on those pages; then, after the site has been crawled and the pages have dropped out of the index, add the Disallow rule to the robots.txt file, as sketched below.
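A rough sketch of that order of operations, with a placeholder path:

Step 1, leave the page crawlable (no Disallow rule for it) and add the noindex tag to the page:

    <meta name="robots" content="noindex">

Step 2, once the page has been recrawled and dropped out of the index, block crawling in robots.txt:

    User-agent: *
    Disallow: /example-private-page/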

cstpl123

It is a common issue; it happens when pages that are linked internally or externally are blocked. You can remove those links, or you can wait for it to resolve automatically. Since, as you stated, those pages are already indexed, you need to implement the noindex tag and remove the Disallow rule from robots.txt.