URLs with `Noindex` in robots.txt are being indexed by Google

Question

In my robots.txt file (http://www.tutorvista.com/robots.txt), I'm using Noindex: /content/... to disallow indexing:

This should mean that http://www.tutorvista.com/content/ and anything below this URL shouldn't be indexed. But in the image of my search results below, you can see that pages under this URL are being indexed:

Additionally, I'm using Disallow: /biology/ which means that http://www.tutorvista.com/biology/ and anything below this shouldn't be crawled. But in the image of my search results, you can see that pages under this URL are being crawled and indexed.

So can anyone tell me what's wrong with my robots.txt directives?

score 7 · Answer 1 · answered Sep 01 '17 at 07:05

"noindex" directives should not be used in your robots.txt file. Instead, add a noindex meta tag to every page that you don't want indexed in Google.

A noindex tag looks like the one below, and it should be placed in the <head> section of any page you do not want indexed:

<meta name="robots" content="noindex">

More information can be found here.
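To confirm that a page actually carries the tag, a rough check along these lines may help. It is only a sketch: `has_noindex` and `RobotsMetaParser` are hypothetical names, the sample HTML strings are made up, and only the plain `<meta name="robots">` form shown above is handled.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Records whether a <meta name="robots"> tag lists noindex."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        content = (attr.get("content") or "").lower()
        if name == "robots":
            # content may hold several comma-separated values
            tokens = {token.strip() for token in content.split(",")}
            if "noindex" in tokens:
                self.noindex = True


def has_noindex(html):
    """Hypothetical helper: True if the HTML contains a robots noindex meta tag."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.noindex


# Made-up sample pages for illustration
tagged = '<html><head><meta name="robots" content="noindex"></head></html>'
untagged = "<html><head><title>Biology</title></head></html>"
print(has_noindex(tagged), has_noindex(untagged))
```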

In the second example, while you do have "Disallow: /biology/" in your robots.txt file, a few lines above it you also have "Allow: /biology/animations/", which is why this page is indexed in your example.
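As a rough illustration of how the two rules interact, here is a simplified sketch assuming the longest-match precedence that Google documents (the most specific matching rule wins, and Allow wins a tie). `is_allowed` is a hypothetical helper, the page names are made up, and wildcards are not handled.

```python
def is_allowed(path, rules):
    """rules is a list of (directive, prefix) pairs, e.g. ("disallow", "/biology/")."""
    best_len = -1
    allowed = True  # no matching rule means the path may be crawled
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        # The longer prefix wins; on equal length, Allow wins.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


# The two rules quoted above; the page names are hypothetical
rules = [("allow", "/biology/animations/"), ("disallow", "/biology/")]
print(is_allowed("/biology/animations/cell.html", rules))
print(is_allowed("/biology/cell.html", rules))
```

Under that assumption, anything below /biology/animations/ stays crawlable while the rest of /biology/ is blocked.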

Hope this helps!

score 6 · Answer 2 · answered Sep 01 '17 at 11:30

Note that Noindex is not part of the original robots.txt specification. Google supported it as an experimental feature (see: How does “Noindex:” in robots.txt work?), but it’s not clear whether that is still the case (they never documented it to begin with). But let’s assume it is.

Your robots.txt has two problems.

Empty lines

A record must not contain empty lines. Empty lines are used to separate records.

A conforming bot (which doesn’t identify as Googlebot-Image/Adsbot-Google/Mediapartners-Google) uses this record:

User-agent: *
Allow: /

So none of the following Disallow/Allow/Noindex lines apply.

Of course a bot may try to "fix" this and interpret the following lines to be part of this record (i.e., ignoring the blank lines), but the robots.txt spec doesn’t define this, so I wouldn’t count on it.
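To make the effect of the blank line concrete, here is a minimal sketch of record splitting as a strictly conforming parser might do it. `split_records` is a hypothetical helper, and the sample text is an abbreviated stand-in for the real file, not a copy of it.

```python
def split_records(robots_txt):
    """Split robots.txt text into records; blank lines separate records."""
    records, current = [], []
    for raw in robots_txt.splitlines():
        if not raw.strip():
            # A blank line closes the current record
            if current:
                records.append(current)
                current = []
            continue
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue  # comment-only line: not a record boundary
        field, _, value = line.partition(":")
        current.append((field.strip().lower(), value.strip()))
    if current:
        records.append(current)
    return records


# Abbreviated stand-in for the file in the question
sample = """User-agent: *
Allow: /

Disallow: /biology/
Noindex: /content/...
"""

for record in split_records(sample):
    print(record)
```

The second record contains no User-agent line, so a strict parser has no bot to attach those rules to.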

`...` in `Noindex` values

If Noindex works like Disallow (which we don’t know for sure, as Noindex is not specified/documented, but I guess it wouldn’t make sense to specify it differently), the `...` you appended to the values is taken literally: `...` must appear in the URLs you want to noindex.

The line

Noindex: /content/biology/...

would apply to a URL like /content/biology/.../foobar, but not to a URL like /content/biology/foobar or /content/biology/.

So if you want every URL whose path starts with /content/biology/ to be noindexed, you would have to specify:

Noindex: /content/biology/
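The difference can be checked with a plain prefix comparison. This is a sketch that assumes Noindex values are matched as literal path prefixes, the way Disallow values are; `rule_matches` is a hypothetical helper.

```python
def rule_matches(rule_path, url_path):
    """Assumed behavior: a rule applies when the URL path starts with the rule value."""
    return url_path.startswith(rule_path)


with_dots = "/content/biology/..."
without_dots = "/content/biology/"

for path in ("/content/biology/.../foobar", "/content/biology/foobar", "/content/biology/"):
    print(path, rule_matches(with_dots, path), rule_matches(without_dots, path))
```

Only the first path matches the value with the dots, while all three match the corrected value.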
