
I have a link to a PDF document on a public web page. How do I prevent search engines from indexing the link and the PDF document?

The only idea I have come up with is to use a CAPTCHA. However, I wonder whether there are any magic words that tell a search engine not to index the link and the PDF document. Options using PHP or JavaScript are also fine.

Just to be clear: I do not want to encrypt the PDF or protect it with a password. I just want to make it invisible to search engines, not to users.

unor

5 Answers


To prevent your PDF file (or any non-HTML file) from being listed in search results, the only way is to use the HTTP X-Robots-Tag response header, e.g.:

X-Robots-Tag: noindex

You can do this on Apache by adding the following snippet to the site's root .htaccess file or to httpd.conf:

<Files ~ "\.pdf$">
  Header set X-Robots-Tag "noindex, nofollow"
</Files>
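
Once this is in place, you can check that the header is actually being sent, for example with curl (the URL is hypothetical):

curl -I https://example.com/file.pdf

The response headers should include the X-Robots-Tag: noindex, nofollow line.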

Note that for the above to work, you must be able to modify the HTTP headers of the file in question. Thus you may not be able to do this, for example, on GitHub Pages.
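
If you are on a host where you cannot edit .htaccess or httpd.conf but can run scripts, another option (in line with the PHP suggestion in the question) is to serve the PDF through a script that sends the header itself. A minimal PHP sketch, assuming a hypothetical document.pdf sitting next to the script:

<?php
// Send the robots header before any output, then stream the PDF.
header('X-Robots-Tag: noindex, nofollow');
header('Content-Type: application/pdf');
readfile(__DIR__ . '/document.pdf'); // hypothetical file location

You would then link to this script instead of linking to the PDF directly, and avoid exposing the direct URL.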

Also note that robots.txt does not prevent your page from being listed in search results.

What it does is stop the bot from crawling your page, but if a third party links to your PDF file from their website, your page will still be listed.

If you stop the bot from crawling your page using robots.txt, it will never get the chance to see the X-Robots-Tag: noindex response header. Therefore, never ever ever disallow a page in robots.txt if you employ the X-Robots-Tag header. More info can be found at Google Developers: Robots Meta Tag.

Pacerier

There are multiple ways to do this (though note that combining the robots.txt approach with the X-Robots-Tag header can backfire, as the previous answer explains):

1) Use robots.txt to block the files from search engine crawlers:

User-agent: *
Disallow: /pdfs/ # Block the /pdfs/ directory.
Disallow: *.pdf  # Block pdf files. Non-standard but works for major search engines.

2) Use rel="nofollow" on links to those PDFs:

<a href="something.pdf" rel="nofollow">Download PDF</a>

3) Use the X-Robots-Tag: noindex HTTP header to prevent crawlers from indexing them. Place this code in your .htaccess file:

<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex"
</FilesMatch>
Zistoloen
John Conde

If your nginx-powered development instances are showing up in Google search results, there is a quick and easy way to prevent search engines from crawling your site. Add the following line to the location block of the virtual host configuration file for the site you want to keep from being crawled:

add_header  X-Robots-Tag "noindex, nofollow, nosnippet, noarchive";
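
To apply the header only to PDF files rather than the whole site, a location block along these lines should work (a sketch, not part of the original answer):

location ~* \.pdf$ {
    add_header X-Robots-Tag "noindex, nofollow, nosnippet, noarchive";
}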
James M

Not sure whether this will still be of value to anyone, but we recently encountered a problem where our on-premises GSA (Google Search Appliance) box was unwilling to index a PDF file.

Google Support looked into the issue, and their response was that it was caused by the PDF document having a custom property set (File -> Document Properties -> Custom tab):

name: robots
value: noindex

which prevented it from being properly indexed by GSA.

If you have access to the document and can modify its properties, this might work ... at least for GSA.

ChiTec

You can use a robots.txt file (a minimal example is sketched below). You can read more here.
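
A minimal robots.txt entry for a single PDF might look like this (the path is hypothetical):

User-agent: *
Disallow: /files/document.pdf

Keep in mind, as the first answer above explains, that this only stops crawling; it does not by itself keep the URL out of search results.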

Zistoloen
enoyhs