Three weeks ago I added this to my robots.txt to disallow pagination so it would not appear in the Google index:
User-Agent: *
Disallow: /blog/1
Disallow: /blog/2
Disallow: /blog/3
Disallow: /blog/4
Disallow: /blog/5
Disallow: /blog/6
Disallow: /blog/7
Disallow: /blog/8
Disallow: /blog/9
When Google was still allowed to crawl these pagination pages, they showed up in results like this when I searched for a keyword from my blog:
With the robots.txt lines above, I intend to disallow only the pagination pages, not the articles. The crawler should still be able to index the articles.
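To sanity-check which paths these rules actually match, here is a minimal Python sketch using the standard library's urllib.robotparser, which applies the same prefix-style matching as the original robots.txt rules (no wildcards involved here). The example.com URLs and slugs are placeholders, not my real posts:

import urllib.robotparser

# The rules from my robots.txt above.
ROBOTS_TXT = """\
User-Agent: *
Disallow: /blog/1
Disallow: /blog/2
Disallow: /blog/3
Disallow: /blog/4
Disallow: /blog/5
Disallow: /blog/6
Disallow: /blog/7
Disallow: /blog/8
Disallow: /blog/9
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Placeholder URLs -- not my real slugs.
urls = [
    "https://example.com/blog/2",                # pagination page I want blocked
    "https://example.com/blog/some-article",     # article that should stay crawlable
    "https://example.com/blog/10-tips-for-seo",  # article slug that starts with a digit
]

for url in urls:
    status = "allowed" if parser.can_fetch("Googlebot", url) else "disallowed"
    print(url, "->", status)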
Today Ahrefs reported around 173K uncrawled pages. Is this caused by my new robots.txt disallow rules? I suspect it is either that, or that things still need time to settle after the migration.
There is also this message:
The website wasn't fully crawled
The crawl has reached the maximum depth level from the seed, and the website has not been crawled completely.
To crawl deeper levels of your site, increase the "Max depth level from seed" in the project settings and start a new crawl.
You might also want to check why some of the URLs on your website are considerably distant from the seed.
Can someone explain this to me? I'm confused, since this is the first time I've run into this problem.
