Fix the “Indexed, though blocked by robots.txt” problem
“Indexed, though blocked by robots.txt” is a status in Google Search Console. It means that Google indexed your URL even though it could not crawl it.
This condition indicates a serious SEO problem that you must address immediately.
What does indexing have to do with robots.txt?
The status “Indexed, though blocked by robots.txt” can be confusing, because of a common misconception that robots.txt directives can be used to control indexing – they cannot.
This status means that Google has indexed the page even though you have blocked Googlebot from accessing it, whether intentionally or by mistake.
Let me help you understand the relationship between robots.txt and the indexing process. It will make understanding the final solution easier.
How does discovery, crawling, and indexing work?
Before a page can be indexed, search engine crawlers must discover and crawl it first.
In the discovery phase, the crawler learns that a particular URL exists. During crawling, Googlebot visits that URL and collects information about its content. Only then is the URL added to the index, where it can appear among the search results.
What is a robots.txt file?
You can prevent certain URLs from being crawled using a robots.txt file. It’s a file that you can use to control how Googlebot crawls your website. When you put a Disallow command in it, Googlebot knows it can’t visit pages that this directive applies to.
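For example, a robots.txt rule like the following (the `/drafts/` path is just a hypothetical example) tells all crawlers not to visit anything under that directory:

```
User-agent: *
Disallow: /drafts/
```

The `User-agent: *` line means the rule applies to every crawler, including Googlebot.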
But robots.txt does not control indexing.
Let’s explore what happens when Google gets mixed signals from your website, and indexing gets messy.
Reason for “Indexed, though blocked by robots.txt”
Sometimes Google decides to index a discovered page even though it can’t crawl it and understand its content.
In this scenario, Google is usually motivated by the number of links pointing to the page blocked by robots.txt.
Links translate into a PageRank score, which Google calculates to assess whether a particular page is important. The PageRank algorithm takes both internal and external links into account.
When your links are messy and Google sees a disallowed page as having a high PageRank value, it may think that the page is important enough to put it in the index.
However, the index will only store the bare URL with no content information, because the content was never crawled.
Why is “Indexed, though blocked by robots.txt” bad for SEO?
The status “Indexed, though blocked by robots.txt” is a serious problem. It may sound relatively benign, but it can sabotage your SEO in two important ways.
Poor search appearance
If you blocked a certain page by mistake, the fact that it got indexed anyway doesn’t mean you’re in luck and Google has corrected your error for you.
Pages that are indexed without crawling will not look attractive when displayed in search results. Google will not be able to display:
- title tag (alternatively, it will automatically generate a title from the URL or information provided by pages that link to your page),
- meta description,
- any additional information in the form of rich results.
Without these elements, users will not know what to expect from the page and may choose competing websites instead, which significantly lowers your CTR.
Here’s an example – one of Google’s own products:
Google Jamboard is blocked from crawling, but with nearly 20,000 links from other websites (according to Ahrefs), Google is still indexing it.
While the page is indexed, it is displayed without any additional information. That’s because Google couldn’t crawl it and gather any information to display. It only shows the URL and the canonical address, based on what Google found on other websites that link to Jamboard.
If you intentionally used the robots.txt Disallow command for a page, you don’t want users to find that page on Google. Suppose, for example, that you are still working on the content of this page, and it is not ready for public viewing.
But if the page is indexed, users may be able to find it, enter it and form a negative opinion about your website.
How to fix “Indexed, though blocked by robots.txt”?
You can find the status “Indexed, though blocked by robots.txt” at the bottom of the page indexing report in Google Search Console.
There you may see the “Improve search appearance” table.
After clicking Status, you’ll see a list of affected URLs and a chart showing how their number has changed over time.
The list can be filtered by URL or URL path. When many URLs are affected by this issue and you only want to look at some parts of your website, use the funnel icon on the right-hand side to filter the list.
Before you start troubleshooting, consider whether the URLs on the list should be indexed at all. Do they contain content that may be of value to your visitors?
When you want to index the page
If the page was blocked in robots.txt by mistake, you need to modify the file.
After you remove the Disallow directive that blocks crawling of your URL, it is likely that Googlebot will crawl it the next time it visits your website.
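To double-check that your edited rules no longer block a URL, you can test them locally – for instance with Python’s built-in `urllib.robotparser` module (the domain and paths below are hypothetical):

```python
from urllib import robotparser

# Hypothetical robots.txt rules that block only the /drafts/ directory
rules = """User-agent: *
Disallow: /drafts/""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# A URL under /drafts/ is disallowed; everything else may be crawled
print(rp.can_fetch("Googlebot", "https://example.com/drafts/page.html"))  # False
print(rp.can_fetch("Googlebot", "https://example.com/blog/post.html"))    # True
```

Running a check like this before deploying a robots.txt change can save you from accidentally blocking (or unblocking) the wrong section of your site.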
When you want to de-index the page
If the page contains information that you do not want to be shown to users who visit you via a search engine, you must indicate to Google that you do not want the page to be indexed.
The robots.txt file should not be used to control indexing
This file only prevents Googlebot from crawling. To control indexing, use the noindex tag instead.
Remember, Google must be able to crawl your page to discover this HTML tag, because the tag is part of the page’s content. If you add a noindex tag but keep the page blocked in robots.txt, Google will never detect the tag, and the page will remain “Indexed, though blocked by robots.txt”.
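The noindex tag itself is a single line placed inside the page’s `<head>` section:

```html
<meta name="robots" content="noindex">
```

For non-HTML resources such as PDFs, the equivalent is the `X-Robots-Tag: noindex` HTTP response header, since those files have no `<head>` to put a meta tag in.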
When Google crawls the page and sees the noindex tag, the page will be dropped from the index, and Google Search Console will display a different indexing status when you inspect this URL.
Keep in mind that if you want to keep any page away from Google and its users, it is always the safest option to implement HTTP authentication on your server. This way, only users who are logged in can access it. This is necessary if you want to protect sensitive data, for example.
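As a minimal sketch, assuming an Apache server, basic HTTP authentication for a protected directory could look like the snippet below (the file paths and realm name are hypothetical, and you would create the password file with the `htpasswd` utility):

```apacheconf
# .htaccess in the directory you want to protect
AuthType Basic
AuthName "Restricted area"
AuthUserFile /etc/apache2/.htpasswd
Require valid-user
```

With this in place, crawlers and anonymous visitors receive a 401 response, so the content can be neither crawled nor indexed.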
When you need a long-term solution
The solutions above will address the “Indexed, though blocked by robots.txt” issue for the affected pages. However, it may reappear for other pages in the future.
If it does, your website may need a comprehensive internal linking audit or a review of its backlink profile.
The status “Indexed, though blocked by robots.txt” applies to URLs that have been indexed without being crawled. There is a similar status in the page indexing report, “Blocked by robots.txt”, which applies to pages that are neither crawled nor indexed.
Let me show you the table again from the beginning to better illustrate this difference.
“Blocked by robots.txt” is usually less of a problem, while “Indexed, though blocked by robots.txt” should be treated as high priority. However, if you want to take a closer look at that status as well, you can check out our article on “Blocked by robots.txt”.
- The Disallow command in robots.txt prevents Google from crawling your page, but it does not prevent indexing.
- Having indexed and uncrawled pages is bad for your SEO.
- To fix “Indexed, though blocked by robots.txt”, you need to decide whether the affected pages should be visible in search and then:
- Edit your robots.txt file,
- Use the noindex meta tag if necessary.
- The status “Indexed, though blocked by robots.txt” may be a sign of serious internal linking and backlinking problems. Connect with Go Start Business to improve your links.