The latest robots.txt update bids farewell to undocumented and unsupported rules
Yesterday, Google announced that it is open-sourcing
its production robots.txt parser.
The question that stood out was: why does the code not include a handler for other rules such as crawl-delay?
In the draft specification published yesterday, Google provided an extensible architecture for rules that are not part of the standard. In simple terms, crawlers have the freedom to support their own lines, such as “unicorns: allowed”.
You can see this in the Google Robots.txt Parser and Matcher Library on GitHub.
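For instance, a robots.txt file could carry such a non-standard line alongside the usual rules (the “unicorns” rule is the tongue-in-cheek example above, and the path shown is hypothetical):

```
User-agent: *
Disallow: /private/
unicorns: allowed
```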
Google had never documented rules such as crawl-delay, nofollow, and noindex. Moreover, because Googlebot rarely honored them, relying on such rules could hurt a website’s presence in the SERPs.
Starting September 1, 2019, Google will retire all code that handles unsupported and unpublished rules, including noindex.
What about the noindex directive in the robots.txt file?
Google has taken this into account and provided alternative options as part of the robots.txt update.
Robots.txt Update
Noindex in robots meta tags
You can use this directive to remove URLs from the index when crawling is allowed. It is supported both in HTTP response headers and in HTML.
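As a sketch, the directive can appear either in the page’s HTML head or as an HTTP response header (the header variant is named X-Robots-Tag):

```html
<!-- In the HTML: keep this page out of the index -->
<meta name="robots" content="noindex">
```

```
# As an HTTP response header
X-Robots-Tag: noindex
```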
404 and 410 HTTP status codes
Does the webpage no longer exist? Both these status codes will remove URLs from Google’s index once the pages are crawled and processed.
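As a minimal sketch, assuming an nginx server and a hypothetical path, a removed page can be served with a 410 Gone status:

```nginx
# Tell crawlers this page is permanently gone
location = /retired-page.html {
    return 410;
}
```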
Password protection

If you hide a web page behind a login, it will generally be dropped from Google’s index. Alternatively, you can use markup to indicate subscription or paywalled content.
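A sketch of such markup, using schema.org’s isAccessibleForFree property (the article type and CSS selector here are assumptions for illustration):

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "NewsArticle",
  "isAccessibleForFree": false,
  "hasPart": {
    "@type": "WebPageElement",
    "isAccessibleForFree": false,
    "cssSelector": ".paywalled-section"
  }
}
</script>
```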
Disallow in robots.txt

If you block a page from being crawled, it will usually not be indexed. What if the blocked URL is linked from other pages? Google says that it will make such pages less visible in the future.
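For reference, blocking a section with a Disallow rule looks like this (the path is hypothetical):

```
User-agent: *
Disallow: /members-only/
```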
Search Console Remove URL tool
You can use this tool to quickly remove a URL from Google’s search results on a temporary basis.
What do you think of this robots.txt update? Leave your comments below.