Search engine crawlers are occasionally blocked from reading content for a variety of reasons: the content isn’t intended to be public or searchable, or indexing it might hurt the site’s SEO. When this is needed, a typical approach is to add a robots meta tag to the HTML or a .htaccess line to prevent crawling.
Nginx can add headers to the responses it serves, so it can also be used to block crawlers:
add_header X-Robots-Tag "noindex, nofollow, nosnippet, noarchive";
If the Nginx configuration references config files with a glob like
include conf.d/*.conf, this line could be added in a single file like
But why would we block crawlers at the webserver level instead of at the content level?
Great question. Nginx is a very versatile and compact webserver, and it is often included in containerized stacks as a reverse proxy or a load balancer. If you were using a balancer like CausticLab/rgon-proxy, you could add this configuration to the top-level Nginx instance, which would add the header to every route in the environment.
This is great for development or staging environments, where you want to demo content but don’t want bots to index it prematurely!
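To make the top-level proxy idea concrete, here is a minimal sketch of what such a server block might look like. The hostname (staging.example.com) and the upstream address (app:8080) are placeholders I made up for illustration; they are not taken from any particular proxy image, so swap in whatever your stack actually uses.

```nginx
# Hypothetical top-level reverse proxy for a staging environment.
# server_name and the proxy_pass target are illustrative placeholders.
server {
    listen 80;
    server_name staging.example.com;

    # Ask crawlers not to index, follow links from, show snippets of,
    # or keep a cached copy of anything served through this proxy.
    # "always" attaches the header to error responses too, not just
    # successful ones and redirects.
    add_header X-Robots-Tag "noindex, nofollow, nosnippet, noarchive" always;

    location / {
        # Forward every route to the application behind the proxy.
        proxy_pass http://app:8080;
        proxy_set_header Host $host;
    }
}
```

One thing to watch for: Nginx only inherits add_header directives from the enclosing level when the current level defines none of its own. If a location block adds its own header, the X-Robots-Tag line has to be repeated there, or that route will go out without it.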