Monday 6 July 2020

Robots.txt file in AEM websites


When we think about AEM websites, SEO is one of the major considerations. To ensure crawlers can crawl our website, we need a sitemap.xml and a robots.txt that points the crawler to the corresponding sitemap.xml.

A robots.txt file lives at the root of the website. The diagram below shows the role of a robots.txt in any website: it acts as an entry point and ensures that crawlers access only the relevant items we have defined.



robots.txt in AEM websites

Let us see how we can implement a robots.txt file in our AEM website. There are many ways to do this, but the approach below is one of the easiest.

Say we have multiple (multi-lingual) websites with language roots /en, /fr, /gb and /in.

Let us see how we can enable robots.txt in our case.

Add robots.txt in Author

Log in to CRXDE on the author instance and create a file called 'robots.txt' under the path /content/dam/[sitename].
Add the following lines to the 'robots.txt' file:

#Any search crawler can crawl our site
User-agent: *

#Allow only below mentioned paths
Allow: /en/
Allow: /fr/
Allow: /gb/
Allow: /in/
#Disallow everything else
Disallow: /

#Crawl all sitemaps mentioned below
Sitemap: https://[sitename]/en/sitemap.xml
Sitemap: https://[sitename]/fr/sitemap.xml
Sitemap: https://[sitename]/gb/sitemap.xml
Sitemap: https://[sitename]/in/sitemap.xml

Now publish the robots.txt
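
To confirm the file made it to the publish tier before wiring up the URL mapping, a quick check like the one below can help. This is a minimal sketch, assuming a local publish instance on the default port 4503; replace [sitename] with your actual site name.

# Fetch the published file straight from its DAM path on the publish instance
curl -s http://localhost:4503/content/dam/[sitename]/robots.txt

You should see the User-agent, Allow and Sitemap lines exactly as entered in Author.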

Add OSGi configuration for URL mapping

Now add the below entry in the OSGi console > ConfigMgr, under 'Apache Sling Resource Resolver Factory'.

Add the below mapping in the 'URL Mappings' section:
/content/dam/[sitename]/robots.txt>/robots.txt$
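
If you prefer to keep OSGi configurations in code rather than editing them in the console, the same mapping can be deployed as a configuration file for the PID org.apache.sling.resourceresolver.impl.ResourceResolverFactoryImpl, typically in a publish run-mode folder since the mapping is needed where public requests are served. A minimal sketch, assuming an AEM version that supports .cfg.json OSGi configs and keeping the default root mapping in place:

{
  "resource.resolver.mapping": [
    "/:/",
    "/content/dam/[sitename]/robots.txt>/robots.txt$"
  ]
}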

Add rewrite rule / allow access to robots.txt via the dispatcher
Next, allow the crawlers to access robots.txt through the dispatcher.

Add an allow rule for robots.txt in the dispatcher:
/0010 { /type "allow" /url "/robots.txt" }
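
For context, the rule goes inside the /filter section of the dispatcher farm file. A minimal sketch, assuming a typical dispatcher.any layout and that the rule number /0010 does not clash with your existing filters:

/filter {
  # ... existing deny/allow rules ...
  # Allow crawlers to fetch robots.txt at the site root
  /0010 { /type "allow" /url "/robots.txt" }
}

If you go with a web-server rewrite instead of (or in addition to) the Sling mapping, a mod_rewrite rule in the virtual host can map the public path to the DAM location. This is a hypothetical rule; adjust it to your vhost setup:

RewriteRule ^/robots.txt$ /content/dam/[sitename]/robots.txt [PT,L]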

When you hit https://www.[sitename]/robots.txt you should see the robots.txt file on the public domain.
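
A quick way to verify from the outside is to check the response headers; a sketch, assuming the site is already reachable on its public domain:

curl -I https://www.[sitename]/robots.txt

Expect a 200 response with Content-Type: text/plain rather than a Content-Disposition: attachment header.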

Now any search engine that tries to access our site will find the robots.txt and recognise whether it has permission to crawl the site and which areas of the site it can crawl.

Some sample usages of robots.txt are given below.


# Disallow googlebot accessing example.com/directory1/... and example.com/directory2/...
# but allow access to subdirectories -> directory2/subdirectory1/...
# All other directories on the site are allowed by default.
User-agent: googlebot
Disallow: /directory1/
Disallow: /directory2/
Allow: /directory2/subdirectory1/

# Block the entire site from xyzcrawler.
User-agent: xyzcrawler
Disallow: /


Let me know via the comments section if you find a better way to do this.

2 comments:

  1. I am using this tutorial, but robots.txt is being downloaded by the browser. Any idea about what I am doing wrong? Thanks.

  2. The Content-Disposition is set as attachment instead of inline. Add this DAM path in your Apache Sling Content Disposition Filter configuration.
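
    For reference, a minimal sketch of such a configuration for the PID org.apache.sling.security.impl.ContentDispositionFilter, assuming the excluded-paths property is the right knob for your AEM version (verify the property name in ConfigMgr before relying on it):

    {
      "sling.content.disposition.excluded.paths": [
        "/content/dam/[sitename]/robots.txt"
      ]
    }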
