About MJ12Bot

Bot Type: Good crawler (always identifies itself)
IP Range: Distributed, worldwide
Obeys Robots.txt: Yes
Obeys Crawl Delay: Yes
Data served at: Majestic.com

Majestic is a UK-based specialist search engine used by hundreds of thousands of businesses in 13 languages and over 60 countries to paint a map of the Internet independent of the consumer-based search engines. Majestic also powers other legitimate technologies that help to understand the continually changing fabric of the web.

Web site owners can see data about their own websites on majestic.com.

MJ12Bot does not currently cache web content or personal data. Instead it maps the link relationships between websites to build a search engine. This data is available to technologies and the public, either by searching for a keyword or a website at Majestic. Details about the community project behind the crawlers are at Majestic12.co.uk.

What is MJ12bot doing on my site(s)?

We spider the Web to build a search engine, using a fast and efficient downloadable distributed crawler that enables people with broadband connections to contribute to what we hope will become the biggest search engine in the world. Production of a full text search engine at Majestic-12 is currently in the research phase, funded in part by the commercialisation of research at Majestic.

What happens to the crawled data?

Crawled data (currently only a web graph of links) is added to the largest public backlinks search engine index that we maintain as a dedicated tool called Site Explorer. Learn about your own backlinks from the extensive backlinks index.

My web host is blocking your bot, why?

Some ISPs and badly configured firewalls may stop MJ12Bot from crawling your website. This is usually because the ISP or firewall operator does not realise that blocking the bot can cost your website genuine visitors at a later date. Some also block it to reduce bandwidth use. In these cases, some ISPs will remove the block for all their users once they understand the purpose of the bot. If your ISP will not allow our bot, we recommend that you consider moving to another ISP.

Why do you keep crawling 404 or 301 pages?

We have a long memory and want to ensure that temporary errors, website-down pages or other temporary changes to sites do not cause irreparable changes to your site profile when they shouldn't. Also, if there are still links to these pages, they will continue to be found and followed. Google is asked the same question and has published a statement on it. Their reason is of course the same as ours, and their answer can be found here: Google 404 policy.

You are crawling links with rel=nofollow

This is a common misunderstanding of the (perhaps poorly named) nofollow attribute. Google introduced the 'rel=nofollow' attribute in 2005, stating that links so marked would not influence the target's PageRank. The attribute does not stop the crawler from visiting the target page. This becomes particularly obvious if the target page has several links to it: some may have this attribute and some may not. If you wish to stop bots from crawling a page, use the robots.txt file to disallow the target page.

More information on rel=nofollow can be found here: Wikipedia Nofollow

How can I block MJ12bot?

MJ12bot adheres to the robots.txt standard. If you want to prevent the bot from crawling your website, add the following text to your robots.txt:

User-agent: MJ12bot
Disallow: /

Please do not block our bot via IP in .htaccess - we do not use any consecutive IP blocks as we are a community-based distributed crawler. Please always make sure the bot can actually retrieve robots.txt itself. If it can't, then it will assume that it is okay to crawl your site.
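Before deploying a rule like the one above, it can help to check how a robots.txt parser reads it. The following is a minimal sketch using Python's standard urllib.robotparser module. The helper name is_blocked and the example.com URLs are illustrative assumptions, and the standard library parser is not MJ12bot's own parser, so treat the result as a sanity check only.

```python
from urllib.robotparser import RobotFileParser


def is_blocked(robots_txt, url, agent="MJ12bot"):
    """Return True if the given robots.txt text disallows `url` for `agent`.

    Hypothetical helper: it parses the text locally and does not
    contact any server.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)


# The rule from the text above.
ROBOTS_TXT = """\
User-agent: MJ12bot
Disallow: /
"""

if __name__ == "__main__":
    # Placeholder URLs on example.com, used for illustration only.
    print(is_blocked(ROBOTS_TXT, "https://example.com/any/page.html"))
    print(is_blocked(ROBOTS_TXT, "https://example.com/", agent="OtherBot"))
```

The second call shows that the rule applies only to the named user agent. Other crawlers are unaffected unless they have their own section or a * wildcard section.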

If you have reason to believe that MJ12bot did NOT obey your robots.txt commands, then please let us know via email: bot@majestic12.co.uk. Please provide the URL of your website and log entries showing the bot trying to retrieve pages that it was not supposed to.

What commands in robots.txt does MJ12bot support?

The current crawler supports the following non-standard extensions to robots.txt:

  • Crawl-Delay for up to 20 seconds (higher values will be rounded down to the maximum our bot supports)
  • Redirects (within the same site) when trying to fetch robots.txt
  • Simple pattern matching in Disallow directives compatible with Yahoo's wildcard specification
  • Allow directives can override Disallow if they are more specific (longer in length)
  • Certain failures to fetch robots.txt, such as 403 Forbidden, will be treated as a blanket disallow directive
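The interaction between wildcard patterns and the "more specific Allow wins" rule can be sketched as follows. This is an illustration of longest-match precedence, not MJ12bot's actual code. It assumes that '*' matches any run of characters and that a trailing '$' anchors the end of the URL path, and it assumes that Allow wins when two matching patterns have equal length. The function names are hypothetical.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    Assumption: '*' matches any characters, a trailing '$' anchors the end.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_allowed(path, rules):
    """Decide whether `path` may be crawled.

    `rules` is a list of (directive, pattern) pairs, where directive is
    "allow" or "disallow". The longest matching pattern decides.
    """
    best_length = -1
    allowed = True  # no matching rule means the path may be crawled
    for directive, pattern in rules:
        if not pattern:
            continue  # an empty pattern matches nothing
        if pattern_to_regex(pattern).match(path):
            longer = len(pattern) > best_length
            tie_for_allow = len(pattern) == best_length and directive == "allow"
            if longer or tie_for_allow:
                best_length = len(pattern)
                allowed = directive == "allow"
    return allowed


RULES = [
    ("disallow", "/private/"),
    ("allow", "/private/public/"),
    ("disallow", "/*.pdf$"),
]

if __name__ == "__main__":
    for path in ("/private/secret.html", "/private/public/page.html",
                 "/docs/report.pdf", "/index.html"):
        print(path, is_allowed(path, RULES))
```

Here the Allow pattern /private/public/ is longer than the Disallow pattern /private/, so pages below /private/public/ remain crawlable while the rest of /private/ stays blocked.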

Why did my robots.txt block not work on MJ12bot?

We are keen to see any reports of potential violations of robots.txt by MJ12bot.

A number of reports turn out to be false positives - the following can be a useful checklist when configuring a web server:

  1. Off-site redirects when requesting robots.txt - MJ12Bot follows redirects, but only on the same domain. The ideal is for robots.txt to be available at "/robots.txt" as specified in the standard.
  2. Multiple domains running on the same server. Modern webservers such as Apache can log accesses to a number of domains to one file - this can cause confusion when attempting to see which website was accessed at which point. You may wish to consider adding domain information to the access log, or splitting access logs on a per-domain basis.
  3. Robots.txt out of sync with developer copy. We have had complaints that MJ12Bot has disobeyed robots.txt - only to find out that the developer was testing against a development server, which was not in sync with the live version.
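For the first item in the checklist, a quick way to spot an off-site redirect is to compare the host of the robots.txt URL with the host of the redirect target. The sketch below assumes that "same domain" means an identical hostname, which may be stricter or looser than the rule the bot actually applies. The function name and the example hosts are hypothetical.

```python
from urllib.parse import urljoin, urlsplit


def redirect_stays_on_host(robots_url, location):
    """Return True if a redirect from `robots_url` to `location`
    stays on the same hostname.

    `location` is the value of the Location response header and may be
    relative. Assumption: "same domain" is read as "same hostname".
    """
    target = urljoin(robots_url, location)  # resolve relative redirects
    source_host = urlsplit(robots_url).hostname
    target_host = urlsplit(target).hostname
    return source_host is not None and source_host == target_host


if __name__ == "__main__":
    print(redirect_stays_on_host("https://example.com/robots.txt",
                                 "/static/robots.txt"))
    print(redirect_stays_on_host("https://example.com/robots.txt",
                                 "https://cdn.example.net/robots.txt"))
```

A redirect that fails this check is a likely reason for a robots.txt file not being picked up, so serving the file directly at "/robots.txt" avoids the question entirely.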

How can I slow down MJ12bot?

You can easily slow down the bot by adding the following to your robots.txt file:

User-Agent: MJ12bot
Crawl-Delay: 5

Crawl-Delay should be an integer and signifies the number of seconds to wait between requests. MJ12bot will wait up to 20 seconds between requests to your site. Note, however, that while it is unlikely, it is still possible for your site to be crawled by multiple MJ12bots at the same time. Setting a high Crawl-Delay should minimise the impact on your site. The Crawl-Delay parameter also takes effect if it is set in the * wildcard section.

If you have an MJ12bot section, it is used in preference to the * wildcard section, not in addition to it. So if your * wildcard section contains a Crawl-Delay and a bot-specific MJ12bot section exists, you must copy the Crawl-Delay into the MJ12bot section as well for it to apply to our bot.
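The section precedence described above is easy to get wrong, so here is a small sketch that reports the delay a robots.txt file would yield for MJ12bot. It relies on Python's standard urllib.robotparser, which happens to resolve sections in the same "specific section replaces wildcard section" way. It is not the bot's own parser. The helper name effective_crawl_delay and the sample files are illustrative assumptions, and the 20 second cap is taken from the text above.

```python
from urllib.robotparser import RobotFileParser

MAX_DELAY_SECONDS = 20  # upper limit stated in the text above


def effective_crawl_delay(robots_txt, agent="MJ12bot"):
    """Return the Crawl-Delay (in seconds) that applies to `agent`,
    capped at MAX_DELAY_SECONDS, or None if no delay applies.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(agent)
    if delay is None:
        return None
    return min(delay, MAX_DELAY_SECONDS)


# Only a wildcard section: its delay applies, capped at the maximum.
WILDCARD_ONLY = """\
User-agent: *
Crawl-Delay: 30
"""

# An MJ12bot section without a delay hides the wildcard delay.
SPECIFIC_WITHOUT_DELAY = """\
User-agent: *
Crawl-Delay: 10

User-agent: MJ12bot
Disallow: /private/
"""

# The delay is copied into the MJ12bot section, so it applies.
SPECIFIC_WITH_DELAY = """\
User-agent: *
Crawl-Delay: 10

User-agent: MJ12bot
Crawl-Delay: 10
Disallow: /private/
"""

if __name__ == "__main__":
    for text in (WILDCARD_ONLY, SPECIFIC_WITHOUT_DELAY, SPECIFIC_WITH_DELAY):
        print(effective_crawl_delay(text))
```

The second sample is the common mistake: the wildcard delay is silently lost as soon as a dedicated MJ12bot section exists without its own Crawl-Delay line.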

What are the current versions of MJ12bot?

Current v1.4.x series operating versions of MJ12bot are:

  • v1.4.8 (Current - April 2017)
  • v1.4.7 (Being Replaced with 1.4.8 - End 2018)
  • v1.4.6 (Being Replaced with 1.4.7 - June 2016)
  • v1.4.5 (Phased out - June 2016)
  • v1.4.4 (Phased out - May 2014)

How do I verify requests are from you?

As a community project, we unfortunately cannot restrict our bots to a limited number of IP addresses, as some of our better-funded counterparts do. However, we can send a pre-arranged ident string with all requests to your site. This can be sent as part of the HTTP or HTTPS headers in the 'CRAWLER-IDENT' field, or as part of the User-Agent string. We will not share this string with anyone else, nor send it to any domain or subdomain other than the ones you request, so requests that include it can be validated as coming from our network. To make use of this facility, please contact bot@majestic12.co.uk with details of your site and the ident you would like us to send; if you prefer, we can generate a random ident string for you.
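On the receiving side, a check for the pre-arranged ident might look like the sketch below. The 'CRAWLER-IDENT' header name comes from the text above. The function name, the plain dict of request headers and the sample ident value are assumptions made for illustration, and a real web framework will expose headers through its own API.

```python
import hmac


def request_is_from_mj12bot(headers, expected_ident):
    """Return True if the request carries the pre-arranged ident string.

    `headers` is a mapping of HTTP header names to values.
    `expected_ident` is the secret agreed with the bot operators.
    """
    if not expected_ident:
        return False  # nothing agreed, nothing to verify

    # HTTP header names are case-insensitive, so normalise them first.
    normalized = {name.lower(): value for name, value in headers.items()}

    # Case 1: ident sent in the dedicated CRAWLER-IDENT header.
    ident = normalized.get("crawler-ident", "")
    if ident and hmac.compare_digest(ident.encode(), expected_ident.encode()):
        return True

    # Case 2: ident embedded in the User-Agent string.
    user_agent = normalized.get("user-agent", "")
    return "mj12bot" in user_agent.lower() and expected_ident in user_agent


if __name__ == "__main__":
    secret = "example-ident-123"  # placeholder value, not a real ident
    print(request_is_from_mj12bot({"Crawler-Ident": secret}, secret))
    print(request_is_from_mj12bot({"User-Agent": "MJ12bot"}, secret))
```

The header comparison uses hmac.compare_digest so that the check takes the same time whether or not the supplied value is close to the real ident.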

If you have not been satisfied with the information above then feel free to contact us: bot@majestic12.co.uk

Majestic-12 Ltd

Faraday Wharf, Holt Street, Birmingham, West Midlands, B7 4BB, UK