What is MJ12bot doing on my site(s)?
We spider the Web to build a search engine, using a fast and efficient downloadable distributed crawler that lets people with broadband connections contribute to what we hope will become the biggest search engine in the world. Production of a full-text search engine at Majestic-12 is currently in the research phase, funded in part by the commercialisation of research at Majestic.
What happens to the crawled data?
Crawled data (currently only a web graph of links) is added to the largest public backlinks search engine index that we maintain as a dedicated tool called Site Explorer. All webmasters can obtain full free data on backlinks by verifying ownership of their site - learn about your own backlinks from the extensive backlinks index.
My web host is blocking your bot, why?
Some ISPs and badly configured firewalls may stop MJ12bot from crawling your website. This is usually because the ISP or firewall operator does not realise that blocking the bot can cost your website genuine visitors at a later date. Some also do this to reduce bandwidth use. In these cases, some ISPs will remove the block for all their users once they understand the purpose of the bot. If your ISP will not allow our bot, we recommend that you consider moving ISPs.
Why do you keep crawling 404 or 301 pages?
We have a long memory and want to ensure that temporary errors, "website down" pages or other temporary changes to a site do not cause irreparable changes to its profile when they shouldn't. Also, if there are still links to these pages, they will continue to be found and followed. Google is asked this question too and has published a statement; their reasoning is the same as ours, and their answer can be found here: Google 404 policy.
You are crawling links with rel=nofollow
This is a common misunderstanding of the (perhaps poorly named) nofollow attribute. Google introduced the 'rel=nofollow' attribute in 2005, stating that links so marked would not influence the target's PageRank. It does not stop a crawler from visiting the target page. This becomes particularly obvious when the target page has several links pointing to it, some with this attribute and some without. If you wish to stop bots from crawling a page, use the robots.txt file to disallow the target page.
More information on rel=nofollow can be found here: Wikipedia Nofollow
How can I block MJ12bot?
MJ12bot adheres to the robots.txt standard. If you want to prevent the bot from crawling your website, add the following text to your robots.txt:
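A blanket block for this bot only, assuming the standard User-agent and Disallow directives and the bot's MJ12bot token, would look like this:

```
User-agent: MJ12bot
Disallow: /
```

Replace the "/" with a more specific path if only part of the site should be off limits.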
Please do not block our bot via IP in htaccess - we do not use any consecutive IP blocks as we are a community based distributed crawler. Please always make sure the bot can actually retrieve robots.txt itself. If it can't then it will assume that it is okay to crawl your site.
If you have reason to believe that MJ12bot did NOT obey your robots.txt commands, then please let us know via email: email@example.com. Please provide the URL of your website and log entries showing the bot trying to retrieve pages that it was not supposed to.
What commands in robots.txt does MJ12bot support?
The current crawler supports the following non-standard extensions to robots.txt:
- Crawl-Delay for up to 20 seconds (higher values will be rounded down to the maximum our bot supports)
- Redirects (within the same site) when trying to fetch robots.txt
- Simple pattern matching in Disallow directives compatible with Yahoo's wildcard specification
- Allow directives can override Disallow if they are more specific (longer in length)
- Certain failures to fetch robots.txt, such as 403 Forbidden, will be treated as a blanket Disallow directive
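The interaction between wildcard patterns and the Allow/Disallow length rule can be sketched in Python as follows. This is an illustrative model, not MJ12bot's actual code: the function names are hypothetical, the "*" and trailing "$" handling follows the commonly documented wildcard convention, and treating an equal-length tie as Disallow is an assumption drawn from the "longer in length" wording above.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters; a trailing '$' anchors the end.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal parts and join them with '.*' for each wildcard.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_allowed(path, rules):
    """Return True if `path` may be crawled under `rules`.

    `rules` is a list of (kind, pattern) tuples, kind being "allow"
    or "disallow". Assumption: an Allow wins only when its pattern is
    strictly longer than the longest matching Disallow pattern.
    """
    best_allow = -1
    best_disallow = -1
    for kind, pattern in rules:
        if not pattern:
            continue  # an empty pattern matches nothing
        if pattern_to_regex(pattern).match(path):
            if kind == "allow":
                best_allow = max(best_allow, len(pattern))
            else:
                best_disallow = max(best_disallow, len(pattern))
    if best_disallow < 0:
        return True  # no Disallow rule matched
    return best_allow > best_disallow


rules = [
    ("disallow", "/private/"),
    ("allow", "/private/public/"),
    ("disallow", "/*.pdf$"),
]
print(is_allowed("/private/secret.html", rules))       # False
print(is_allowed("/private/public/page.html", rules))  # True
print(is_allowed("/docs/file.pdf", rules))             # False
```

Comparing pattern lengths rather than rule order means the result does not depend on how the directives are arranged in the file.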
Why did my robots.txt block not work on MJ12bot?
We are keen to see any reports of potential violations of robots.txt by MJ12bot.
A number of reports turn out to be false positives. The following checklist can be useful when configuring a web server:
- Off-site redirects when requesting robots.txt. MJ12bot follows redirects, but only on the same domain. Ideally, robots.txt should be available at "/robots.txt" as specified in the standard.
- Multiple domains running on the same server. Modern web servers such as Apache can log accesses to a number of domains in one file, which can make it hard to tell which domain was accessed at which point. You may wish to consider adding domain information to the access log, or splitting access logs on a per-domain basis.
- Robots.txt out of sync with the developer copy. We have had complaints that MJ12bot disobeyed robots.txt, only to find out that the developer was testing against a development server that was not in sync with the live version.
How can I slow down MJ12bot?
You can easily slow down the bot by adding the following to your robots.txt file:
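For example, assuming a 5-second delay is wanted (the value is illustrative):

```
User-agent: MJ12bot
Crawl-Delay: 5
```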
Crawl-Delay should be an integer giving the number of seconds to wait between requests. MJ12bot will wait up to 20 seconds between requests to your site. Note, however, that while it is unlikely, your site may still be crawled by multiple MJ12bots at the same time. Setting a high Crawl-Delay should minimise the impact on your site. The Crawl-Delay parameter also takes effect if it is set for the * wildcard.
If our bot detects that you used Crawl-Delay for any other bot then it will automatically crawl slower even though MJ12bot specifically was not asked to do so.
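To check which delay a given robots.txt would yield, the rules above can be approximated with Python's standard urllib.robotparser. This is a rough sketch, not the bot's own logic: effective_crawl_delay and MAX_DELAY_SECONDS are hypothetical names, the 20-second cap comes from the text above, and the fallback to Crawl-Delay values set for other named bots is not modelled.

```python
from urllib.robotparser import RobotFileParser

# Maximum delay the text above says the bot supports.
MAX_DELAY_SECONDS = 20


def effective_crawl_delay(robots_txt, agent="MJ12bot"):
    """Return the delay in seconds the file implies for `agent`.

    Uses the agent's own Crawl-Delay if present, otherwise the one
    set for the * wildcard, capped at MAX_DELAY_SECONDS. Returns
    None when neither is set.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(agent)
    if delay is None:
        return None
    return min(delay, MAX_DELAY_SECONDS)


print(effective_crawl_delay("User-agent: MJ12bot\nCrawl-Delay: 5\n"))  # 5
print(effective_crawl_delay("User-agent: *\nCrawl-Delay: 60\n"))       # 20
```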
What are the current versions of MJ12bot?
Current v1.4.x series operating versions of MJ12bot are:
- v1.4.8 (Current - April 2017)
- v1.4.7 (Being Replaced with 1.4.8 - End 2018)
- v1.4.6 (Being Replaced with 1.4.7 - June 2016)
- v1.4.5 (Phased out - June 2016)
- v1.4.4 (Phased out - May 2014)
How do I verify requests are from you?
As a community project, unfortunately we don't have the ability to restrict our bots to a limited number of IP addresses, as some of our better funded counterparts do. However, we can send a pre-arranged ident string with all requests to your site. This can be sent as part of the HTTP or HTTPS headers in the 'CRAWLER-IDENT' field, or as part of the User-Agent string. We will not share this string with anyone else, nor send it to any domain or subdomain other than the ones you request, so requests that include it can be validated as coming from our network. To make use of this facility please contact firstname.lastname@example.org with details of your site and the ident you would like us to send, or if you prefer we can generate a random ident string for you.
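On the receiving side, a check for the pre-arranged ident might look like the Python sketch below. Everything here is illustrative: carries_ident is a hypothetical helper, the headers are assumed to arrive as a plain dict of names to values, and the ident and User-Agent values in the usage lines are made-up placeholders.

```python
import hmac


def carries_ident(headers, expected_ident):
    """Return True if the request headers contain the agreed ident.

    The ident is looked for in the CRAWLER-IDENT header first, then
    as a substring of the User-Agent header. Header names are
    compared case-insensitively.
    """
    if not expected_ident:
        raise ValueError("expected_ident must be a non-empty string")
    normalized = {name.lower(): value for name, value in headers.items()}
    header_value = normalized.get("crawler-ident", "")
    # Constant-time comparison avoids leaking the ident through timing.
    if hmac.compare_digest(header_value.encode("utf-8"),
                           expected_ident.encode("utf-8")):
        return True
    return expected_ident in normalized.get("user-agent", "")


ident = "example-ident-1234"  # placeholder, agreed in advance
print(carries_ident({"Crawler-Ident": ident}, ident))                  # True
print(carries_ident({"User-Agent": "SomeBot/1.0 other-ident"}, ident))  # False
```

Requests that claim to be MJ12bot but fail this check can then be treated like any other unverified client.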