How do you spot a spammer robot? This might seem like a difficult question that takes hard-won experience to answer, but I must admit it can be quite easy in the end. You might wonder how that's possible.
- Legitimate search engines crawl and index your site with your permission, typically after you have submitted it to them. SpamBots just do it.
- Legitimate search engines crawl and index your site without causing major disruption. SpamBots can put a heavy load on the site, as they download all of the content and sometimes parse it in one go, so each visit can be costly in server resources.
- Legitimate search engines can be controlled through the robots.txt file. In theory SpamBots can too, but in practice many of them ignore it.
- Legitimate search engines and bots use sophisticated indexing algorithms and look for new content according to the directives you've given them. SpamBots are built to scrape all or large parts of your content, in most cases purely for their own purposes.
Well, this all looks great in theory, but how do you spot them in practice? Take the case where SpamBots put a heavy load on the site. As you'd imagine, this can be a challenging task. You'd be tearing your hair out trying to figure out why the server is running at full throttle, anxious about what is causing it and what to do.
KEEP CALM and let's put theory into practice!
First of all, if your server is hosted on a cloud platform, log in to its web console (the AWS console, for example). Locate the machine that's causing the performance issues and look at its monitoring graphs. You should be able to spot peaks in traffic and server load, some of which may last for extended periods of time.
OK, now look at the times when the load peaks occur and note one of them down. Since the load bursts coincide with traffic peaks, let's take a look at the Apache log files. Log files can be quite big, up to 1GB in size, so searching through them can be tedious. Luckily, we can use grep, the Linux/Unix command line tool for searching for strings in files. OK, but how do you find a spammer? Remember the load peak times you looked at earlier. They are our hint!
Apache web server writes its access logs in the Common Log Format (or the Combined variant, which also records the referrer and user agent). One of the fields is a timestamp, so every request records the date and time at which it occurred.
So we are ready to narrow down our search:
cd /var/log/apache2/
cat www.mysite.com.access.log | grep "29/Dec/2014:14"
The above simply prints the contents of the log file and runs a pattern search on the output. In this example we are looking for all log entries recorded on the 29th of December 2014 between 2:00 and 2:59 in the afternoon. Now take a look at the data on the screen. You could get many lines of output, depending on how much traffic your site gets.
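If the hour you picked produces too many lines to read by eye, a per-client summary can help. The sketch below is one possible way to get it, assuming the client address is the first whitespace-separated field of each log line (as it is in the Common and Combined Log Formats). The function name top_clients is made up for this example.

```shell
# Hypothetical helper: rank client addresses by number of requests,
# looking only at log lines that contain the given timestamp prefix.
# Steps: keep matching lines, take the first field (client address),
# count identical addresses, then list the ten busiest first.
# Usage: top_clients LOGFILE TIME_PREFIX
top_clients() {
    grep -F -- "$2" "$1" | awk '{print $1}' | sort | uniq -c | sort -rn | head -n 10
}
```

For the example above it would be called as: top_clients www.mysite.com.access.log "29/Dec/2014:14". A single address far ahead of the rest is worth a closer look.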
Even though spam robots try to hide themselves as much as possible, many of them still identify themselves. Normally their identity comes from the User Agent string, which is also logged.
What is a user agent? According to Wikipedia:
In computing, a user agent is software (a software agent) that is acting on behalf of a user. For example, an email reader is a mail user agent, and in the Session Initiation Protocol (SIP), the term user agent refers to both end points of a communications session.
In the usual case on the web, this is the browser we use to access the site. In that case the user agent string would look like "Mozilla/5.0 (Linux; U; Android 4.1.2; en-gb; GT-I8190N Build/JZO54K) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30", which is the stock browser on an Android phone (nearly all browsers start their user agent with the historical "Mozilla/5.0" token).
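One way to make unusual clients stand out is to rank the User Agent strings by how often they appear. This is only a sketch: it assumes the log uses the Combined Log Format with no escaped double quotes inside fields, so that splitting a line on double quotes leaves the User Agent in the sixth piece. The function name top_agents is invented for illustration.

```shell
# Hypothetical helper: rank User Agent strings by number of requests.
# Assumes the Combined Log Format, where splitting a line on double
# quotes puts the User Agent in the sixth field.
# Reads the given log file(s), or standard input when none is given.
# Usage: top_agents [LOGFILE...]
top_agents() {
    awk -F'"' '{print $6}' "$@" | sort | uniq -c | sort -rn | head -n 10
}
```

It can be run on the whole file (top_agents www.mysite.com.access.log) or on just the peak hour by piping the earlier grep output into it.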
Now try to spot something which does not look like a normal web browser! In our case it was AhrefsBot that was causing massive disruption on our site. According to their website you can block them with an entry in the robots.txt file, however a little more googling led me to the resource "AhrefsBot - SEO Spybots", which suggested blocking this particular spammer at the IP level.
cat www.mysite.com.access.log | grep -c "AhrefsBot"
In the command above I am using the -c argument, which prints the number of matching lines instead of the lines themselves.
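A single total does not show whether the bot's requests line up with the load peaks. A rough way to check is to break the count down by hour. The sketch below assumes the timestamp is the fourth whitespace-separated field, in the form [29/Dec/2014:14:05:00, and that the log is in chronological order. The name hits_per_hour is hypothetical.

```shell
# Hypothetical helper: count matching requests per hour.
# Assumes the fourth whitespace-separated field is the timestamp,
# e.g. [29/Dec/2014:14:05:00, so characters 2-15 hold date and hour.
# Relies on the log being chronological, so uniq can count runs
# without a prior sort and the output stays in time order.
# Usage: hits_per_hour LOGFILE STRING
hits_per_hour() {
    grep -F -- "$2" "$1" | awk '{print substr($4, 2, 14)}' | uniq -c
}
```

Calling hits_per_hour www.mysite.com.access.log "AhrefsBot" prints one line per hour, which can be compared against the peaks on the monitoring graphs.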
As a result, we temporarily blocked all traffic from this User Agent (as the identifier of a spammer robot) for our customer, because it was degrading the site's normal load times. If this tool is used by legitimate businesses for SEO and link analysis, we really hope that the guys from AhrefsBot will get in touch.
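For crawlers that do honor robots.txt, the opt-out mentioned earlier is a two-line rule. The sketch below appends such a rule to a robots.txt file. The function name and the file path in the usage line are assumptions made for this example, and the rule has no effect on bots that ignore robots.txt.

```shell
# Hypothetical helper: append a rule asking one crawler to stay off
# the whole site. Only effective for bots that respect robots.txt.
# The leading blank line separates the rule from any existing group.
# Usage: disallow_bot ROBOTS_FILE BOT_NAME
disallow_bot() {
    printf '\nUser-agent: %s\nDisallow: /\n' "$2" >> "$1"
}
```

For example: disallow_bot /var/www/html/robots.txt AhrefsBot, where the path is whatever your site's document root happens to be.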
If you are interested in our system administration support and consultancy, or you'd like to find out more about how to optimize your web applications' performance, do not hesitate to get in touch!