RSA Admin

How to Detect Web Scrapers and Crawlers

Discussion created by RSA Admin Employee on Aug 28, 2012
Latest reply on Aug 28, 2012 by RSA Admin

A customer asked us whether we could detect when an attacker tries to mirror their web server, or when someone runs a web-based vulnerability scan against a web server in the monitored environment.  We don't publish native content for this because the solution has to be customized for each customer environment, but it is pretty easy to do.


  • Step 1:  Create a decoder rule called 'site-crawl attempt' where "query count 50-u"
  • Step 2:  Customize for your environment.  The rule needs to be scoped to a specific direction; otherwise outbound connections to Google Maps, for instance, will trigger it easily.  If you are worried about a single web server, the rule should be:
    • && query count 50-u

Or, if it is against a cluster of web servers, you can use the alias host.  I've even had one customer that used

    • org.dst=themselves && query count 50-u 

and it works just fine.


  • Step 3:  Once the Investigator rule is in place and you have validated that it is working (you might have to adjust the threshold of 50 higher or lower depending on the type of server), create an Informer report.  The report should look for the user agent strings of the inbound connections that are running heavy queries.


In the query field you look for 'client'

In the where field you look for alert='site-crawl attempt'


The results will show you all of the normal bots and crawlers, such as Google, Yandex and Yahoo, in addition to targeted vulnerability scanning attempts.  You can even chart this out in an Informer graph.  I've done exactly this many times in the field.
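
As a rough illustration of the logic behind the rule and the report, here is a minimal Python sketch. It is not NetWitness code: the session dictionaries, the field names (dst_host, query_count, client) and the host www.example.com are all hypothetical, and reading "50-u" as "50 or more" is an assumption.

```python
from collections import Counter

# Assumed reading of "query count 50-u": 50 queries or more in one session.
QUERY_THRESHOLD = 50


def is_site_crawl(session, protected_hosts, threshold=QUERY_THRESHOLD):
    """Flag a session aimed at one of our servers with a heavy query count."""
    # Restricting to our own hosts gives the rule a direction, so busy
    # outbound sessions (e.g. to a maps service) are not flagged.
    inbound = session["dst_host"] in protected_hosts
    return inbound and session["query_count"] >= threshold


def clients_by_alert(sessions, protected_hosts, threshold=QUERY_THRESHOLD):
    """Count flagged sessions per user agent string, most frequent first."""
    counts = Counter()
    for session in sessions:
        if is_site_crawl(session, protected_hosts, threshold):
            counts[session["client"]] += 1
    return counts.most_common()


# Hypothetical sample data, for illustration only.
sessions = [
    {"dst_host": "www.example.com", "query_count": 120, "client": "Googlebot/2.1"},
    {"dst_host": "www.example.com", "query_count": 75, "client": "Nessus"},
    {"dst_host": "www.example.com", "query_count": 3, "client": "Mozilla/5.0"},
    {"dst_host": "maps.google.com", "query_count": 200, "client": "Mozilla/5.0"},
    {"dst_host": "www.example.com", "query_count": 60, "client": "Googlebot/2.1"},
]

report = clients_by_alert(sessions, {"www.example.com"})
print(report)
```

The outbound session to maps.google.com is ignored even though its query count is high, which is the reason for scoping the rule by direction in Step 2.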


Another good use of this Investigator and Informer rule is to look for inbound web queries that have errors published in the error index.  In this case, the Informer report would look for 'error' in the query field.

In the where field you look for alert='site-crawl attempt' && error begins '4'

This will show you all of the 4xx errors, such as 404, which typically show up in a Nessus scan or similar.


If you look instead for error values beginning with '2', you will see the successful queries that found an object.
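
The prefix filter on the error values can be sketched the same way. Again this is only an illustration in Python, not NetWitness syntax: the alerted-session records and their "error" lists are hypothetical stand-ins for whatever the error index holds for sessions that already matched 'site-crawl attempt'.

```python
from collections import Counter


def errors_with_prefix(alerted_sessions, prefix):
    """Count error values that begin with the given prefix, e.g. '4' or '2'."""
    counts = Counter()
    for session in alerted_sessions:
        # A session may carry several error values, or none at all.
        for code in session.get("error", []):
            code = str(code)
            if code.startswith(prefix):
                counts[code] += 1
    return counts


# Hypothetical sessions that already matched alert='site-crawl attempt'.
alerted = [
    {"client": "Nessus", "error": ["404", "404", "403"]},
    {"client": "Googlebot/2.1", "error": ["200", "200", "404"]},
]

failed = errors_with_prefix(alerted, "4")     # probes that missed
succeeded = errors_with_prefix(alerted, "2")  # queries that found an object
print(dict(failed))
print(dict(succeeded))
```

A session dominated by values in the '4' group is more likely a scanner guessing at paths, while a crawler mirroring real content will mostly produce values in the '2' group.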