Davide Veneziano

Sample Contents: detect uncommon user-agent with the SAW

Discussion created by Davide Veneziano Employee on Dec 3, 2014

After presenting a sample piece of content that can be used in the reporting module against the NWDB to depict the different phase of the cyber kill chain (Sample Contents: cyber kill chain report), I want with this post to share the first of some advanced Hive examples that can be eventually used against the Security Analytics Warehouse.


As you may know, the reporting module offers a wizard to build your queries for the SAW but there are situations in which additional flexibility is required to achieve complex or advanced use cases. For those, Security Analytics offers its users whenever possible a way to interact in a so-called expert mode, provided the user has adequate skills to leverage those advanced functionalities.

I'm going now to present and detail the first of a series of examples written in the language (Hive) that the SAW is capable of understanding when dealing with it in Expert mode. As an additional disclaimer, this is intended for those with already a basic understanding of both SQL and how the Security Analytics WH stores and processes the data.

This first example is for identifying "uncommon" meta values by applying something which works pretty well in security which is the Least Frequency of Occurrence principle. Specifically we are looking for a user-agent which is showing up in a sample data set less than 100 times the average of all the other user agent strings.

The first step is to create an external table which is pointing to the location in our SAW which contains our dataset, specified by the LOCATION directive which contains the corresponding avro files (in this example everything from November 2014). We also need to define the schema, listing all the meta keys we will need for our query (it doesn't matter if the avro files contain more meta):

SET hive.mapred.supports.subdirectories=true;

SET mapred.input.dir.recursive=true;


CREATE EXTERNAL TABLE AllSessions ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'

STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'

OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'

LOCATION '/rsasoc/v1/sessions/data/2014/11'

TBLPROPERTIES ('avro.schema.literal'='






        {"name": "ip_src",          "type": ["string","null"], "default": "null" },

        {"name": "client",          "type": ["string","null"], "default": "null" },

        {"name": "device_class",    "type": ["string","null"], "default": "null" },

        {"name": "device_type",     "type": ["string","null"], "default": "null" }





When executed, the table "AllSessions" will be declared but nothing will happen unless we will run a query against it. Also note that the first two SET directives will be used to instruct the engine to work recursively with the subdirectories which is something that we need.

Whenever there is a use case which would require any calculation with numbers like in this case, we need to produce aggregated views (with count, sum, etc.) upon which we can eventually apply our formula to identify uncommon keys.

So with this first step we are going to create the temporary table that will store the calculated average:


CREATE TABLE avg_table( avg_value float);


Now we need to populate this new temporary table with data, coming from the AllSessions table defined above. Specifically, we are adding a single entry with the calculated average of events considering all the user-agent strings (contained in client meta key)


( SELECT COUNT(*) AS CountOfItems FROM AllSessions GROUP BY client ) a;


To apply our formula, we need now to calculate, for each user-agent, the number of events. To do so, we create an additional temporary table called "item_table" to store the count for each meta value (user-agent):



CREATE TABLE item_table( item string, value float);


It is now the time to populate the table with data, leveraging an approach similar to the one described above:



SELECT client,COUNT(*)

FROM AllSessions WHERE client IS NOT NULL GROUP BY client;


The "item_table" contains now for each row the user-agent and the number of events in which it has been found.

Finally it is the time to apply our formula and to identify uncommon user-agent strings by joining the two tables and returning only the items whose count is less than our threshold:


SELECT item_table.item as UserAgent,item_table.value as NumberOfEvents

FROM item_table JOIN avg_table WHERE item_table.value < (avg_table.avg_value/100) ORDER BY item_table.value DESC LIMIT 10;


The statement above will return the user-agent which are occurring 100 times less than the average of all the other strings.

The full sample has been attached to this post.


Disclaimer: please DO NOT consider what is described and attached to this post as RSA official content. As any other unofficial material, it has to be tested in a controlled environment first and impacts have to be evaluated carefullybefore being promoted in production. Also note the content I publish is usually intended to prove a concept rather than being fully working in any environments. As such, handle it with care.