Reporting: Warehouse DB Advanced Rules

Document created by RSA Information Design and Development on Nov 23, 2016Last modified by RSA Information Design and Development on Feb 10, 2017
Version 5Show Document
  • View in full screen mode
  

This topic provides examples of Warehouse Data source rules. You can define Warehouse DB rules using HIVE queries. You can define simple and advanced rules for the Warehouse Data source using the following modes:

  • Default Mode 
  • Expert Mode

Advanced rules are defined using complex HIVE queries created using the clauses DROP, CREATE, and so on. Unlike simple rules, we always insert the results into a table. For more information on Advanced HIVE query language, see HIVE language manual.

The following examples illustrate advanced rules in the expert mode:

  • Hourly, daily, weekly, and monthly report
  • Table partition based on location report
  • Join logs and sessions based on unique_id report
  • List report
  • Parameterized report
  • Partition based table with multiple locations
  • Automated partition using custom function (10.5.1 onwards) 

General Syntax of an Advanced Rule

The following figure shows how to define an advanced query.

ExpertRule-TimeVar-input_hourly_103SP3.jpg

The following syntax is an example of an advanced query:

DROP Table IF EXISTS sessions21022014;

CREATE External TABLE sessions21022014

ROW FORMAT SERDE ‘org.apache.hadoop.hive.serde2.avro.AvroSerDe’

STORED AS INPUTFORMAT

'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'

LOCATION '/RSA/rsasoc/v1/sessions/data/2013/12/2’

TBLPROPERTIES('avro.schema.literal'='
{

"type":"record";

"name":"nextgen";

"fields":
[
{"name":"time",  "type":["long", "null"], "default":"null"},

{"name":"threat_category",  "type":["string", "null"], "default":"null"},

{"name":"ip_src",  "type":["string", "null"], "default":"null"},

{"name":"device_class",  "type":["string", "null"], "default":"null"}
]
'};

set mapred.input.dir.recursive=true;
set hive.mapred.supports.subdirectories=true;

select  from_unixtime(time), threat_category, ip.src from  time_variable where threat_category is not NULL and time >= ${report_starttime} and time <= ${report_endtime};

Note: Reporting Engine treats a line beginning with <hyphen> <hyphen> as a comment in Expert Warehouse Rule.
For example,
set mapred.input.dir.recursive=true;
-- This is an Expert comment
set hive.mapred.supports.subdirectories=true;

The general syntax of an advanced query is as explained below:

  1. Drop and create an external table, and then format the row:
    Firstly, we drop the table, if the table already exists and create an external table sessions21022014
    DROP TABLE IF EXISTS sessions21022014
    CREATE EXTERNAL TABLE sessions21022014

    Note:  You must create an external table only if you are using an other table. For example, if you are using an other table apart fromsessions21022014 then you must drop the table and create an external table.

    Then, specify the row format as Avro.SerDe interface to instruct HIVE as to how a record is to be processed. Avro.SerDe allows you to read or write Avro data as HIVE tables and store them as input format and output format.
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.Avro.SerDe'
    STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'

  1. Specify the HDFS location:
    Secondly, you must specify the HDFS location '/RSA/rsasoc/v1/sessions/data/2013/12/2' from where the data is queried before executing the HIVE statements.The location parameter specifies the data to be fetched depending on the date input provided. This is a variable parameter hence you can fetch values depending on the date entered.
  2. Define the table schema:
    Thirdly, you define the table schema by defining columns with a specific data type and default value as 'null'.
    TBLPROPERTIES('avro.schema.literal'='
    {"type":"record";
    "name":"nextgen";
    "fields":
    [
    {"name":"ip_src",  "type":["string", "null"], "default":"null"}
    ]
    '};
  3. Import data from directory which contains sub directories:
    Then, you must enable HIVE to recursively scan all sub-directories and fetch all the data from all sub-directories.
    set mapred.input.dir.recursive=true;
    set hive.mapred.supports.subdirectories=true;
  4. Fetch data from the HIVE table:
    Once you execute all the above statements, you can query the database with the HIVE query select clause to fetch the data from the HIVE table.

Hourly, Daily, Weekly, and Monthly Report

In these example rules, you can create various reports for December 02, 2013 (as in the below figure). The date variable in the LOCATION statement can be altered, depending on which you can create an hourly, daily, weekly, and monthly report.

Hourly Report

In this example rule, you can create an houly report for December 02, 2013. The LOCATION statement can be altered to generate an hourly report.

LOCATION '/RSA/rsasoc/v1/sessions/data/2013/12/2' - the date input (2013/12/2) indicates year/month/day. The entire data for 02 December, 2013 is retrieved using this location statement. 

Expert_Rule_hourly_report_103SP3.png

The result set of this query would be an hourly report.

Daily Report

In this example rule, you can create a daily report for December 2013. The LOCATION statement can be altered to generate a daily report.

LOCATION '/RSA/rsasoc/v1/sessions/data/2013/12' - the date input (2013/12) indicates year/month. The entire data for December, 2013 is retrieved using this location statement.

Expert_Rule_daily_report_103SP3.png

The resultset of this query would be a daily report.

Weekly Report

In this example rule, you can create a weekly report for December 2013. The LOCATION statement can be altered to generate a weekly report.

LOCATION '/RSA/rsasoc/v1/sessions/data/2013/12' - the date input (2013/12) indicates year/month. The entire data for December, 2013 is retrieved using this location statement.

Expert_Rule_weekly_report_103SP3.png

The result set of this query would be a weekly report.

Monthly Report

In this example rule, you can create a monthly report for the year 2013. The LOCATION statement can be altered to generate a monthly report.

LOCATION '/RSA/rsasoc/v1/sessions/data/2013' - the date input (2013) indicates year. The entire data for the year 2013 is retrieved using this location statement.

Expert_Rule_monthly_report_103SP3.png

The result set of this query would be a monthly report.

For more information on LOCATION definition, see Specify the HDFS location in the General Syntax of an Advanced Rule section.

You must perform the following steps in sequence to view the resultset of an advanced rule:

  1. Define an Advanced Rule
  2. Add an advanced rule to a Report
  3. Schedule a Report
  4. View a scheduled Report

The following figure shows how to define an advanced rule.

ExpertRule-TimeVar-input_103SP3.jpg

The following figure shows how to add an advanced rule to a report (For example, AllEventCategories).

add_expert_rule_to_report.png

The following figure shows how to schedule a daily report.

Expert_Rule_daily_report_103SP3.png

If you want to generate a report for a specific time range, you need to manually define the time range in the query using the following two variables:

${report_starttime} - The starting time of the range in seconds.
${report_endtime} - The ending time of the range in seconds.

For example, SELECT from_unixtime(time), threat_category, ip.src FROM time_variable WHERE threat_category is not NULL AND time >= ${report_starttime} AND time <= ${report_endtime};

The following figure shows the result set of scheduling a daily report.

TimeVar-output_103SP3.JPG

Table Partition Based on Location Report

In this example rule, you can create a table partition based on location. Each table can have one or more partition keys which determines how the data is stored. For example, a country_dst of type STRING and an ip_src of type STRING. Each unique value of the partition keys defines a partition of the table.

In the example provided, we execute a HIVE query to fetch destination country and IP address of source from the sessions05032014 table and group the result set by these fields.

This rule provides information about the table created, row formatted, location (directory path) for avro data files in Warehouse, and returns a result set as per the HIVE query to indicate that the query returned a result set. For more information on these statements, see the General Syntax of an Advanced Rule section.

ExpertRule-GroupByLocation_103SP3.jpg

The following figure shows the result set of creating a table partition based on location report.

TablePartition_103SP3.jpg

Join Logs and Sessions Based on unique_id Report

In this example rule, you can create a rule to join logs and sessions table to fetch unique_id, IP address of source and destination, and packet ID based on unique_id.

In the example provided, we execute a HIVE query to fetch certain fields from both the sessions_table and logs_table by performing a join based on the 'unique_id' field.

This rule provides information about the table created, row formatted, location (directory path) for avro data files in Warehouse, and returns a result set as per the HIVE query to indicate that the query returned a result set. For more information on these statements, see the General Syntax of an Advanced Rule section.

ExpertRule - Join_103SP3.png

The following figure shows the result set of joining logs and sessions table based on unique_id.

Join-output_103SP3.jpg

List Report

In this example rule, you can create a List report to fetch IP address of source and destination, and device type from the lists_test table where device type is not null and IP address of source is fetched from the appropriate event list.

This rule provides information about the table created, row formatted, location (directory path) for avro data files in Warehouse, and returns a result set as per the HIVE query to indicate that the query returned a result set. For more information on these statements, see the General Syntax of an Advanced Rule section.

ExpertRule-Lists_103SP3.JPG

The following figure shows the result set of executing a list report.

LISTS-output_103SP3.jpg

Parameterized Report

In this example rule, you can create a rule to fetch IP addresses of source and destination, and device type from the runtime_variable table based on the specified run time variable ${EnterIPDestination}. At run time, you are prompted to enter a value for the IP address of destination ip_dst. Based on the value entered, the result set is displayed.

This rule provides information about the table created, row formatted, location (directory path) for avro data files in Warehouse, and returns a result set as per the HIVE query to indicate that the query returned a result set. For more information on these statements, see the General Syntax of an Advanced Rule section.

ExpertRule-RunTimeVar_103SP3.JPG

The following figure shows the result set of executing a parameterized report.

Run_Time_Variable-output_103SP3.jpg

Partition Based Table with Multiple Locations

The following is an example of partition based table with multiple locations:

set mapred.input.dir.recursive=true;
set hive.mapred.supports.subdirectories=true;
DROP TABLE IF EXISTS AVRO_COUNT;
CREATE EXTERNAL TABLE AVRO_COUNT
PARTITIONED BY (partition_id int)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES (
'avro.schema.literal'='{
"name": "my_record", "type": "record",
"fields": [
{"name":"sessionid", "type":["null", "long"], "default" : null},
{"name":"time", "type":["null", "long"], "default" : null}
]}'
)
STORED AS
INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat';
ALTER TABLE AVRO_COUNT ADD PARTITION(partition_id=0) LOCATION '/rsasoc/v1/sessions/data/2015/07/22/8';
ALTER TABLE AVRO_COUNT ADD PARTITION(partition_id=1) LOCATION '/rsasoc/v1/sessions/data/2015/07/22/9';
ALTER TABLE AVRO_COUNT ADD PARTITION(partition_id=2) LOCATION '/rsasoc/v1/sessions/data/2015/07/22/10/';
ALTER TABLE AVRO_COUNT ADD PARTITION(partition_id=3) LOCATION '/rsasoc/v1/sessions/data/2015/07/22/11/';
ALTER TABLE AVRO_COUNT ADD PARTITION(partition_id=4) LOCATION '/rsasoc/v1/sessions/data/2015/07/22/12/';
SELECT COUNT(*) as TOTAL FROM AVRO_COUNT WHERE time >= ${report_starttime} AND time
<= ${report_endtime};

The partition based table with multiple location is as explained below:

  1. Enable HIVE to recursively scan all sub-directories and read all the data from the sub-directories.
    set mapred.input.dir.recursive=true;
    set hive.mapred.supports.subdirectories=true;

  2. Drop and create an external table, and then format the rows:
    DROP TABLE IF EXISTS AVRO_COUNT;
    CREATE EXTERNAL TABLE AVRO_COUNT
    PARTITIONED BY (partition_id int)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
    WITH SERDEPROPERTIES (
    'avro.schema.literal'='{
    "name": "my_record", "type": "record",
    "fields": [
    {"name":"sessionid", "type":["null", "long"], "default" : null},
    {"name":"time", "type":["null", "long"], "default" : null}
    ]}'
    )
    STORED AS
    INPUTFORMAT
    'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
    OUTPUTFORMAT
    'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat';

Note:  You must create an external table only if you are using any other table. For example, if you are using any other table apart from AVRO_COUNT then you must drop the table and create an external table.

Note: Points to remember when you create a table:
    - Dropping a 'non-external' table deletes the data. 
    - The  table is partitioned on a single column called partition_id and this is the standard column for Reporting       Engine.
    - The default value of any column is null as the AVRO file may not contain the specified column.
    - The column names should be in the lowercase as HIVE is case insensitive but AVRO is case sensitive.
    - You must specify avro.schema.literal in the SERDEPROPERTIES .

For more information on the rule syntax, please refer to Apache HIVE. 

  1. Add partitions:
    Once you define a table, you must specify the HDFS locations from where the data needs to be queried before you execute the HIVE statements.The location parameter specifies the data to be fetched depending on the specified date. The data is spread across multiple locations or directories in HDFS. For each location you need to add a partition with unique values assigned to the partition column. The locations can be any directory in the HDFS
    ALTER TABLE AVRO_COUNT ADD PARTITION(partition_id=0) LOCATION '/rsasoc/v1/sessions/data/2015/07/22/8';
    ALTER TABLE AVRO_COUNT ADD PARTITION(partition_id=1) LOCATION '/rsasoc/v1/sessions/data/2015/07/22/9';
    ALTER TABLE AVRO_COUNT ADD PARTITION(partition_id=2) LOCATION '/rsasoc/v1/sessions/data/2015/07/22/10/';
    ALTER TABLE AVRO_COUNT ADD PARTITION(partition_id=3) LOCATION '/rsasoc/v1/sessions/data/2015/07/22/11/';
    ALTER TABLE AVRO_COUNT ADD PARTITION(partition_id=4) LOCATION '/rsasoc/v1/sessions/data/2015/07/22/12/';

Note: HIVE reads each file in these locations as AVRO. Incase if there is a non-AVRO file available in one of these locations then the query may fail.

  1. Run the query
    SELECT COUNT(*) as TOTAL FROM AVRO_COUNT WHERE time >= ${report_starttime} AND time
    <= ${report_endtime};

    When a table is created, you can execute specific queries to filter the data. For example, after you create the table you can filter the data as shown in the below examples: 
    Sessions with a specific Source IP Address:
    SELECT * FROM AVRO_COUNT WHERE time >= ${report_starttime} AND time <= ${report_endtime}  AND ip_src = '127.0.0.1';
    Group by based on user destination:
    SELECT * FROM AVRO_COUNT WHERE time >= ${report_starttime} AND time <= ${report_endtime}  GROUP BY usr_dst;

Automated Partition using Custom function

In 10.5.1, you can use the custom function to automate the addition of partitions to a user defined table in the expert mode.

General syntax

RE WH CUSTOM ADDPARTITIONS(table, namespace, rollup, [starttime, endtime])

The following table describes the custom function syntax:

                      
S.NoNameDescription
1tableThe table name for which the partition has to be added.
2namespaceThe namespace can be sessions or logs.
3rollup

This value determines the level of directory path to be included in partitions. The
value can be HOUR, DAY, or MINUTE. If Warehouse Connector is configured for Day rollup, setting this value as HOUR produces ZERO results. The number and location of each partition is based on time range used to run the rule and the rollup value.

4(Optional)  starttime,  endtime

To generate partitions for a specific time range other than the time range mentioned in the rule, you must specify the starttime and endtime in Epoch Seconds.

Note: Expressions are not supported for the starttime and endtime.

The custom function is invoked when Reporting Engine executes the rule either during test rule or scheduled report. While running a expert rule, whenever Reporting Engine identifies the function declaration, it extracts the required
arguments and insert number of ADD PARTITION HiveQL statements and executes them on the Hive Server.

The location and directory structure is determined by the argument passed in the rule and the Hive datasource configuration
in Reporting Engine. The number of partitions depends on the rollup specified and the time range used while executing the rule. For example, with the rollup as HOUR and the time range as PAST 2 Days results in 48 partitions for 48 Hours while
with the rollup as DAY, Reporting Engine creates 2 partitions, one for each day.

The partition query is generated by the Syntax Template as set in Reporting Engine's Hive Configuration attribute AlterTableTemplate.

Note: By default, this function starts adding partitions to a table with partition id from 0 to N-1. Hence this requires that the table must be partitioned by single integer column named partition id.

The following is an example of automated partition using custom function:
set mapred.input.dir.recursive=true;
set hive.mapred.supports.subdirectories=true;
DROP TABLE IF EXISTS AVRO_COUNT;

CREATE EXTERNAL TABLE AVRO_COUNT
PARTITIONED BY (partition_id int)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES (
'avro.schema.literal'='{ 
  "name": "my_record", "type": "record", 
  "fields": [ 
    {"name":"sessionid", "type":["null", "long"], "default" : null} 
   ,{"name":"time", "type":[ "null" , "long"], "default" : null} 
   ,{"name":"unique_id", "type":["null", "string"], "default" : null} 
   ]}'
)
STORED AS 
INPUTFORMAT 
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat';

RE_WH_CUSTOM_ADDPARTITIONS(AVRO_COUNT, 'sessions', 'DAY');
SELECT COUNT(*) as TotalSessions FROM AVRO_COUNT
WHERE time >= ${report_starttime} AND time <= ${report_endtime};

 

Creating Custom Tables Report

In 10.6.1, you can use and create Custom Tables on the Hive Server. Reporting Engine supports running queries on user defined tables and the ability to create a new table from a Single Rule output. When this feature is enabled in the Warehouse Rule Builder UI, user can see a list of custom tables available in Hive Server.

To enable this feature set customTablesEnabled to TRUE by navigating to Reporting Engine -> Explore ->Hive Config.

Creating Custom Table from Regular Rules

To schedule a report which contains a single SAW rule, a new text input with a Warehouse CTAS Name is added.The user can now specify a Custom Table name that will be created out of the output of the rule in Report.

Note: This feature is available only if the Report contains a single SAW rule on the Schedule page. Otherwise, this option is hidden.

The process to use the feature is explained below:

  1. Create a rule to filter with data in SAW.
  2. Create a Report with the above rule.

  3. Create a Schedule and enter the CTAS Table Name.

  4. Run the Report and Reporting Engine will create the Result Summary as below for the Schedule.

  5. On the next schema refresh or restart of Reporting Engine, the CTAS Table is listed.

You are here
Table of Contents > Reporting Module References > Rule References > Warehouse Database Rule Definition Modes > Warehouse DB Advanced Rules

Attachments

    Outcomes