Core DB: Query Syntax

Document created by RSA Information Design and Development on Jul 21, 2016
Version 1Show Document
  • View in full screen mode
 

This topic covers the database query syntax. There are three main mechanisms for performing queries in the database, the query, values, and msearch calls on the /sdk folder on each Core service.

The query call returns meta items from the meta database, possibly using the index for fast retrieval.

The values call returns groups of unique meta values sorted by some criteria. It is optimized to return a subset of the unique values sorted by an aggregate function such as count.

The msearch call takes text search terms as it's input, and returns matching sessions that match the search terms.  It can search within indexes, meta, raw packets, or raw logs.

Query Syntax

The query message has the following syntax:

 query-params = size-param, space, query-param, {space, start-meta-param}, {space, end-meta-param};
size-param = "size=", ? integer between 0 and 1,677,721 ? ;
query-param = "query=", query-string ;
start-meta-param = "id1=", metaid ;
end-meta-param = "id2=", metaid ;
metaid = ? any meta ID from the meta database ? ;

The id1, id2, and size parameters form a paging mechanism for returning a large number of results from the database. Their usage mostly benefits developers who are writing applications directly against the Security Analytics Core database. Normally, results are returned in the order of oldest to newest data (higher meta IDs are always more recent). In order to return results from most recent to oldest, reverse the IDs such that id1 is larger than id2. This has a slight performance penalty, because the where clause must be completely evaluated before processing in reverse order can begin.

When size is left off or set to zero, the system streams back all results without paging. For the RESTful interface, this results in the full response to be returned with chunked-encoding. The native protocol returns the results over multiple messages.

The query parameter is a query command string with its own Security Analytics specific syntax:

 query-string = select-clause {, where-clause} {, group-by-clause {, order-by-clause } } ;
select-clause = "select ", ( "*" | meta-or-aggregate {, meta-or-aggregate} ) ;
where-clause = " where ", { where-criteria } ;
meta-or-aggregate = meta_key | aggregate_func, "(", meta_key, ")" ;
aggregate_func = "sum" | "count" | "min" | "max" | "avg" | "distinct" | "first" | "last" | "len" | "countdistinct" ;
group-by-clause = " group by ", meta-key-list
meta-key-list = meta_key {, meta-key-list}
order-by-clause = " order by ", order-by-column
order-by-column = meta-or-aggregate { "asc" | "desc" } {, order-by-column}

The select clause allows you to specify either * to return all the meta in all the sessions that match the where clause, or a set of  meta field names and aggregate functions to select a subset of the meta with each session.

The aggregate functions have the following effect on the query result set.

                                                   
FunctionResult
sumAdd all meta values together; only works on numbers
countThe total number of meta fields that would have been returned
minThe minimum value seen
maxThe maximum value seen
avgThe average value for the number
distinctReturns a list of all unique values seen
countdistinctReturns the number of unique values seen.  Countdistinct is equivalent to the number of metas that would have been returned by the distinct function.
firstReturns the first value seen
lastReturns the last value seen
lenConverts all field values to a UInt32 length instead of returning the actual value. This length is the number of bytes to store the actual value, not the length of the structure stored in the meta database. For example, the word "NetWitness" returns a length of 10. All IPv4 fields, like ip.src, return 4 bytes.

Where Clauses

The where clause is a filter specification that allows you to select sessions out of the collection by using the index.

Syntax:

 where-criteria = criteria-or-group, { space, logical-op, space, criteria-or-group } ;
criteria-or-group = criteria | group ;
criteria = meta-key, ( unary-op | binary-op meta-value-ranges ) ;
group = ["~"], "(" where-clause ")" ;
logical-op = "&&" | "||" ;
unary-op = "exists" | "!exists" ;
binary-op = "=" | "!=" | "<" | ">" | ">=" | "<=" | "begins" | "contains" | "ends" | "regex" ;
meta-value-ranges = meta-value-range, { ",", meta-value-range } ;
meta-value-range = (meta-value | "l" ), [ "-", ( meta-value | "u" ) ] ;
meta-value = number | ( '"' text '"' ) | ip-address | mac-address | ( '"' date-time '"' ) ;

When specifying rule criteria, the meta-value part of the clause is expected to match the type of the meta specified by the meta-key. For example, if the key is ip.src the meta-value should be an IPv4 address.

Query Operators

The following table describes the function of each operator.

                                                               
OperatorFunction
=Match sessions containing the meta value exactly. If a range of values is specified, any of the values is considered a match.
!=Matches all sessions that would not match the same clause as if it were written with the = operator.
<For numeric values, matches sessions containing meta with the numeric value less than the right side. If the right side is a range, the first value in the range is considered. If multiple ranges are specified, the behavior is undefined. For text metas, a lexicographical comparison is performed.
<=Same behavior as <, but sessions containing meta that equals the value exactly are also considered matches.
>Similar to the < operator, but matches sessions where the numeric value is greater than the right side. If the right side is a range, the last value in the range is considered for the comparison.
>=Same behavior as >, but sessions containing meta that equals the value exactly are also considered matches.
beginsMatches sessions that contain text meta value that starts with the same characters as the right side.
endsMatches sessions that contain text meta that ends with the same characters as the right side.
containsMatches sessions that contain text meta that contains the substring given on the right side.
regexMatches sessions that contain text meta that matches the regex given on the right side. The regex parsing is handled by boost::regex.
existsMatches sessions that contain any meta value with the given meta key.
!existsMatches sessions that do not contain any meta value with the given meta key.
lengthMatches sessions that contain text meta values of a certain length.  The expression on the right side must be a non-negative number.

Text Values

The system expects quoted text values. Although unquoted strings may work, the values expressed may be ambiguously interpreted as numbers or dates.

It is also important to quote any text value that may contain - so that it is not interpreted as a range.

IP Addresses

IP addresses can be expressed using standard text representations for IPv4 and IPv6 addresses. In addition, the query can use https://en.wikipedia.org/wiki/Classless_Inter-Domain_Routing#CIDR_notationCIDR notation to express a range of addresses. If CIDR notation is used, it is expanded to the equivalent value range.

MAC Addresses

A MAC address can be specified using standard MAC address notation: aa:bb:cc:dd:ee:ff

Date and Time Expressions

In Security Analytics Core, dates are represented using Unix epoch time, which is the number of seconds since Jan 1, 1970 UTC. In queries, you can express the time as this number of seconds, or you can use the string representation. The string representation for the date and time is YYYY-mmm-DD HH:MM:SS. A three-letter abbreviation represents the month. You can also express the Month as a two-digit number, 01–12.

All times specified in queries are expected to be in UTC.

Special Range Values

Ranges are normally expressed with the syntax "smallest" - "largest", but there are some special placeholder values you can use in range expressions. You can use the letter l to represent the lower-bound of the all meta values as the start of the range, and u to represent the upper bound. The bounds are determined by looking at the smallest or largest meta value found in the index out of all the meta values that have already entered the index.

Note: If you use the l or u tag, it should be unquoted.

For example, the expression time = "2014-may-20 11:57:00" - u would match all time from that 2014-may-20 11:57:00 to the most recent time found in the collection.

Notice that it is easy to confuse a range expression with a text string. Make sure that text values that contain - are quoted, and that hyphens within range expressions are not within quoted text.

Group By Clause (since 10.5)

The query API has the ability to generate aggregate groups from the results of a query call. This is done using a GROUP BY clause on the query.  When GROUP BY is specified, the result set for the query is subdivided into groups.  Each group of results is uniquely identified by the meta values indicated in the group by clause. 

For example, consider the query select count(ip.dst). This query returns a count of all ip.dst metas in the database. However, if you add a group by clause, like this: select count(ip.dst) group by ip.src, the query returns a count of the ip.dst metas found for each unique ip.src.

As of Security Analytics version 10.5, you can utilize up to 6 meta fields in a group by clause.

The group by clause shares some of the same functionality as the values call, but it offers significantly more advanced groups at the expense of longer query times. Producing the results of a grouped query involves reading the meta from the meta database for all sessions that match the WHERE clause, while a values call can produce its aggregates by reading the index only.

The contents of each group returned by the query are defined by the select clause. The select clause can contain any of the aggregate functions or meta fields selected. If multiple aggregates are selected, the result of the aggregate function is defined for each group. If non-aggregate fields are selected, the meta fields are returned in batches for each group.

The result set of a group by query is encoded with the following rules:

  1. All metas associated with a group are delivered with the same group number.
  2. The first metas returned to the group identify the group key. For example, if the group by clause specifies group by ip.src, then the first meta of each group will be an ip.src.
  3. The normal, non aggregate metas are returned after the group key, but they all will have the same group number as the group key metas.
  4. The aggregate result meta fields for each group are returned next.
  5. All fields within a group are returned together. Different group results will not be interleaved.

If one of the GROUP BY metas is missing from one of the sessions matched by the where clause, that meta field is treated as a NULL for the purposes of that group. When the results for that group are returned, the NULL-valued parts of the group key will be omitted from the group's results, since the database has no concept of NULL.

The semantics of a GROUP BY query differ from a SQL-like database in terms of what meta fields are returned. SQL databases require you to explicitly select the group by columns in the select clause if you want them to be returned in the result set. The Security Analytics Core database always implicitly returns the group columns first.

A query with a GROUP BY clause honors the result set size parameter, if one is provided. However, due to the nature of the grouping, it puts an additional burden on the caller to page and reform groups if a fixed-size result set is requested. For this reason, you should not specify an explicit result size when making a group by call. By not specifying an explicit size, the entire result set will be delivered as partial results.

The following table describes the database honors configuration parameters that limit I/O or memory impact of a group by query.

                   
ParameterFunction
/sdk/config/max.query.groupsThis is the limit on how many groups can be held in memory to calculate aggregates. This parameter allows you to limit the overall memory usage of the query.
/sdk/config/max.where.clause.sessionsThis is the limit on how many sessions from the where clause can be processed in a query. This parameter allows you to set a limit on the number of sessions that have to be read from the meta and session databases to resolve a query.

Order By Clause (since 10.5)

An order by clause can be added to a query than contains a group by clause. The order by clause causes the set of grouped results to be returned in sorted order.

An order by consists of a set of items to sort by, in ascending or descending order. Sorting can be performed on any data field that will be returned in the result set. This includes meta specified by the select clause, aggregate function results specified by the select clause, or group by meta fields.

The order by clause can sort over many columns. There is no limit on the number of order-by columns allowed in the query, but a practical limit exists in that each of the order-by columns must refer to something returned by the select clause or group by clause.  The multiple column sort is imposed lexicographically, meaning that if two groups have equal values for the first column, then they are sorted by the second columns. If they are equal in the second column, they are sorted by the third column, and so on for however many order by columns are provided.

The Security Analytics Core database is unique in that the groups of results returned by a query may each have many values for a selection.  For example, it is possible to select all meta that matches a meta type and organize it into groups, and it is possible to use the distinct() function to return groups of distinct meta values. If an order by clause references one of the fields in the group that multiple values, the sorting order is applied as follows:

  1. Within each group, the fields with multiple matching values are ordered by the ordering clause
  2. All the groups are sorted by comparing the first occurrence of the ordered field found within each group

The order by clause is only available in queries that have a group by clause, since groups are required to organize the meta fields into distinct records. If you wish to sort an arbitrary query as if there were no grouping applied, use group by sessionid. This ensures that results are returned in groups of distinct sessions or events.

Group by clauses are naturally returned in ascending group key order, but an order by clause can be used to return groups in a different order.

If an order by columns does not specify asc or desc, the default ordering is ascending.

Examples:

 select countdistinct(ip.dst) GROUP BY ip.src ORDER BY countdistinct(ip.dst)
select countdistinct(ip.dst) GROUP BY ip.src ORDER BY countdistinct(ip.dst) desc
select countdistinct(ip.dst),sum(size) GROUP BY ip.src ORDER BY sum(size) desc, countdistinct(ip.dst)
select sum(size) GROUP BY ip.src, ip.dst ORDER BY ip.dst desc
select user.dst,time GROUP BY sessionid ORDER BY user.dst
select * GROUP BY sessionid ORDER BY time

Values call

The index provides a low-level values function to access the unique meta values that have been stored in the index. This function allows developers to perform more advanced operations on groups of unique meta values.

Values call parameter syntax:

 values-params = field-name-param, space, where-param, space, size-param, {space, flags-param} {space, start-meta-param}, {space, end-meta-param}, {space, threshold-param}, {space, aggregate-func-param}, {space, aggregate-field-param}, {space, min-param}, {space, max-param} ;
field-name-param = "fieldName=", meta-key ;
where-param = "where=", where-clause ;
size-param = "size=", ? integer between 1 and 1,677,721 ? ;
start-meta-param = ? same as query message ?
end-meta-param = ? same as query message ?
flags-param = "flags=", {values-flag, {"," values-flag} } ;
values-flag = "sessions" | "size" | "packets" | "sort-total" | "sort-value" | "order-ascending" | "order-descending" ;
threshold-flag = "threshold=", ? non-negative integer ? ;
aggregate-func-param = "aggregateFunction=", { aggregate-func-flag } ;
aggregate-func-flag = "count" | "sum" ;
aggregate-field-param = "aggregateFieldName=", meta-key ;
min-param = "min=", meta-value ;
max-param = "max=", meta-value ;

The values call provides the function of returning a set of unique meta values for a given meta key. For each unique value, the values call can provide an aggregate total count. The function used to generate the total is controlled by the flags parameter.

Parameters

The following table describes the function of each parameter.

                                               
ParameterFunction
fieldNameThis is the meta key name for which you retrieve unique values. For example, if fieldName is ip.src, this function returns the unique source IP values in the collection.
whereThis is a where clause which filters the set of sessions for which the unique values are returned. For example, if the fieldName is ip.src, and the where clause is ip.src = 192.168.0.0/16, only values in the range of 192.168.0.0 to 192.168.255.255 are returned. For information on the where clause syntax, see Where Clauses.
sizeThe size of the set of unique values to return. This function is optimized to return a small subset of the possible unique values in the database.
id1, id2These optional parameters limit the scope of the search for unique values to a specific region of the meta database and the index. Setting the id1 and id2 parameters to a limited range of the meta database is very important to running searches quickly on large collections.
flagsFlags control how the values are sorted and totaled. Flags are described in the following Values Flags section.
thresholdSetting the threshold parameter allows the values call to short-cut collection of the total associated with each value once the threshold is reached. By providing a threshold, the caller can reduce the amount of index and meta items that must be retrieved from the database. If the threshold parameter is omitted or set to 0, this optimization is not used.
aggregateFunctionOptional parameter used to change the default behavior from counting sessions, packets, or size to counting or summing the numeric field defined by aggregateFieldName. Both parameters must be specified when either is defined. Pass either sum or count to specify which behavior to perform.
aggregateFieldNameThe meta field to perform the aggregateFunction on. Both aggregateFunction and aggregateFieldName parameters must be specified when the aggregate flag is set. Performing a values call using one of the aggregate functions can be significantly slower than a values call that collects totals of sessions, packets, or size. The reason for this is that each session that matches the where clause must be retrieved from the meta database. This scan causes a large portion of the query to be I/O bound on the meta DB volumes. The time taken to run an aggregate values call is linearly proportional to the number of sessions that match the where clause.
min, maxThe minimum and maximum value that should be returned from the call. These parameters are used to iterate (or page) over an extremely large number of values, typically more values than could be returned from a single call. Primarily used in conjuction with the flags sort-value,sort-ascending such that the highest value returned would be used in a subsequent call as the min parameter value. The values are exclusive. If min=“rsa" was specified and rsa was a valid value, rsa would not be returned, but the next highest value would be returned.

Values Flags

The flags parameter controls how the values call operates. There are three groups of flags that correspond to the different modes of operation as shown in the following table.

                       
FlagDescription
sessions, size, packetsThe values call allows you to specify one of these flags to determine how the total for each value is calculated. If the flag is sessions, the values call returns a count of sessions that contain each value. If the flag is size, the values call totals the size of all sessions that contain each unique value, and reports the total size for each unique value. If the flag is packets, the values call totals the number of packets in all sessions that contain each unique value, and then reports that total for each unique value.
sort-total, sort-valueThese flags control how results are sorted. If the flag is sort-total, the result set is sorted in order of the totals collected. If the flag is sort-value, the results are returned in order of the sorting order of the values.
order-ascending, order-descendingThese flags control the sort order of the result set. For example, if sorting by total in descending order, the values with the greatest total are returned first.

Values Call Example

The values call is used extensively by the Navigation view in Security Analytics. The default view generates calls that look like this:

 /sdk/values id1=198564099173 id2=1542925695937 size=20 flags=sessions,sort-total,order-descending threshold=100000 fieldName=ip.src where="time=\"2014-May-20 13:12:00\"-\"2014-May-21 13:11:59\"" 

In this example, the Navigation view requests unique values for ip.src. It requests unique values of ip.src in the time range given. It asks for the count of sessions that match each ip.src, and the results are the top 20 ip.src values when sorted by the number total count of sessions in descending order. In addition, the Navigation view has a meta ID range in order to provide an optimization hint to the query engine.

Msearch call

The index provides a low-level msearch function to perform text searches against all meta types. This type of search does not require users to define their queries in terms of known meta types. Instead, it searches all parts of the database for matches. Msearch is used by the Events view text search. See the Filter and Search Results in the Events View topic in the Investigation and Malware Analysis Guide for detail on the accepted search forms and examples.

msearch parameters:

 msearch-params = search-param, {space, where-param}, {space, limit-param}, {space, flags-param};
search-param = "search=", ? free-form search string ? ;
where-param = "where=", ? optional where clause ? ;
limit-param = "limit=", ? optional session scan limit ? ;
flags = "flags=", {msearch-flag, {"," msearch-flag} };
msearch-flag = "sp" | "sm" | "si" | "ci" | "regex" ;

The msearch algorithm works as follows:

  1. A set of sessions is identified from the index by finding the intersection of three sets:
    • (Set 1) All sessions in the database
    • (Set 2) Sessions that match the where clause parameter
    • (Set 3) If the si flag is specified, sessions that indexed values that match the search string parameter.
  2. If the search specifies the sm parameter, all meta from the set of sessions identified in step 1 is read and scanned to see if it matches the search string parameter. The meta will be read from the service nearest to the point where the search was executed. For example, if the search is performed on a Broker, the meta may be read from the Concentrator nearest to the broker, but if the search is performed on an Archiver the meta will be read from the Archiver itself.
  3. If the search specifies the sp parameter, all raw packet or log entries from the set of sessions identified in step 1 is read and scanned to see if it matches the search string parameter. The packets will be read from the service nearest to the point where the search was executed. For example, if the search is performed on a Concentrator, the packet data will be read from the Decoder, but if the search is performed on an Archiver, the packet data will be read from the Archiver itself.
  4. Matches from step 2 and step 3 are returned as they are found, up to the point where the limit parameter is reached. Limit specifies the maximum number of sessions for which meta and packet data will be scanned. If limit is not specified, the entire set of sessions determined in step 1 is scanned.

Msearch Flags

                               
FlagDescription
spScans raw packet data
smScans all meta data
siDoes index lookups for all search parameters before scanning meta
ciPerforms a case insensitive search. Returned results are case-preserving.
regexTreats the search parameter as a regular expression. Only a single regular expression can be specified, but the regular expression may be arbitrarily complex.

Msearch Index Search Mode

Using the index search mode, specified by using the si flag, causes results to be returned significantly faster than any other mode. The main limitation of this mode is that it only returns matches on text terms that match value-indexed meta values.

  • The si parameter must be combined with the sm flag. The si parameter implies the search only matches indexed meta. 
  • The si parameter can be used with regex searches, however only text indexed values will match. IP addresses and numbers will not match the regex.

Msearch Tips

  • Always use the where clause to specify a time range for the search.  
  • To search for IP address ranges, specify them in the where clause.
  • Use the limit parameter when not using the index search mode. Without it, there will be an extremely large amount of data read by the meta and packet databases.

Stored Procedures

The query and values calls provide more low-level search functionality. For more advanced use cases, server-side stored procedures exist.

Use of Quotes in Query Syntax

The query parser does not care whether you use single or double quotes within a query statement. A single- or double-quoted value is treated as text meta.

The query parser attempts to make sense of whatever you put in the statement. It is not very strict about what it will accept.

For example:

reference.id=4752

This clause identifies sessions that have a reference.id meta value that has a _numeric_ value of 4752.

reference.id='4752' or reference.id="4752"

This clause identifies sessions that have a reference.id meta value that has a _string_ value of "4752".

However, the query engine implicitly compares numbers and strings that look like numbers as equal when the values are semantically the same. So it works with either syntax.

For most efficient performance, however, it is always a good idea to construct the queries such that the query syntax matches the data types generated by the parser.

For example, if the parser is creating reference.id as a numeric data type (such as uint32 or uint64), then use the numeric syntax.

If the parser is creating reference.id as a text data type, then use the string syntax.

You are here: Queries

Attachments

    Outcomes