RSA NetWitness v11.1 introduces powerful new text indexing features to the RSA NetWitness core database. However, they are disabled by default, since using these features imposes a cost in terms of storage retention, and potentially throughput. This article explains what these features are, and how to enable them.
This article refers to search features exposed by 'msearch', which is the RSA NetWitness core API that implements text search on collections.
Text Pattern Searching
RSA NetWitness allows for searching within the text of meta items that are indexed. It does this by utilizing all the indexes on the collection, and attempting to match the search input to values stored in any of the indexes. The default behavior of this search requires that the search input match some value that the indexer has seen. So text stored in meta items has to match in one of the ways that the indexer knows about. Prior to v11.1 we could find these kinds of matches:
- Exact match: The search string matches a meta value exactly. Matches are found if they exactly match a value stored in a value-level index.
- Regex match: The search string is treated as a regex, and the regex is tested against every value in every index. This is very slow, but still much faster than a raw data scan, since only the indexed values are checked for matches.
- Exact match on truncated token: The search string can match an index value if it is truncated to some predetermined length that the index knows about. The search engine uses this type of match to fetch results based on less-accurate indexes. It uses the full search string to refine the results so that false matches due to truncation are not returned as results. Therefore, this type of match is transparent to the user and it looks like an exact match. However, truncating indexes is an important optimization to be aware of from a performance perspective.
One significant limitation of versions prior to v11.1 is that it's not easy to find a subset of the text within a meta value.
Some meta values are phrases rather than single words, such as: msg="Authentication Error".
If I want to search for the word "error", my index on msg does not yield a result, because the word is only a portion of the stored value (here, its end) rather than the whole value.
Other meta values embed important text within non-text data. A classic example of this is a text token inside a URL:
If I have an index on referrer, I can't easily use it to find "netwitness" in that value.
One way around this is to feed the raw log messages to the "word" indexer. This works around the first case, because both of the words are also copied to the word meta items. However, it does not solve the second scenario, and it incurs some overhead in terms of having some of the same text in both the "msg" and "word" meta items.
New Indexing Mode: N-grams
An N-gram is a sequence of N characters extracted from any position within the text. We now have the ability to generate indexes from these sequences on text meta items. The index engine extracts all the subsequences from the text value and stores them in the index. Note that it only stores them in the index, rather than generating meta items. Therefore, it increases index storage usage, but not meta usage.
Consider the referrer meta item used above: if I turn on N-grams on my referrer index, then we get index hits for any substring within the meta value. Embedded words like "local", "auth", "net", "netwitness", "ser", and so on are all indexed. Therefore, simple substring searches return useful matches.
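To make the idea concrete, here is a minimal Python sketch of N-gram extraction. It is only an illustration, not the NetWitness indexer: the function name ngrams is hypothetical, the lower-casing step is an assumption made for this example, and the lengths 3 and 5 are borrowed from the defaults recommended later in this article.

```python
def ngrams(text, min_length=3, max_length=5):
    """Return every substring of text whose length is between
    min_length and max_length, inclusive."""
    text = text.lower()  # assumption for this sketch: case-insensitive matching
    grams = set()
    for n in range(min_length, max_length + 1):
        # Slide a window of width n across the text, one character at a time.
        for start in range(len(text) - n + 1):
            grams.add(text[start:start + n])
    return grams


grams = ngrams("Authentication Error")
print("error" in grams)  # True: a substring at the end of the value
print("auth" in grams)   # True: a substring at the start of the value
print("er" in grams)     # False: shorter than min_length
```

Note that a substring longer than max_length (for example "authentication") is not stored directly in this sketch. The truncated-token matching described earlier is what allows such terms to be searched.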
You can turn on N-gram indexing on any index that meets these requirements:
- It must be a text meta type.
- It must be Value Indexed.
To turn on the N-gram indexer, add the ngrams, minLength, and maxLength parameters to the key entry in your custom index configuration.
Here's an example:
<key description="Text Token" level="IndexValues" name="word" minLength="3" maxLength="5" format="Text" ngrams="all" />
The ngrams parameter turns on the N-gram indexing mode. The minLength and maxLength parameters determine the range of N-gram sizes that are generated. The minimum length is important because it defines the minimum number of characters that a search term must contain in order to match anything. The maximum length is a performance optimization that limits the number of unique N-grams present in the index. Because of the exact match on truncated tokens logic mentioned earlier, the maximum length does not actually restrict the maximum length of search patterns. Instead, it controls how accurate the index is and provides a performance tradeoff: longer maximum lengths more accurately identify the correct search matches, but they impose more index storage and RAM cost. Shorter maximum lengths provide a less accurate index that takes longer to resolve search results, but use less storage and RAM. I recommend using the default minimum of 3 and the default maximum of 5.
Activating any N-gram index has a significant cost. Using the parameters above, an N-gram index entry consumes approximately 5 times the index space of a non-N-gram index entry. One rule of thumb: the cost of turning on an N-gram index is about the same as turning on 5 regular value indexes.
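The interaction between the maximum length and truncated-token matching can be sketched in Python as follows. This is a simplified model under stated assumptions, not the actual search engine: the names ngrams, build_index, and search are hypothetical, the index is a plain dictionary, the sample values are made up, and case-folding is assumed.

```python
def ngrams(text, min_length=3, max_length=5):
    """All substrings of text with length in [min_length, max_length]."""
    text = text.lower()  # assumption: case-insensitive matching
    return {
        text[i:i + n]
        for n in range(min_length, max_length + 1)
        for i in range(len(text) - n + 1)
    }


def build_index(values, min_length=3, max_length=5):
    """Map each N-gram to the set of values that contain it."""
    index = {}
    for value in values:
        for gram in ngrams(value, min_length, max_length):
            index.setdefault(gram, set()).add(value)
    return index


def search(index, term, min_length=3, max_length=5):
    """Look up a term of any length using an index of bounded N-grams."""
    term = term.lower()
    if len(term) < min_length:
        return set()  # too short to match any stored N-gram
    # Truncate the term to the longest N-gram the index knows about.
    candidates = index.get(term[:max_length], set())
    # Refine with the full term so false matches from truncation are dropped.
    return {value for value in candidates if term in value.lower()}


index = build_index(["netwitness", "netwide"])  # made-up sample values
print(sorted(index["netwi"]))               # ['netwide', 'netwitness']
print(sorted(search(index, "netwitness")))  # ['netwitness']
```

In this model, the truncated lookup for "netwitness" returns both values because they share the 5-character prefix "netwi", and the refinement step removes the false candidate. A larger maxLength would shrink the candidate set at the cost of a larger index, which is the tradeoff described above.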
The valueMax for your index should correspond to however many tokens might be generated by your maxLength parameter. If you use the default maxLength of 5 or less, it is acceptable to not specify a valueMax.
How to search against N-grams
To use N-gram matches within the event view search box, just type the substrings that you are looking for as your search terms. The N-gram indexes are used automatically wherever they are enabled. Don't worry about adding regexes or wildcards: just put whatever text you are looking for into the search box.
Finding text at the beginning of a word:
Or the middle of a word:
Or the end of a word:
The default search options for searches are suitable for N-Gram searches: specifically they require that "search indexes" is enabled.
There are additional search options that can help with more specific cases. For example, consider the case where we want to look for text specifically at the start or the end of a meta item. For that, the 11.1 search API supports glob character matching in the pattern. If I have a host name index and I want to match "*.com" in a meta value but want to exclude "*.com*", I can do that with glob patterns. They are primarily useful for filtering results to search patterns that match at the beginning or end of a meta item. Glob patterns have important caveats:
- Turn off raw packet searching when using glob patterns that match the beginning or end of a string. Otherwise, the glob pattern will attempt to apply the prefix or suffix wildcard at the beginning or end of the entire raw payload, which is rarely the desired result.
- Glob pattern matches are for meta value matches only.
- Glob and regex patterns have conflicting, incompatible syntax. Do not try to use glob and regex at the same time.
- For substring matches anywhere in the text, don't use glob characters. Just specify the text you are looking for and let the N-Gram index find the substring.
- Glob patterns support the "*" and "?" characters similar to file-name matching in unix. The asterisk (*) matches zero or more characters, while the question mark (?) matches any single character. You can specify multiple wildcard characters in the search pattern.
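The difference between "*.com" and "*.com*" can be tried out with Python's standard fnmatch module, which implements the same unix-style meaning of "*" and "?". This only illustrates the wildcard semantics; it is not how NetWitness evaluates patterns, fnmatch also accepts bracket expressions that are not described here, and the host names are made up for the example.

```python
from fnmatch import fnmatchcase

# Hypothetical host name values, for illustration only.
hosts = ["example.com", "example.com.evil.net", "mail.example.org"]

# "*.com" is anchored at the end: the value must end with ".com".
ends_with_com = [h for h in hosts if fnmatchcase(h, "*.com")]

# "*.com*" matches ".com" anywhere in the value.
contains_com = [h for h in hosts if fnmatchcase(h, "*.com*")]

print(ends_with_com)  # ['example.com']
print(contains_com)   # ['example.com', 'example.com.evil.net']

# "?" matches exactly one character.
print(fnmatchcase("host1", "host?"))   # True
print(fnmatchcase("host12", "host?"))  # False
```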
Glob search example:
Word Indexing with N-Grams
The word indexer works with a special log parser that attempts to synthesize word meta items. This word generating parser works completely independently of the N-Gram indexing feature. You can use N-Gram indexes on any text key, not just the word meta key. Conversely, you could continue to use the word indexer without turning on N-Gram indexes at all. Using the N-Gram indexer in conjunction with the word indexer is a powerful combination, however. Logs that are fed to the word indexer, and then subsequently indexed with N-Gram indexes, yield a full-text indexed searchable database. The cost for implementing a search this way is high, but it may be manageable if it reduces the number of indexes needed elsewhere in your custom index. If you do decide to enable N-grams on the word index, turn off truncation in the word tokenizer parser on the Log Decoder. This will ensure that all the possible substrings make their way into the index.
Text Searching in Network Meta
N-gram indexes can be used with packet collections, not just logs. You can add an N-gram index to any of the text meta items, as long as they are indexed at the value level. This can be very useful for searching within meta values that are long or complicated, and would otherwise require using the contains or ends query operation.