
Writing Parsers - Part 1: Introduction

Discussion created by RSA Admin Employee on Oct 12, 2012

This is the first post in a series on writing your own parsers.

Part 1:  Introduction


Before I get into the mechanics of writing a parser, I'd like to give an overview of some of the non-mechanical groundwork that goes into writing a parser.



There are, broadly, two types of parsers:

The first type is the signature-based, IDS-style parser, which just registers an alert if some pattern of bytes is encountered. This is the simplest type of parser, but I'm not going to be talking about these, because:

(a) if that's all you want a parser to do then you'll learn how to write simple parsers as you learn to write complex parsers
(b) focusing on signatures completely ignores the power of parsers


To that last point, what is the power of parsers? Parsers extract information from network traffic. If you've ever picked through a pcap in Wireshark asking questions and looking for answers then you've already done this manually. Parsers are able to automate the process of picking through pcaps asking questions - which frees you to analyze the answers instead of spending your time finding them in the first place. A signature is only good at telling you what you already know ("this combination of bytes is bad"). The output of a parser facilitates finding bad stuff you didn't even know about before.

So the other type of parser is the kind that extracts information, the kind that answers questions. These are much more complex, and they require an understanding of what you're looking at before you even begin thinking about writing a parser for it.


It is far more important to understand what you are parsing than it is to know how to write a parser. Writing a parser is really just writing a script. Knowing the syntax of the script's statements is less important than understanding how to put them together meaningfully.


For example, imagine you wanted to write a signature to detect 0x0A0B0C0D0E0F. But what you are really interested in is the contents of a string that appears 96 bytes later, and in a different packet.


A signature can alert you to the presence of 0x0A0B0C0D0E0F - then you open the packets in Wireshark, find the pattern, count out 96 bytes (being sure to skip over the packet and frame header bytes), see whether the string is there and, if so, what it is. And you do all this for every session in which the pattern is detected.


Or, a parser could detect the pattern, look 96 bytes further into the stream, and register the string, so that when you open Investigator the string is presented right there for you. Which workflow would you rather have?
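To make the parser side of that comparison concrete, here is a minimal sketch in Python. It is not the syntax a real parser uses, just the logic. Several things are my assumptions for illustration: the 96 bytes are counted from the end of the pattern, the string is ASCII and NUL-terminated, and the two packets have already been reassembled into one stream. The function name extract_string is hypothetical.

```python
from typing import Optional

PATTERN = bytes.fromhex("0A0B0C0D0E0F")  # the byte pattern from the example
OFFSET = 96  # assumption: counted from the end of the pattern


def extract_string(stream: bytes) -> Optional[str]:
    """Return the string found OFFSET bytes after PATTERN, or None."""
    start = stream.find(PATTERN)
    if start == -1:
        return None  # pattern is not in this session
    pos = start + len(PATTERN) + OFFSET
    if pos >= len(stream):
        return None  # stream ends before the string would begin
    end = stream.find(b"\x00", pos)  # assumption: NUL-terminated string
    if end == -1:
        return None
    try:
        return stream[pos:end].decode("ascii")
    except UnicodeDecodeError:
        return None  # not text, so nothing worth registering


# Two "packets" joined into one stream. Because the search runs over the
# reassembled payload, the packet boundary and header bytes never matter.
packet1 = b"\x01\x02" + PATTERN + b"\x00" * 50
packet2 = b"\x00" * 46 + b"hello.example\x00trailing"
print(extract_string(packet1 + packet2))  # prints "hello.example"
```

The point of the sketch is that the counting you would otherwise do by hand in Wireshark becomes one line of arithmetic.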


Now that I'm done preaching to the choir... You'll have to answer a few questions before you begin writing your parser. Once you've answered these questions, the logic and flow of your parser will almost become obvious.


What do you want to extract?


You probably already have at least some vague notion as to what you want your parser to do. In any case, you write a parser because you want it to do something. As simple as it sounds, that is the most fundamental thing you have to figure out before you even start - what do you want it to do, specifically?


So the first question you have to ask yourself before you even open up your editor is "What information do I want to extract?"


The typical way to answer that question is a combination of:


(A) Considering the information you are looking for when you pick through a pcap of that type of session. Exactly what questions are you asking, what answers are you looking for?


(B) Considering the meta keys (categories) that the "answers" are registered as. Which of them can you find in the session? In other words, what else can you get?


From this point forward, I'll be referring to the information that you want to extract as "meta".


How do you get to it?


Once you know what meta you want your parser to extract from a particular type of session, the next question is: where is it in the session?


Another way to think about the question is: what's always found nearby? And, ideally, I do mean "always". It could be something obvious like "user=", it could be a series of non-ascii bytes, or it could be a combination of both.


In any case, it should be something fairly unique, something that won't typically be found in other types of sessions or in random data. It should also be fairly long (which helps with uniqueness) - at least six characters or bytes, ideally more. Three or fewer is too few. 0x0D 0x0A is no good at all - not only is it too short (2 bytes), but it is found in almost every session as an end-of-line indicator.
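One way to test a candidate against these criteria, if you have sample sessions on hand, is sketched below in Python. The helper evaluate_token and the sample payloads are made up for illustration; the six-byte minimum is the rule of thumb from above.

```python
MIN_TOKEN_LENGTH = 6  # rule of thumb: at least six characters or bytes


def evaluate_token(token: bytes, target_sessions, other_sessions) -> dict:
    """Check a candidate token against sample payloads (hypothetical helper)."""
    return {
        "long_enough": len(token) >= MIN_TOKEN_LENGTH,
        # "always": present in every session of the type you want to parse
        "always_present": all(token in s for s in target_sessions),
        # "unique": number of unrelated sessions that also contain it
        "false_hits": sum(token in s for s in other_sessions),
    }


# Made-up sample payloads, for illustration only.
target = [b"cmd=login&username=alice\r\n", b"cmd=login&username=bob\r\n"]
other = [
    b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n",
    b"220 mail.example.com ESMTP\r\n",
]

print(evaluate_token(b"username=", target, other))
print(evaluate_token(b"\r\n", target, other))
```

Here "username=" passes all three checks, while 0x0D 0x0A fails on length and also turns up in both unrelated sessions.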


You can't use the meta itself, since if you knew what it was beforehand then you wouldn't need a parser. If you wanted to extract usernames, you couldn't just use "Alice" or "Bob".


It doesn't have to be something immediately next to your meta. The closer the better, but uniqueness and specificity are more important than proximity.


Another way of thinking about this question is - when you open the session in Wireshark, how do you navigate to the field you are interested in? If you just rely on Wireshark to determine the fields for you, that's okay but you'll need to figure out how Wireshark determines which fields are which.


The answer to this question will be used to start the logic of your parser.


From the imaginary example above dealing with complex parsers vs. "signatures", the pattern 0x0A0B0C0D0E0F could be used to get to the string you are really interested in.


How do you know you've gotten to it (and not somewhere else)?


This is a similar question.  What else is always in the session?  What other characteristics do these sessions share?  From answering "what's nearby" you probably already have a list.


The answers to this question will be used as sanity, or false-positive, checks, so that when it comes time to register your meta, you know with certainty that it really is a username, or a string you wanted to know the contents of, or whatever - that it isn't gibberish.


You can go too far and risk false negatives. But if you answer this question well, then the risk is minimal.


This, to my mind, is the mark of a good parser - that consideration has gone into false-positive checking, that some effort is made not only to ensure that the parser is extracting what it is supposed to but also that it is not extracting what it shouldn't.
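As a rough illustration of what such a check might look like, here is a Python sketch that decides whether some extracted bytes are plausible as a username before registering them. The allowed character set and the 64-byte limit are assumptions I picked for the example, not rules from any product, and looks_like_username is a hypothetical name.

```python
import string

# Assumption: usernames are short ASCII strings drawn from this character set.
ALLOWED = set(string.ascii_letters + string.digits + "._-@")


def looks_like_username(candidate: bytes, max_length: int = 64) -> bool:
    """Return True only if the bytes are plausible as a username."""
    if not 1 <= len(candidate) <= max_length:
        return False  # empty or implausibly long
    try:
        text = candidate.decode("ascii")
    except UnicodeDecodeError:
        return False  # non-ASCII bytes: probably binary gibberish
    return all(ch in ALLOWED for ch in text)


print(looks_like_username(b"alice"))         # True
print(looks_like_username(b"\x00\xff\x13"))  # False
print(looks_like_username(b"bob smith"))     # False
```

The last call shows the false-negative risk: if real usernames in your traffic can contain spaces, this check is too strict and needs loosening.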


Simple example


Perhaps you want to know the filenames requested in HTTP sessions to your web server. You don't want to have to open every session in Wireshark just to see what the filename was.


What do you want to extract? Filenames from HTTP requests.


What else can you get? Directory and extension are fairly obvious possibilities. Request method could be registered as 'action' meta.


What is always nearby? The request method (GET, POST, etc.) and HTTP version (HTTP/1.0 or HTTP/1.1) should both always be present.


What else is always in the session? The end of the HTTP headers (0x0D0A0D0A). Various other HTTP headers (Host, User-Agent, etc.) may or may not be present, so they are useful as extra evidence but can't be relied upon.
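Putting those four answers together, the logic might look something like the Python sketch below. Again, this is not parser syntax, only the flow. The list of methods, the dictionary keys other than 'action', and the assumption that the request line sits at the very start of the stream are all mine, for illustration.

```python
import posixpath
from typing import Optional

# Assumption: a representative, not exhaustive, list of request methods.
METHODS = (b"GET", b"POST", b"HEAD", b"PUT", b"DELETE", b"OPTIONS")
VERSIONS = (b"HTTP/1.0", b"HTTP/1.1")
END_OF_HEADERS = b"\r\n\r\n"  # 0x0D0A0D0A


def parse_request(stream: bytes) -> Optional[dict]:
    """Extract filename, directory, extension and action from a request."""
    # Sanity check: the end of the headers should always be in the session.
    if END_OF_HEADERS not in stream:
        return None
    # Assumption: the request line is the first line of the stream.
    parts = stream.split(b"\r\n", 1)[0].split(b" ")
    if len(parts) != 3:
        return None
    method, target, version = parts
    # What is always nearby: a known method before and a version after.
    if method not in METHODS or version not in VERSIONS:
        return None
    try:
        path = target.split(b"?", 1)[0].decode("ascii")  # drop query string
    except UnicodeDecodeError:
        return None
    directory, filename = posixpath.split(path)
    return {
        "action": method.decode("ascii").lower(),
        "directory": directory,
        "filename": filename,
        "extension": posixpath.splitext(filename)[1].lstrip("."),
    }


request = b"GET /images/logo.png?v=2 HTTP/1.1\r\nHost: www.example.com\r\n\r\n"
print(parse_request(request))
```

Notice that most of the function is checking, not extracting. The filename itself takes one line; the rest makes sure that what gets registered really came from an HTTP request.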