Ankush Baveja

Creating a Custom Feed Using Logs - With the Latest Entries Sorted by event.time

Blog Post created by Ankush Baveja Employee on Oct 4, 2017

There is an out-of-the-box (OOTB) Identity feed that can be configured as a recurring feed using Active Directory logs. This feed adds context to packet and log data for users in the environment. The source IP (ip.src) is matched, and the meta values below are added to the session.

  • User name
  • Domain
  • Workstation

This feed works great, but it only works for Active Directory logs.


In this post I will show you how to create custom feeds using other log event sources.


A simple yet effective way to create a custom feed is to create a report using the Reporting Engine and export it as a .csv into a folder, so that it can be accessed periodically by a recurring feed. This process is documented very well in the following discussion post:


However, if logs are continuously coming in, there is a good chance that you will see duplicate and obsolete entries in your .csv, and applying that as a feed might not be your best option. To resolve this, the .csv requires additional processing to remove the obsolete entries.


For example, let's say that I wanted to create a feed based on incoming DHCP logs. Once I create a .csv for the last <x> hours, I would then have to apply additional logic to keep only the latest entries in the .csv and remove all of the old ones.


Since custom feeds (based on logs) can grow very large depending on the requirement, removing duplicates can be challenging, especially when the .csv has multiple columns. In my case I had to remove duplicates from a .csv file with 300k rows and 5 columns, so I needed to be mindful of system performance and had to consider what to use for the dedup: lists, maps, sets, file comparisons, etc.
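
For comparison, the map-based option can be sketched in plain Python roughly as follows. This is only an illustration, not the approach used in this post: the function name keep_latest and the sample rows are made up, and it assumes Python 3 and timestamps in a format that sorts correctly as text.

```python
import csv
import io


def keep_latest(rows, key="ip.src", time_col="event.time"):
    """Return one row per key, keeping the row with the greatest time_col value."""
    latest = {}
    for row in rows:
        current = latest.get(row[key])
        # Assumption: ISO-style timestamps, which compare correctly as strings
        if current is None or row[time_col] >= current[time_col]:
            latest[row[key]] = row
    return list(latest.values())


# Made-up sample data standing in for an exported DHCP report
sample = io.StringIO(
    "event.time,ip.src,host.src\n"
    "2017-10-04 10:00:00,10.0.0.5,laptop-a\n"
    "2017-10-04 10:05:00,10.0.0.6,laptop-b\n"
    "2017-10-04 10:10:00,10.0.0.5,laptop-c\n"
)
result = keep_latest(csv.DictReader(sample))
for row in result:
    print(row["ip.src"], row["host.src"])
```

A single pass with a dictionary works, but every extra rule (column drops, ordering, timestamp parsing) has to be written by hand.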


That's when I came across the Python pandas library, which solved this problem for me. Using this library I was able to dedup the .csv file, keeping only the latest entries, in under 2 seconds with a single line of code.


Below are the steps I took:

I used a script to collect the meta keys shown below (with a where clause) from a concentrator/broker:

  • event.time
  • ip.src
  • eth.src
  • host.src
  • event.desc


My output .csv looked something like this:

[Image: dhcp csv]


The next step is to reduce the .csv so that it keeps only the latest entry for each ip.src, where latest is identified by the event.time timestamp. This is where I used pandas and this single line of code to do everything in under 2 seconds:

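import pandas as pd
# filename: path to the input .csv (with headers); outfile: path for the deduplicated feed file
pd.read_csv(filename).sort_values('event.time', ascending=True).drop_duplicates('ip.src', keep='last').drop(['event.time', 'count'], axis=1).to_csv(outfile, header=False, index=False)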


So what's happening in the above code?

  1. Reading the file as a .csv (using the .csv file's headers)
  2. Sorting the rows on event.time in ascending order
  3. Dropping duplicates for the same ip.src by keeping only the last entry (the latest after sorting)
  4. Dropping the event.time and count columns
  5. Omitting the header and writing to the final output file, 'outfile'
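
The five steps above can also be written out one statement per step, which makes it easier to inspect the data in between. The sketch below runs on a few made-up rows held in memory instead of a real export. The parse_dates argument is an addition that is not in the one-liner; it assumes event.time is in a format pandas can parse, so that the sort is chronological instead of text-based.

```python
import io

import pandas as pd

# Made-up sample rows; the real file would come from the concentrator query
raw = io.StringIO(
    "event.time,ip.src,eth.src,host.src,event.desc,count\n"
    "2017-10-04 10:00:00,10.0.0.5,00:11:22:33:44:55,laptop-a,DHCP assign,1\n"
    "2017-10-04 10:05:00,10.0.0.6,00:11:22:33:44:66,laptop-b,DHCP assign,2\n"
    "2017-10-04 10:10:00,10.0.0.5,00:11:22:33:44:77,laptop-c,DHCP renew,1\n"
)

# 1. Read the .csv using its header row (parse_dates is an assumption, see above)
df = pd.read_csv(raw, parse_dates=["event.time"])
# 2. Sort so that the oldest rows come first and the newest last
df = df.sort_values("event.time", ascending=True)
# 3. Keep only the last (newest) row for each ip.src
df = df.drop_duplicates("ip.src", keep="last")
# 4. Drop the columns the feed does not need
df = df.drop(["event.time", "count"], axis=1)
# 5. Render without header or index; pass a path instead to write a file
feed_csv = df.to_csv(header=False, index=False)
print(feed_csv)
```

With these sample rows, the older 10.0.0.5 entry (laptop-a) is dropped and the newer one (laptop-c) is kept.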


The outfile can then be picked up by a recurring feed. Since this dedup is extremely fast, you can use a very short interval on the recurring feed, as long as running the query has no negative impact on your concentrator.
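
With a short feed interval, one possible precaution is to make sure the feed never reads a half-written outfile. The sketch below is an assumption on my part and not something the original steps required: it writes to a temporary file in the same folder and then renames it over the outfile. The helper name write_feed_atomically and the demo paths are hypothetical, and it assumes Python 3.

```python
import os
import tempfile


def write_feed_atomically(csv_text, outfile):
    """Write csv_text to a temp file next to outfile, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(outfile))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", newline="") as handle:
            handle.write(csv_text)
        # The rename is atomic when both paths are on the same filesystem
        os.replace(tmp_path, outfile)
    except BaseException:
        # Clean up the temp file if the write or the rename failed
        os.remove(tmp_path)
        raise


# Demo with a throwaway folder and a made-up feed row
demo_dir = tempfile.mkdtemp()
demo_out = os.path.join(demo_dir, "dhcp_feed.csv")
write_feed_atomically("10.0.0.5,laptop-c\n", demo_out)
```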


For any type of .csv manipulation, the pandas library is an excellent and extremely fast option. Here is a cheat sheet that you can use as a guide.


If you have any feedback on the steps I took above, please do let me know.  I welcome your comments and would love to hear how this works for you.






1) I used a script and not the default Reporting Engine option since I wanted to avoid Reporting Engine overhead. The script also provides an option to keep headers, which was useful while processing the file with pandas.

2) The script groups rows and puts a count in the last column. I haven't shown this in the above snapshot.

3) Pandas does not work with Python 2.6.6; you need Python 2.7 or later.