This topic guides administrators in tuning a Packet Decoder for high-speed packet capture.
This guide applies when capturing packets on a 10G interface card. Packet capture at these speeds requires careful configuration and pushes the Decoder hardware to its limits, so read this entire topic before implementing a 10G capture solution.
Prerequisites
- Series 4S Decoder
- Intel 82599-based Ethernet card, such as the Intel X520. All RSA-provided 10G cards meet this requirement.
- 96 GB of DDR3-1600 memory in dual-rank DIMMs. Single-rank DIMMs may decrease performance by as much as 10%. To determine the speed and rank of the installed DIMMs, run the command dmidecode -t 17.
- Sufficiently large and fast storage to meet the capture requirement. Storage considerations are covered later in this topic.
- Linux kernel package obtained from RSA. Only Linux kernel packages provided by RSA are supported.
- The pfring package that matches the currently installed kernel; the kernel and pfring versions must match exactly.
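The DIMM check mentioned above can be scripted. The sketch below parses sample `dmidecode -t 17` output with awk; the heredoc is illustrative sample data, not output from a real system (on a live host, pipe `sudo dmidecode -t 17` into the function instead):

```shell
#!/bin/sh
# Summarize DIMM speed and rank from `dmidecode -t 17` output.
# On a live host:  sudo dmidecode -t 17 | summarize_dimms
summarize_dimms() {
  awk -F': ' '
    /Speed:/ && !/Configured/ { speed = $2 }   # remember the raw DIMM speed
    /Rank:/  { print "DIMM at " speed ", rank " $2 }
  '
}

# Illustrative sample output (two dual-rank DDR3-1600 DIMMs):
summarize_dimms <<'EOF'
Memory Device
        Speed: 1600 MT/s
        Rank: 2
Memory Device
        Speed: 1600 MT/s
        Rank: 2
EOF
```

Dual-rank 1600 MT/s modules are what this guide calls for; a rank of 1 indicates the slower single-rank configuration.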
10G Decoder Installation and Configuration Changes
The following steps are needed to install the 10G Decoder:
- Ensure your host has a sufficiently new BIOS. The Decoder 10G package has been tested with Dell R620 BIOS v1.2.6 from May 2012. Earlier BIOS revisions have trouble properly identifying the location of the 10G capture card within the system. Update the BIOS before installing packages, because the packages use information provided by the BIOS to initialize the system.
- Install or upgrade the Decoder, using the normal upgrade procedure.
- If pfring is not installed at this point, install the pfring package that matches the currently installed kernel.
- Reboot the Decoder.
Post-install configuration tasks:
- The session and packet pools must fit within half of main memory (48 GB on a 96 GB host). For example:
- Set pool.packet.pages (under /decoder/config) to 1000000: 1,000,000 x 32 KB pages = about 30.5 GB of memory.
- Set pool.session.pages (also under /decoder/config) to 300000: 300,000 x 8 KB pages = about 2.3 GB.
- The packet write block size (/database/config/packet.write.block.size) must be set to exactly 4 GB. This enables a special code path that buffers each packet file entirely in Huge Pages and writes it to disk using Direct I/O. This path requires two memory buffers, each sized according to /database/config/packet.file.size, so factor this into the memory requirements of the Decoder process. Even if the configured file size is larger than 4 GB, the whole file is buffered in memory. These two buffers are not swappable and must fit in main memory.
For example, if packet.file.size is set to 8 GB, then the two memory buffers will consume 16 GBs of main memory.
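The memory budget described above can be sanity-checked with simple arithmetic. This sketch uses the page sizes stated in this guide (32 KB packet pages, 8 KB session pages) and the two file-sized write buffers:

```shell
#!/bin/sh
# Sanity-check the Decoder memory budget from this guide's examples.
packet_pages=1000000      # /decoder/config/pool.packet.pages
session_pages=300000      # /decoder/config/pool.session.pages
packet_file_size_gb=8     # /database/config/packet.file.size

packet_pool_gb=$(( packet_pages * 32 / 1024 / 1024 ))   # 32 KB pages, ~30 GB
session_pool_gb=$(( session_pages * 8 / 1024 / 1024 ))  # 8 KB pages, ~2 GB
write_buffers_gb=$(( 2 * packet_file_size_gb ))         # two non-swappable buffers

echo "packet pool:   ${packet_pool_gb} GB"
echo "session pool:  ${session_pool_gb} GB"
echo "write buffers: ${write_buffers_gb} GB"
```

The pools (about 33 GB together) stay within the half-of-memory limit, and the two write buffers add another 16 GB that must also fit in main memory.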
- Select the capture device PFRINGZC,p1p1 or PFRINGZC,p1p2, depending on the capture port. To capture from both ports, select PFRINGZC,p1p1 and set capture.device.params to device=zc:p1p2,zc:p1p1
On some hardware the interfaces appear as PFRINGZC,eth4, PFRINGZC,eth5, PFRINGZC,p2p1, or PFRINGZC,p2p2. Make sure to select the correct input port.
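For reference, the dual-port zero-copy setup described above can be summarized as the following configuration fragment (the node name capture.selected is an assumption based on common Decoder configuration trees; your interface names may differ):

```
/decoder/config/capture.selected      = PFRINGZC,p1p1
/decoder/config/capture.device.params = device=zc:p1p2,zc:p1p1
```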
- Ensure that the selected capture hardware is on the correct NUMA node. This is done as part of the install, but if any changes are made to the hardware after the installation, you may have to verify these settings.
- From an ssh session to the host, execute:
where <interface_name> is the selected capture interface (e.g. p2p1).
- If the result is 0, no additional configuration is necessary.
If the result is -1, your server's BIOS did not report which CPU socket the capture card is wired to. Update the BIOS and restart the installation. If the problem persists, contact the hardware manufacturer for details about the NUMA node affinity of the PCIe slots in your server.
- If the result is greater than 0, add the result to the capture parameters (capture.device.params) as the core parameter (for example, core=1).
This change takes effect when capture is (re)started.
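The NUMA lookup in the steps above is typically done through sysfs. The helper below is a sketch assuming the standard Linux sysfs layout (the exact command in your installation may differ); it prints -1 when the BIOS did not expose the information or the interface has no backing PCI device, matching the result values discussed above:

```shell
#!/bin/sh
# Print the NUMA node a network interface is wired to, via sysfs.
# Prints -1 when the information is unavailable (no BIOS report,
# or no backing PCI device for the interface).
numa_node_of() {
  cat "/sys/class/net/$1/device/numa_node" 2>/dev/null || echo "-1"
}

numa_node_of p1p1   # prints 0, 1, ... or -1 depending on your hardware
```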
- If the write thread is having trouble sustaining the capture speed, you can try the following:
- Change /database/config/packet.integrity.flush to normal
- If that doesn't fix it, set all *.integrity.flush values to normal
- You can try adjusting packet.file.size to something higher, but keep it under 10 GB, because the whole file is buffered in memory at these speeds.
- (Optional) Application parsing is extremely CPU intensive and can cause the Decoder to drop packets. To mitigate drops induced by application parsing, set /decoder/config/assembler.parse.valve to true. This has the following effects:
- When session parsing becomes a bottleneck, application parsers (HTTP, SMTP, FTP, etc…) will be temporarily disabled.
- Sessions are not dropped while the application parsers are disabled; only the fidelity of the parsing performed on those sessions is reduced.
- Sessions parsed when the application parsers are disabled will still have associated network meta (NETWORK parser)
- The statistic /decoder/parsers/stats/blowoff.count displays the count of all sessions that bypassed application parsers (network parsing is still performed).
- When session parsing is no longer a potential bottleneck, the application parsers are automatically re-enabled.
- The assembler session pool should be large enough that it is not forcing sessions. Sessions are being forced if the statistic /decoder/stats/assembler.sessions.forced is increasing and /decoder/stats/assembler.sessions stays within several hundred of /decoder/config/assembler.session.pool.
On a typical setup running at just under 10G, configure /decoder/config/assembler.session.pool to 1000000; /decoder/stats/assembler.sessions will then average around 630K.
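The forced-session statistic above can be polled remotely. The sketch below assumes the Decoder's REST interface is enabled on its usual port (50104) and that statistics answer to a ?msg=get query; both are assumptions about your deployment, and decoder.example.com is a placeholder:

```shell
#!/bin/sh
# Build a REST URL for a Decoder statistic node.
# Port 50104 and the ?msg=get convention are assumed defaults;
# adjust for your deployment and authentication settings.
stat_url() {  # stat_url <host> <stat-path>
  printf 'http://%s:50104%s?msg=get\n' "$1" "$2"
}

# Example: watch for forced sessions once per minute (commented out
# because it requires a live Decoder):
# while sleep 60; do
#   curl -s "$(stat_url decoder.example.com /decoder/stats/assembler.sessions.forced)"
# done
stat_url decoder.example.com /decoder/stats/assembler.sessions.forced
```

A steadily increasing value from this query indicates the session pool is too small.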
- Set the /decoder/config/numa.bindings parameter to default.
- Set the /decoder/config/parse.threads parameter to 12.
- After making changes to 10G configuration parameters, reboot the Decoder host.
Storage Considerations
When capturing at 10G line rates, the storage system holding the packet and meta databases must be capable of a sustained write throughput of 1400 MByte/s.
There are several ways to achieve such high sustained throughput. Here we describe one such possible solution, though other storage architectures are possible.
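The 1400 MByte/s figure can be sanity-checked from line rate. In this sketch, the overhead added on top of the raw packet stream for meta and database writes is an illustrative assumption, not a measured figure:

```shell
#!/bin/sh
# Sanity-check the sustained-write requirement for 10G capture.
line_rate_gbit=10
raw_mbytes=$(( line_rate_gbit * 1000 / 8 ))   # 1250 MByte/s of raw packets
fc_mbytes=$(( 8 * 1000 / 8 ))                 # 1000 MByte/s: a single 8 Gbit FC link

echo "raw packet stream:      ${raw_mbytes} MByte/s"
echo "with meta/db overhead:  ~1400 MByte/s sustained (as cited above)"
echo "single 8 Gbit FC link:  ${fc_mbytes} MByte/s (insufficient on its own)"
```

This also shows why a single 8 Gbit FC link to a SAN cannot keep up, as noted in the SAN section below.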
Using the Series 4S hardware, with two DAC units
The Series 4S is equipped with a hardware RAID SAS controller capable of an aggregate 48 Gbit/s of I/O throughput. It has 8 external 6 Gbit/s ports, organized into two 4-lane SAS cables. The recommended configuration for 10G is to balance at least 2 DAC units across these two external connectors: connect one DAC to one port on the SAS card, and connect another DAC to the other port. As you add more DACs, chain them off each port in a balanced manner.
As you add capacity, use the NwMakeArray script to provision the DAC units. This automatically adds them to the NwDecoder10G configuration as separate mount points. The independent mount points are important because they allow NwDecoder10G to segregate the write I/O from capture from the read I/O needed to satisfy packet content requests.
Other storage configurations (SAN, etc.)
The Decoder allows any storage configuration that can meet the sustained-throughput requirement. Note that a standard 8 Gbit FC link to a SAN is not sufficient to store packet data at 10G; to use a SAN, you may need to aggregate across multiple targets using a software-RAID scheme.
Parsing at High Speeds
Parsing raw packets at high speeds presents unique challenges. Given the high session and packet rates, parsing efficiency is paramount. A single inefficient parser (one that spends too long examining packets) can slow the whole system to the point where packets are dropped at the card. For initial 10G testing, start with only the native parsers (except SMB/WebMail). Use the native parsers to establish a performance baseline with little to no packet loss. Do not download any Live content until this is done and the system is proven to capture without issue at high speeds.
After the system has been operational and running smoothly, add Live content very slowly - especially parsers, which can have a dramatic effect on performance. Here are some rules of thumb:
Tested Live Content
The following parsers can all be run together (as a set, not merely individually) at 10G on our test data set:
- MA content (7 Lua parsers, 1 feed, 1 application rule)
- 4 feeds (alert ids info, nwmalwaredomains, warning and suspicious)
- 41 application rules
- DNS_verbose_lua (disable the native DNS parser)
- MAIL_lua (disable the native MAIL parser)
- SNMP_lua (disable the native SNMP parser)
- SSH_lua (disable the native SSH parser)
- SMB_lua (the native SMB parser is disabled by default)
- html_threat
The following parsers carry a measurable cost at 10G:
- HTTP_lua reduces the capture rate from over 9G to under 7G. At just under 5G, this parser can be used in place of the native parser (in addition to the list above) without dropping packets.
- xor_executable pushes parse CPU to 100%, and the system can drop significantly at times due to parse backup.
Aggregation on a 10G Decoder
A 10G Decoder can serve aggregation to a single Concentrator while running at 10G speeds.
- The Concentrator aggregates between 45k and 70k sessions/sec.
- The 10G Decoder captures between 40k and 50k sessions/sec.
With the content identified above, this represents about 1.5 to 2 million meta values per second.
- Turn on nice aggregation on the Concentrator to limit the performance impact on the Decoder.
- Given the high volume of sessions on the Concentrator, consider activating parallel values mode by setting /sdk/config/parallel.values to true. This improves investigation performance when the session rate is above 30k sessions/sec.
If multiple aggregation streams are necessary, it is less impactful to aggregate from the Concentrator rather than directly from the Decoder.