Assuming you already follow best practices for performance when writing correlation rules, reports, and alerts, does anyone have a systematic approach or best practices for effectively investigating, troubleshooting, and ultimately reducing Alerter latency?
I'd be interested in knowing how to pinpoint, from a resource-consumption perspective, which views, correlation rules, etc. are contributing to high Alerter latency, and at what percentages — essentially a rack-and-stack list of the top offenders causing or contributing to latency. I'd also like to create tabular and graphical reports on Alerter latency metrics so I can pinpoint latency surges and correlate them against other factors present at the same point in time, like EPS input rates, resource consumption of the other NIC services, etc. We're not running OVO or OVIS on the appliances, but I'd also be interested in hearing anyone's experience leveraging those tools to troubleshoot Alerter latency.
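To illustrate the kind of surge/correlation analysis I'm after, here's a rough sketch of what I'd like the reports to do, assuming I could export per-minute Alerter latency and EPS samples from the appliance (the data below is synthetic and the column layout is entirely made up — I don't know of a supported export for this, which is partly what I'm asking about):

```python
# Hypothetical sketch: correlate Alerter latency samples against EPS input
# rates. Timestamps, values, and thresholds are all invented for illustration.
from statistics import mean, stdev

# Synthetic per-minute samples: (minute, Alerter latency in seconds),
# with an artificial surge injected at minutes 30-39.
latency = [(t, 5 + (25 if 30 <= t < 40 else 0) + t % 3) for t in range(60)]
# Synthetic per-minute samples: (minute, events per second), same surge window.
eps = [(t, 800 + (1500 if 30 <= t < 40 else 0) + t % 7) for t in range(60)]

def pearson(xs, ys):
    """Plain sample Pearson correlation between two equal-length series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / ((len(xs) - 1) * stdev(xs) * stdev(ys))

# Align the two series on timestamp before correlating.
by_minute = {t: v for t, v in latency}
pairs = [(by_minute[t], v) for t, v in eps if t in by_minute]
r = pearson([p[0] for p in pairs], [p[1] for p in pairs])

# Flag latency surges (here: anything above mean + 2 standard deviations).
vals = [v for _, v in latency]
cutoff = mean(vals) + 2 * stdev(vals)
surges = [t for t, v in latency if v > cutoff]
print(f"latency/EPS correlation r={r:.2f}, surge minutes={surges}")
```

The point isn't this particular script — it's that I'd like to get the underlying latency and per-service resource metrics out of the appliance in a form where this kind of cross-correlation is even possible.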
And just to set the stage, I'll clarify a bit by saying the following answers do not count:
1. "You're probably looking too broad across your logs... fine-tune your alerts to look for just the needle in the haystack." I've heard this before, and I agree it works well if you know what the needle is. In some cases we don't have that luxury. I'm all for efficiency, but sometimes we have to run very broad searches across a large percentage of the incoming logs.
2. "You're overworking the ASRV / DSRV and need to throw more hardware at it, like another ASRV or another stack for analysis / reporting purposes."
Both of which are valid in some cases, but here I'm interested in tackling the root cause of the issue.
Thanks in advance for your thoughts.