2 1/2 hours to open Alert History
I inherited a broken and nearly unused enVision box about 2 years ago. I have spent most of this time reading the help files and piecing it together. The person I replaced apparently didn't spend any time at all setting it up.
I attended RSA enVision Administration and Operations a few weeks ago and have spent most of my time in the office since then correcting various setup mistakes and changing the direction we have been going with it.
My problem is with the SonicWall alerts. I had an alert incorrectly configured, and enVision was swamped with alerts. To top that off, an old SonicWall logging server was inadvertently turned on, which compounded the issue. Now it takes about 2 1/2 hours to open the alert history for that device. When I click on a message, it opens in 4 or 5 minutes. I can then disposition the alert, and it takes another 2 1/2 hours or so for the alert history to display again.
My question is twofold:
1) What would cause this type of performance issue? How do we diagnose and correct it?
2) At this time we are not actively using the alert history, nor are we using the functionality of dispositioning the alerts (though that may come about at a later date). How would we go about offloading the alerts and clearing them off the appliance?
Our equipment: We have a 50-series HA appliance (Windows 2000). enVision version is 4.0.0 Build 0228. We have a total of 108 monitored devices, but only 60 are active. We are not currently using Event Explorer (licensing issue - we're working on it).
Here are a few days of the event totals for the SonicWall: 1,262,056; 1,103,179; 1,106,481; 1,213,086.
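For scale, ~1.2 million events per day divided by 86,400 seconds works out to roughly 14 events per second, sustained around the clock, from this one device.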
The alert history is slow for all the other alerts too, but not as bad as for the SonicWall.
Sorry for the length of this post, but I wanted to include any information that may be relevant. Please let me know if there is any other information necessary to troubleshoot this.
I found another post by Ryan that addressed a different issue but led me to the Alert History configuration page (not sure why I didn't think of that!). I changed the alert history to The Past 2 Days. This won't be a permanent setting, but it will get me through for a while; it works for us since we are not actively using the alert history except for reference at this time.
Even at 2 days, there are 114,148 alerts for the SonicWall (nearly 40 alerts a minute). No wonder it was taking so long to load! Now it is manageable, taking a minimal amount of time to load.
Sorry to hear about your recently acquired/adopted/inherited enVision platform. It sounds like you're making the right efforts to get it back in shape, though, so that's good. After reading your thread I'm sure there are a lot of things at play here causing your high number of alerts, but I just wanted to suggest / remind you of a really cool feature that you might be able to leverage as a temporary stopgap to get you some alert relief until you figure out what's causing so many hits. That is... alert suppression.
You may have read about it in help, but in a nutshell, alert suppression is designed for cases like this where you want to get an alert, but not necessarily ALL of them. There's a lot of flexibility in how you can set it up, and it's really simple to do. When I make a new correlation alert and I think I've tested it enough, I always (ok, *most* of the time) enable alert suppression when deploying the alert in production, at least for the first 48 to 72 hours.
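If it helps to picture the mechanics, here's a rough sketch of the general idea in Python. This is purely illustrative, not how enVision implements it under the hood, and the alert key and the one-hour window are made-up values; in the product it's all point-and-click on the alert definition.

```python
import time

# Purely illustrative, not enVision's implementation. The idea: the first
# matching event fires an alert; further matches on the same key are
# swallowed until the suppression window expires.
SUPPRESS_WINDOW_SECS = 3600  # hypothetical one-hour window

last_fired = {}  # alert key -> time the alert last actually fired

def should_fire(alert_key, now=None):
    """Return True if the alert should fire, False if it is suppressed."""
    now = time.time() if now is None else now
    last = last_fired.get(alert_key)
    if last is None or now - last >= SUPPRESS_WINDOW_SECS:
        last_fired[alert_key] = now
        return True
    return False

# A burst of 10,000 identical SonicWall hits produces exactly one alert:
fired = sum(should_fire(("SonicWall", "rule42")) for _ in range(10000))
print(fired)  # 1
```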
I know exactly what you mean about the GUI taking forever to load the alert history. Make sure you check with support on that also; I recall an EBF they gave me that fixed / improved this issue quite a bit. Even so, if an alert is too broad / loose it can fire so much that it causes similar symptoms. Alert suppression is a very cool thing, especially when tweaking existing alerts down to where they should be. When you're comfortable with the rule, then you can lift the suppression and let 'er rip.
Thanks for the suggestion, Ryan. I'll go back in and look at the suppression feature.
The cause of the alerts was known at the time, but I didn't see the alerts coming in until they had been going on for a while (as I mentioned, we are still in the setup stages and not fully in production, so it wasn't being monitored).
For me, the frustrating part was that enVision reached the point where the alerts couldn't even be managed, much less used. I'm wondering if an archive feature should be incorporated into the alert history page. After waiting so long for the alert history page to load, it would have been helpful to have a SELECT ALL feature and a MOVE TO ARCHIVE button or something like that. Maybe there is a way to dump the alerts to an archive in some other manner?
The term "Alert History" is misleading to a point, because when you're monitoring the Real-time Detail screen, Alert History is where you are taken to get the information on the alerts. I think more care should be taken to keep that area functional, as it is critical when an event is in process. It would make sense to have a Real-time Summary screen (what is currently the Real-time Detail), then a Real-time Detail screen configurable to show the last X minutes, which should keep it light and functional. Then there should be an Alert History page for viewing events that should really be considered "historical".
I'm still learning enVision. Maybe I'm missing something in the functionality that would have made this event a little easier to manage (besides the suppression, which I'll get into this week). As it turns out, this was a good exercise to help me be better prepared if/when we have an event in the production environment.
No, I don't think you're missing anything; I think you're spot on in your observations. I too struggled with this same constraint between the Real-Time Detail page and the Alert History page. Only heavy users who are in the tool daily, trying to get some operational security monitoring out of it, really notice these kinds of things. Yes, frustrating, I know.
When it chokes the system up like that, the vendor will likely tell you to bounce the alerter service to clear them all. Very intrusive, to say the least. Then they'll suggest that the long-term fix is to optimize your alert: it's written too broadly, so go refine what it's looking for to be tighter (resulting in fewer alerts, and in turn better performance of those two pages). But I'd argue they've never had to build rules on a real enterprise network either.
For what it's worth, we've abandoned using both of them for our real-time monitoring efforts, and have opted to run a series of reports against the Alerts table that we shove up to the dashboard. It's along the lines of what you're referring to: light and functional. We do some stuff with different summary counts, like running something every 5 minutes that shows <AlertName><IP><count> sorted by count descending, to show the top offending IPs for specific alerts. I'm sure there are other ways to go about it, but it's far more efficient. Come to think of it, the only time I use those two pages nowadays is in build/test/troubleshooting of new alerts. They're great for debugging, but they don't lend themselves to real-world monitoring.
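If you ever want to prototype that kind of summary outside the product, the aggregation itself is trivial. Here's a rough Python sketch over an exported alerts CSV; the file name and column names are just assumptions for illustration, not anything enVision-specific:

```python
import csv
from collections import Counter

# Illustrative sketch: tally <AlertName><IP><count> from an exported alerts
# CSV and print the pairs sorted by count descending. The file name and
# column names ("alert_name", "source_ip") are assumed, not enVision's.
counts = Counter()
with open("alerts_export.csv", newline="") as f:
    for row in csv.DictReader(f):
        counts[(row["alert_name"], row["source_ip"])] += 1

# Top 10 offending (alert, IP) pairs, highest count first.
for (alert, ip), n in counts.most_common(10):
    print(f"{alert}\t{ip}\t{n}")
```

The descending sort falls out of Counter.most_common() for free, which is exactly the top-offenders view described above.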
Best of luck,