A recent customer question about alerting on Uptime values from the REST API got me digging into the Health and Wellness Policies for a better solution.
The request was to alert when the uptime value for specific device families was reset indicating that something had occured with the service and reset the uptime value. Repeated resets of the uptime value could indicate an issue with the service that needed attention (core files created as a result of decoder service crashes was the root of this request).
Here is my solution:
- Admin > Health and wellness > Policies
- Select the + and add a new policy for the service that you want to monitor
- In this case the Archiver service is our example
- Add a new Rule
- The conditions
- Alarm = Regex match on .., .. seconds.*
- REcovery = !Regex match on .., .. seconds.*
- Set your notification output at the bottom
- save and enable the policy at the top
Now you have a policy that alerts when the uptime is within the first 60 seconds of restarting (.. is two digits so up to 60 seconds) and recovers once the uptime doesnt match the pattern (when 60 seconds switches to minute and seconds (61 seconds +)
Details on the pattern developed:
number of seconds followed by a comma then the friendly time breakdown of the seconds in years, months, weeks, days, hours, minutes and seconds.
.. = looked for 2 digits for the seconds (between 10-59 seconds after service restarted)
, .. = looked for the same seconds value after the comma
seconds.* = the word seconds and the trailing space in the value
when this pattern is matched (between 10-59 seconds after restart) there will be an alarm, then it will clear when that pattern is not matched (60 seconds +)