RSA Identity Governance & Lifecycle uses third-party software called Workpoint for workflows. Workpoint server monitors provide a way of monitoring the Workpoint queues to ensure that workflows are running smoothly. If these queues become backed up, workflows can stall and, at times, the Workpoint server can shut down. At that point change requests stop processing and the RSA Identity Governance & Lifecycle application may come to a virtual halt.
This RSA Knowledge Base Article explains how to detect what is causing stalled workflows and/or the Workpoint server shutdown, and how to rectify the situation.
The Workpoint server slows down and possibly halts when the Workpoint queues become overloaded and cannot keep up. The first signs of this are workflow stall messages and slower change request processing. Both well-designed and poorly designed workflows and change requests can cause this situation. For example, a perfectly valid change request that generates ten million items with one job per item can cause the Workpoint server to slow down. A slowdown caused by a large but valid change request will usually clear up on its own over time. But what if the change request is aberrant or poorly designed? Workflows can also run independently of change requests, for example those behind custom tasks or rules. What if one of these workflows becomes aberrant or is poorly designed?
Consider the following analogy. Traffic may be slow because there is an accident on the highway. Or it may be slow because there are too many cars on the highway. Either way, the situation will usually resolve itself given time. However, consider the situation where one person (WF JOBS) is driving 1000 cars (Work in the Queue). This will cause the Workpoint monitor to shut down.
There are two parts to this resolution:
- Detect and terminate the aberrant change request and/or workflow.
- Clean up the Workpoint monitoring queues. Be careful not to cancel any valid change requests or work that is in the Workpoint queues.
Detect and terminate the aberrant change request or workflow
NOTE: If you are on an RSA Identity Governance & Lifecycle version prior to 7.1.1 P04, you may be having the issue reported in 000038393 -- Change requests get randomly cancelled and add unrelated activities or continue to provision in RSA Identity Governance & Lifecycle. Please refer to that RSA Knowledge Base Article before proceeding with this one to be sure you are not having that issue.
As the avuser, execute the SQL query below three times, replacing wp_[JOB or ALERT or SCRIPT]_monitor with wp_JOB_monitor, wp_ALERT_monitor, and wp_SCRIPT_monitor, one per run.
select * from (select wpi.cr_id, wpi.name, wpi.proc_state_id, job.* from wp_proci wpi join
(select proci_id, count(proci_id) as num_of_work_in_q from wp_[JOB or ALERT or SCRIPT]_monitor
group by proci_id
having count(proci_id) > 5
order by count(proci_id)) job
on wpi.proci_id = job.proci_id)
order by num_of_work_in_q desc;
This query returns cr_id, which is the internal ID for a change request. If cr_id is null, the job is not associated with a change request. When viewing the query output, look for the change request or workflow that has created the most work for the Workpoint monitors. Typically, one workflow job adds one entry to a queue. As seen when editing workflows, workflow paths are mostly sequential but sometimes have concurrent paths. Regardless, the number of entries in a Workpoint queue created by one job should be very close to one. If you find entries in the double digits or worse, this is a good time to engage RSA Identity Governance & Lifecycle Support to help identify the change request or workflow causing the issue and to assist with terminating it.
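As a convenience, the three runs described above could be combined into a single pass that labels each row with the queue it came from. This is only a sketch, not an RSA-supplied query: it assumes the three monitor tables expose the same proci_id column (implied by the fact that the original query runs unchanged against each of them) and that the database accepts standard UNION ALL syntax. The monitor column alias is introduced here purely for illustration.

```sql
-- Sketch: check all three Workpoint monitor queues in one query.
-- The 'monitor' label is a hypothetical column added for readability;
-- table and column names are taken from the query in this article.
select q.monitor, wpi.cr_id, wpi.name, wpi.proc_state_id,
       q.proci_id, q.num_of_work_in_q
from wp_proci wpi
join (
    select 'JOB' as monitor, proci_id, count(proci_id) as num_of_work_in_q
    from wp_JOB_monitor
    group by proci_id
    having count(proci_id) > 5
    union all
    select 'ALERT', proci_id, count(proci_id)
    from wp_ALERT_monitor
    group by proci_id
    having count(proci_id) > 5
    union all
    select 'SCRIPT', proci_id, count(proci_id)
    from wp_SCRIPT_monitor
    group by proci_id
    having count(proci_id) > 5
) q on wpi.proci_id = q.proci_id
order by q.num_of_work_in_q desc;
```

The threshold of 5 is carried over from the original query. Rows near the top of the output, especially those with counts in the double digits or higher, are the candidates to raise with Support.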
Clean up the Workpoint monitoring queues
Please contact RSA Identity Governance & Lifecycle Support to clean up the Workpoint monitoring queues after the aberrant change request or workflow has been terminated. These queues may need continued monitoring for a while afterward to ensure the problem is truly resolved.