Article Number
000038393
Applies To
RSA Product Set: RSA Identity Governance & Lifecycle
RSA Version/Condition: 7.1.0, 7.1.1
Issue
Change requests are randomly and unexpectedly cancelled in RSA Identity Governance & Lifecycle, generate up to thousands of unrelated activities, and/or continue to provision even though they have been cancelled.
Additional symptoms include:
- High CPU usage
- Workflows stop progressing
- Change requests stop progressing
- There may be database performance issues due to the rapid growth of Workpoint tables such as WP_PROCI.
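If rapid Workpoint table growth is suspected, a quick check is to look at the row count and segment size of the affected tables. The following is a minimal sketch that assumes the default AVUSER schema and access to the DBA_SEGMENTS view; adjust the owner and table list to your environment.
select count(*) as proci_rows from avuser.wp_proci;
select segment_name, round(bytes / 1024 / 1024) as size_mb
from dba_segments
where owner = 'AVUSER'
and segment_name in ('WP_PROCI', 'WP_PROCI_SPLIT');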
Errors potentially seen in the log files include:
- Oracle alert log (/u01/app/oracle/diag/rdbms/avdb/AVDB/trace/alert_AVDB.log) (rapid database growth):
ORA-1653: unable to extend table AVUSER.WP_PROCI by 8192 in tablespace DATA_1M
- aveksaServer.log file ($AVEKSA_HOME/wildfly/standalone/log/aveksaServer.log) (heavy workload):
2020-01-24 10:09:03,430 [Worker_jobq#Normal#WPDS_30718]
ERROR com.workpoint.services.impl.GenericServiceImpl - Numeric Overflow
- WorkPoint.log file ($AVEKSA_HOME/wildfly/standalone/log/WorkPoint.log):
This issue can cause a single change request to be associated with a huge number (hundreds of thousands or millions) of unrelated change request items, all with workflow tasks that are not relevant. These workflow tasks are typically not actionable, but they are on the Workpoint task list and are processed by Workpoint. This may result in errors similar, but not identical, to the following in the WorkPoint.log:
2020-01-21 10:29:13,403 [Worker_jobq#Normal#WPDS_8]
ERROR com.workpoint.server.monitor.JobMonitorHelper -
doCreateJobs() failed creating Split Node Job for Split ID 671879 in node 136338560:WPDS
in parent Job 6251271:WPDS
java.sql.SQLException: BeanFactory not initialized or already closed -
call 'refresh' before accessing beans via the ApplicationContext
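To gauge whether a change request has accumulated an abnormal number of items, a query along the following lines can be run as avuser. This is a sketch using the PV_CHANGE_REQUEST_DETAIL view referenced later in this article; the 10,000 threshold is arbitrary and only intended to surface extreme cases.
select change_request_id, count(*) as detail_count
from avuser.pv_change_request_detail
group by change_request_id
having count(*) > 10000
order by detail_count desc;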
Cause
This is a known issue reported in engineering tickets ACM-97445, ACM-101694 and ACM-101765.
This issue occurs when a select query used to expand (create) Workpoint items for change request details by Change Request ID faults at the database abstraction layer. The exception is not trapped, leaving the query to match change request details belonging to other change requests. This causes Workpoint to instantiate a very large number (millions) of irrelevant jobs tied to the change request but based on change request details from other change requests.
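Purely as an illustration (this is not the product's internal query), the expansion is conceptually scoped to a single change request; when the fault is not trapped, the effect is as if that scoping predicate were lost:
-- intended: expand details for one change request only
select id from avuser.pv_change_request_detail where change_request_id = :cr_id;
-- effective behaviour when the fault is not trapped: details belonging to
-- other change requests are matched as well
select id from avuser.pv_change_request_detail;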
Resolution
This issue is resolved in the following RSA Identity Governance & Lifecycle versions and patch levels, which introduce additional error trapping around the database connection layer.
- RSA Identity Governance & Lifecycle 7.1.1 P04
- RSA Identity Governance & Lifecycle 7.2.0
Customers who have encountered this issue are strongly encouraged to patch to a version where this is resolved. Even if a particular event resolves itself, the severity of future events cannot be predicted.
NOTE: Until you are able to upgrade, you can use the tools provided in RSA Knowledge Base Article 000038546 (Queries to detect workflows and change requests that could overload the Workpoint Server in RSA Identity Governance & Lifecycle) to monitor the system for change requests or workflows that may cause this issue.
Workaround
This is not a workaround. These are instructions for resolving the situation once it has occurred. Please patch as soon as possible.
There are two parts to this resolution:
- Detect and terminate the aberrant change request and/or workflow.
- Clean up the Workpoint monitoring queues. Be careful not to cancel any valid change requests or valid work that is in the Workpoint queues.
As avuser, run the following SQL detection scripts:
- Script 1: Execute this script three times, replacing wp_[JOB or ALERT or SCRIPT]_monitor with wp_JOB_monitor, wp_ALERT_monitor, and wp_SCRIPT_monitor, one per run.
The output gives you a list of change requests with corrupted information from unrelated change requests that are currently being processed.
select * from
(select * from (Select wpi.cr_id, wpi.name, wpi.proc_state_id, job.* from wp_proci wpi join
(select proci_id, count(proci_id) as NUM_OF_WORK_IN_Q from wp_[JOB or ALERT or SCRIPT]_monitor
group by proci_id
having count(proci_id) > 5
order by count(proci_id)) job
on wpi.proci_id = job.proci_id)
order by NUM_OF_WORK_IN_Q desc) LIVE
JOIN
(select * from
(with tbleA as (
select
PS.PROCI_ID,
ji.cri_id
from avuser.WP_PROCI_SPLIT ps
join avuser.T_AV_WFJOB_ITEMS ji
on PS.PROCI_NODE_ID = JI.NODE_ID and JI.NODE_DB = PS.PROCI_NODE_DB and JI.SPLIT_ID = PS.SPLIT_ID
)
select distinct
cr.id,
cr.name,
cr.request_date,
cr.completed_date
from tbleA tmp
join AVUSER.WP_PROCI pi on tmp.PROCI_ID = pi.PROCI_ID
join AVUSER.PV_CHANGE_REQUEST_DETAIL crd on tmp.cri_id = crd.id
join AVUSER.PV_CHANGE_REQUEST cr on pi.cr_id = cr.id
join AVUSER.PV_CHANGE_REQUEST cr1 on CRD.CHANGE_REQUEST_ID = cr1.id
where PI.CR_ID != CRD.CHANGE_REQUEST_ID)) BAD
on LIVE.CR_ID = BAD.ID
NOTE: This same issue can occur with an aberrant workflow that is not associated with a change request, such as custom tasks or rules. In this case the cr_id will be null.
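As an optional convenience (not part of the original script), the counting subquery can be run against all three monitor tables in a single pass, with a literal column identifying the queue. This is a sketch assuming the monitor table names listed above:
select 'JOB' as queue, proci_id, count(proci_id) as num_of_work_in_q
from wp_job_monitor group by proci_id having count(proci_id) > 5
union all
select 'ALERT' as queue, proci_id, count(proci_id)
from wp_alert_monitor group by proci_id having count(proci_id) > 5
union all
select 'SCRIPT' as queue, proci_id, count(proci_id)
from wp_script_monitor group by proci_id having count(proci_id) > 5
order by num_of_work_in_q desc;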
- Script 2: The output gives you a list of cancelled CRs that are still provisioning:
select cr.* from t_av_change_requests cr
where exists
(select 1 from t_av_change_request_details crd
where cr.id = crd.change_requests_id and crd.state = 'CA')
and not exists
(select 1 from t_av_change_request_details crd
where cr.id = crd.change_requests_id and crd.operand_type != 'RF' and crd.state != 'CA')
and cr.current_state = 'PV';
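If Script 2 returns rows, a follow-up query such as the following can help confirm which items of a returned change request are still active. This is a sketch only: replace :cr_id with an ID from the output, and note that the crd.id column is assumed here; the other column names follow those used in the script above.
select crd.id, crd.state, crd.operand_type
from t_av_change_request_details crd
where crd.change_requests_id = :cr_id
order by crd.state;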
If you have output from either script, you have this issue.
NEXT STEPS:
- You must immediately identify the change request that is affected and complete the cancellation of that request. Until the change request is fully cancelled, the system may continue to degrade. It is important to contact RSA Identity Governance & Lifecycle Support as soon as possible to obtain a script to remediate the problem request(s) and to get assistance with cleaning up the Workpoint monitor queues.
- Consider restoring to a known good database backup from a time before the issue occurred.
Notes