000028963 - Authentication and replication down on one Authentication Manager 7.1 server; Security Console down if server is primary

Document created by RSA Customer Support Employee on Jun 14, 2016
Last modified by RSA Customer Support Employee on Apr 21, 2017
Version 2

Article Content

Article Number: 000028963
Applies To:
RSA Product Set: SecurID
RSA Product/Service Type: Authentication Manager
RSA Version/Condition: 7.1 SP4 after patch 30
 
Issue
  • Security Console not accessible (HTTP error 503) if issue is on the primary.
  • Replication down to the affected server.
  • Replication OK from the affected server.
  • Affected server not responding to authentication requests.  Authentications should be automatically diverted by agents to unaffected RSA Authentication Manager servers.
  • Oracle alert log contains events which indicate a database deadlock is occurring, for example:
    • Errors indicating Oracle was unable to acquire a lock, such as:

ORA-20002: the error occurred with remote OOB invocation. ORA-20007: ORA-20008: ORA-20667: Unable to Acquire Lock for IMS_BATCH_MERGE_LOCKAM_TOKEN_OOB
ORA-20002: the error occurred with remote OOB invocation. ORA-20667: Unable to Acquire Lock for IMS_BATCH_REPLICATION_LOCK


  • Evidence of a long running transaction (LRT), for example:

C001: long running txn detected, xid: 0x0003.014.0001b58b


  • Many wait events are logged for different sids continuously waiting on other sids.  If a wait does not clear, it is logged again every 300 seconds and the number of seconds shown increases accordingly.  Some waits occur occasionally under normal operation and clear without intervention.  A deadlock does not clear on its own: the seconds value climbs to 1000 or more, and the events continue to be logged periodically until the issue is resolved.  For example:

A003: warning – apply server 1, sid 426 waiting on user sid 413 for event (since 6013 seconds):
A001: warning – apply server 2, sid 393 waiting on user sid 382 for event (since 6316 seconds):
A005: warning – apply server 1, sid 382 waiting on user sid 394 for event (since 6319 seconds):
A003: warning – apply server 1, sid 426 waiting on user sid 413 for event (since 6313 seconds):
A001: warning – apply server 2, sid 393 waiting on user sid 382 for event (since 6618 seconds):
A005: warning – apply server 1, sid 382 waiting on user sid 394 for event (since 6620 seconds):


Note that the events above will be interleaved with other, normal events in the alert log rather than grouped together as shown.  Search the alert log file to see whether waits are clearing or persisting.
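
One way to see whether waits are clearing or persisting is to summarize the warnings per waiting/blocking sid pair. The following is a minimal sketch, assuming the warning lines have exactly the format shown above. The function name summarize_waits is hypothetical, and the default threshold of 1000 seconds is taken from the description above; neither is part of the product.

```shell
#!/bin/sh
# summarize_waits LOGFILE [THRESHOLD]
# Hypothetical helper (not an RSA or Oracle tool). For each
# "waiting sid -> blocking sid" pair in LOGFILE, print the most recently
# logged wait time and mark it STUCK at or above THRESHOLD seconds.
summarize_waits() {
    awk -v limit="${2:-1000}" '
        /sid [0-9]+ waiting on user sid [0-9]+/ {
            waiter = ""; blocker = ""; secs = 0
            for (i = 1; i < NF; i++) {
                if ($i == "sid" && waiter == "")  waiter = $(i + 1)
                else if ($i == "sid")             blocker = $(i + 1)
                else if ($i == "(since")          secs = $(i + 1) + 0
            }
            key = waiter " " blocker
            if (!(key in latest)) order[++n] = key   # remember first-seen order
            latest[key] = secs                       # later lines overwrite earlier ones
        }
        END {
            for (j = 1; j <= n; j++) {
                split(order[j], pair, " ")
                state = (latest[order[j]] >= limit) ? "STUCK" : "ok"
                printf "sid %s waiting on sid %s: %d seconds [%s]\n", pair[1], pair[2], latest[order[j]], state
            }
        }
    ' "$1"
}
```

The summary shows only the last logged value for each pair. A pair that stops appearing in later log entries has cleared, so compare the output against the timestamps in the log before concluding that a wait is stuck.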
 

The Oracle alert log is located in <am_home>/db/admin/<Oracle_SID>/bdump/alert_<Oracle_SID>.log on UNIX, Linux and the RSA SecurID Appliance, and in C:\Program Files\RSA Security\RSA Authentication Manager\db\admin\<Oracle_SID>\bdump\alert_<Oracle_SID>.log on Windows.
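
As a quick first check on UNIX or Linux, the alert log can be searched for the lock errors and long running transaction messages quoted above. This is a minimal sketch, assuming the path layout and message strings given in this article. The function names alert_log_path and count_deadlock_indicators are hypothetical and not part of the product.

```shell
#!/bin/sh
# alert_log_path AM_HOME ORACLE_SID
# Builds the UNIX/Linux alert log path from the layout given in this article.
alert_log_path() {
    printf '%s/db/admin/%s/bdump/alert_%s.log\n' "$1" "$2" "$2"
}

# count_deadlock_indicators LOGFILE
# Hypothetical helper: counts the log lines that contain the two message
# types quoted in this article.
count_deadlock_indicators() {
    # grep -c exits non-zero when the count is 0, hence "|| true".
    lock_errors=$(grep -c 'ORA-20667' "$1") || true
    long_txns=$(grep -c 'long running txn detected' "$1") || true
    printf 'lock errors: %s\nlong running txns: %s\n' "$lock_errors" "$long_txns"
}
```

For example, count_deadlock_indicators "$(alert_log_path /usr/local/RSASecurity/RSAAuthenticationManager "$ORACLE_SID")" would report both counts, assuming the ORACLE_SID variable holds the database SID. Non-zero counts are only an indication and should be read together with the wait events in the log.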

 
Cause
The issue is caused by a database deadlock within Authentication Manager's underlying database.
Workaround

On a UNIX or Linux Server or RSA SecurID 3.0 Appliance


1.  Open an SSH session to the server and sudo to the rsaadmin user.
2.  Navigate to <am_home>/utils (e.g., /usr/local/RSASecurity/RSAAuthenticationManager/utils).
3.  Optionally, gather diagnostic information now to send later to RSA Support.  This can only be done before step 4.

./rsautil manage-database -a exec-sql -f diagnostics/IMS_transRpt.sql -A /tmp/transRpt.html


4.  As an immediate workaround to release the database deadlock, stop and restart the database.  This will take a few minutes to execute:

./rsautil manage-database -a stop-db 
./rsautil manage-database -a start-db


5.  Navigate to <am_home>/server and confirm that all services are now restored:

./rsaam status all

6.  If diagnostic information was gathered in step 3, send the file /tmp/transRpt.html to RSA Support to analyze.
Any time up to 7 days after the event, additional diagnostic data about the event can also be gathered by running an AWR report.  This may help find the cause of the deadlock, especially if step 3 above was omitted.  RSA Support will provide the instructions to obtain an AWR, if it is required.
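
Steps 3 to 5 above can be collected into a small wrapper so that the diagnostics are always captured before the restart. This is a minimal sketch, assuming it is run as the rsaadmin user from step 1. The function names run_in and release_deadlock and the AM_HOME and DRY_RUN variables are hypothetical; only the rsautil and rsaam command lines come from the steps above, and any prompts those commands issue still apply.

```shell
#!/bin/sh
# Hypothetical wrapper around the commands from steps 3 to 5.
# AM_HOME defaults to the example install path from step 2.
# DRY_RUN=1 (the default here) only prints what would be run.
AM_HOME=${AM_HOME:-/usr/local/RSASecurity/RSAAuthenticationManager}
DRY_RUN=${DRY_RUN:-1}

# run_in DIR COMMAND...: run COMMAND from DIR, or print it in dry-run mode.
run_in() {
    dir=$1; shift
    if [ "$DRY_RUN" = "1" ]; then
        printf 'would run in %s: %s\n' "$dir" "$*"
    else
        (cd "$dir" && "$@")
    fi
}

release_deadlock() {
    # Step 3 (optional): capture diagnostics; must happen before the restart.
    run_in "$AM_HOME/utils" ./rsautil manage-database -a exec-sql \
        -f diagnostics/IMS_transRpt.sql -A /tmp/transRpt.html ||
        echo "warning: diagnostics step failed, continuing" >&2
    # Step 4: stop and restart the database to release the deadlock.
    run_in "$AM_HOME/utils" ./rsautil manage-database -a stop-db || return 1
    run_in "$AM_HOME/utils" ./rsautil manage-database -a start-db || return 1
    # Step 5: confirm that all services are restored.
    run_in "$AM_HOME/server" ./rsaam status all
}
```

Calling release_deadlock with the defaults prints the four commands without running them. Setting DRY_RUN=0 before the call executes them in order and stops if the database fails to stop or start.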

On a Windows Server


1.  RDP to the affected server or connect directly to it with a keyboard, monitor and mouse.

2.  Open a command prompt and navigate to <am_home>\utils (e.g., C:\Program Files\RSA Security\RSA Authentication Manager\utils).

3.  Optionally, gather diagnostic information now to send later to RSA Support.  This can only be done before step 4.

 

C:\Program Files\RSA Security\RSA Authentication Manager\utils> rsautil manage-database -a exec-sql -f "C:\Program Files\RSA Security\RSA Authentication Manager\diagnostics\IMS_transRpt.sql" -A "C:\Program Files\RSA Security\RSA Authentication Manager\diagnostics\transRpt.html"


4.  As an immediate workaround to release the database deadlock, stop and restart the database through Windows Services.  This will take a few minutes to execute.
5.  Refresh Windows Services to confirm that all services are now restored.
6.  If diagnostic information was gathered in step 3, send the file C:\Program Files\RSA Security\RSA Authentication Manager\diagnostics\transRpt.html to RSA Support to analyze.
