000033705 - Vulnerability Risk Management 1.1 SP1 cluster for RSA Archer runs out of drive space on /opt and MapR services fail to restart

Document created by RSA Customer Support Employee on Aug 9, 2016. Last modified by RSA Customer Support Employee on Apr 22, 2017. Version 5.

Article Content

Article Number000033705
Applies ToRSA Product Set: Security Management
RSA Product/Service Type: Vulnerability Risk Manager
RSA Version/Condition: 1.1 SP1
Platform: CentOS
IssueThe following error may be presented to the user on the VRM cluster when attempting to start or stop MapR services:
"line 93: /opt/mapr/logs/warden.log: No space left on device"
The following alarm may be triggered on the cluster:
NODE_ALARM_OPT_MAPR_FULL
Installation Directory Full
The partition /opt/mapr on the node is running out of space (95% full).
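To confirm whether this alarm is currently raised, the cluster alarms can be listed from any node (the maprcli alarm commands are the same family as the clearall command used in the workaround below):

    # List all alarms currently raised on the cluster
    maprcli alarm list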

The output of the "df -h" command indicates 100% utilization of /opt.
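A quick way to confirm the condition is to check the /opt mount directly; the output below is illustrative only (device name and sizes will vary per node):

    # Check utilization of the /opt partition
    df -h /opt
    # Filesystem            Size  Used Avail Use% Mounted on    (illustrative)
    # /dev/mapper/vg-opt     20G   20G     0 100% /opt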
 
CauseThere is a known defect in VRM versions prior to 1.2 where the /opt/mapr/hadoop/hadoop-0.20.2/pids directory continues to grow in size until the /opt volume is out of drive space.
Additionally, if any MapR services are failing (status 4) for any reason, they fill their logging directories with detailed logs of their failures, and these log files also grow until the /opt volume is out of drive space.
Finally, if there is heavy data throughput on the nodes, the TaskTracker logs generated during normal operation of the cluster can grow very large and fill up the /opt partition, because MapR log retention is based on age in days rather than on log file size.
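To see which of these locations is consuming the space on a given node, the directories named in this article can be sized directly (a quick check; all paths are as shipped with VRM 1.1 SP1):

    # Report the total size of the directories that commonly fill /opt
    du -sh /opt/mapr/hadoop/hadoop-0.20.2/pids \
           /opt/mapr/hadoop/hadoop-0.20.2/logs \
           /opt/mapr/logs \
           /opt/mapr/hbase/hbase-0.94.13/logs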
Resolution
  1. Follow the workaround below to clean up the folders in this partition so that an upgrade can be completed.

  2. Upgrade to Vulnerability Risk Management 1.2.

  3. Run the command: ls -al /opt/mapr/hadoop/hadoop-0.20.2/logs

  4. Identify whether regular cluster throughput is causing any TaskTracker log files to exceed 70 Megabytes per day (70000000 bytes). If so, complete the "(As required) Change log.retention.time" procedure below; a quick size check is sketched after this list.
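One way to spot oversized TaskTracker logs without scanning the full listing is shown below; the filename pattern is an assumption based on the usual Hadoop naming and should be adjusted if the files on your node are named differently:

    # List TaskTracker log files larger than 70 MB, with human-readable sizes
    find /opt/mapr/hadoop/hadoop-0.20.2/logs -name '*tasktracker*' -size +70M -exec ls -lh {} \;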
(As required) Change log.retention.time of the MapR warden
This needs to be completed only if regular MapR throughput is generating log files large enough to fill up the /opt partition (see step 4 above).

  1. Stop MapR services on the cluster node
    service mapr-zookeeper stop
    service mapr-warden stop

  2. Move to the MapR configuration folder
    cd /opt/mapr/conf

  3. Copy the existing configuration file to another file named after the current date
    cp ./warden.conf ./warden08242016.conf

  4. Edit the warden configuration file
    vi ./warden.conf

  5. Append a log.retention.time entry (3 days in milliseconds) to the end of the warden.conf file (a check to confirm the setting follows this procedure)
    log.retention.time=259200000

  6. Start MapR services on the cluster node
    service mapr-zookeeper start
    service mapr-warden start

  7. Repeat steps 1-6 for each node in the cluster.
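After the services restart, it is worth confirming that the appended entry is in place; a minimal check (the grep pattern simply matches the line added in step 5):

    # Confirm the retention entry is present in the warden configuration
    grep 'log.retention.time' /opt/mapr/conf/warden.conf
    # Expected: log.retention.time=259200000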


WorkaroundFollow the steps below to delete logs older than 4 days and stale process ID (PID) files of MapR services older than 7 days:
  1. Log into a cluster node as root user
  2. Confirm low /opt disk space with the "df" command.
  3. service mapr-zookeeper stop
  4. service mapr-warden stop
  5. find /opt/mapr/hadoop/hadoop-0.20.2/pids/* -mtime +7 -print -delete
  6. find /opt/mapr/hadoop/hadoop-0.20.2/logs/hadoop-mapr* -mtime +4 -print -delete
  7. find /opt/mapr/logs/* -mtime +4 -print -delete
  8. find /opt/mapr/hbase/hbase-0.94.13/logs/*log* -mtime +4 -print -delete
  9. service mapr-zookeeper start
  10. service mapr-warden start
  11. (wait 2 minutes)
  12. maprcli service list -node NODE_NAME
  13. Verify that no MapR services have a status of 4 before proceeding. If any service has status 4 (FAILED), it will need to be investigated for other problems (a quick filter is sketched after this list).
  14. Repeat steps 1-13 for each cluster node that is low on drive space on the /opt partition.
  15. maprcli alarm clearall    (to clear all old alarms)
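If the service list is long, the output of step 12 can be filtered for failed entries; this sketch assumes the state value appears in the second column of the default maprcli output, which should be verified against your cluster:

    # Show only services whose state column reports 4 (FAILED)
    maprcli service list -node NODE_NAME | awk '$2 == "4"'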

Note: NODE_NAME in step 12 above is the hostname of the cluster node.



Note: If an upgrade to VRM 1.2 is not completed, then these commands will need to be rerun on each cluster node, as needed, to ensure /opt does not fill up; a sketch of a combined cleanup script follows.
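For nodes that repeatedly fill up before the upgrade can be scheduled, workaround steps 3-10 can be collected into a single script. This is a minimal sketch using exactly the commands and retention ages listed above; the script name is hypothetical, and it must be run as root on the affected node:

    #!/bin/bash
    # cleanup_opt.sh - hypothetical helper wrapping workaround steps 3-12 for one node

    # Stop MapR services (steps 3-4)
    service mapr-zookeeper stop
    service mapr-warden stop

    # Delete stale PID files older than 7 days and logs older than 4 days (steps 5-8)
    find /opt/mapr/hadoop/hadoop-0.20.2/pids/* -mtime +7 -print -delete
    find /opt/mapr/hadoop/hadoop-0.20.2/logs/hadoop-mapr* -mtime +4 -print -delete
    find /opt/mapr/logs/* -mtime +4 -print -delete
    find /opt/mapr/hbase/hbase-0.94.13/logs/*log* -mtime +4 -print -delete

    # Restart MapR services (steps 9-10)
    service mapr-zookeeper start
    service mapr-warden start

    # Allow services to come up, then list their status (steps 11-12)
    sleep 120
    maprcli service list -node $(hostname)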
 
