000033713 - Fix Kernel Panic due to Out of Memory Killer in RSA Security Analytics 10.x

Document created by RSA Customer Support Employee on Aug 8, 2016. Last modified by RSA Customer Support Employee on Apr 21, 2017.
Version 2

Article Content

Article Number: 000033713
Applies To
RSA Product Set: RSA Netwitness Logs and Packets
RSA Product/Service Type: All Netwitness Logs and Packets Nodes
RSA Version/Condition: 10.x
Issue
  • The service keeps respawning with a different process ID (PID), and is then killed by the kernel again.
  • This happens when the server is heavily utilized and more than one process is using memory intensively.
  • The kernel repeatedly logs messages reporting that the Out of Memory killer (oom-killer) was invoked.
  • The customer has to start the service manually.
Cause
  • This problem occurs when the server has too little RAM for its workload; in other words, the concurrent processes use more memory than the server can actually provide.
  • By default, the kernel has vm.overcommit_memory set to '0' (heuristic overcommit). In this mode, when an application requests memory (for example, by calling malloc()), the kernel grants the requested address space in all but obviously excessive cases, on the expectation that the applications will never actually use all of these addresses, or at least not at the same time. Memory is counted as truly used only when a program touches the allocation by reading or writing it.
  • The problem arises when the programs do use their allocations at the same time, leaving the kernel without enough physical memory to back them, so the overcommit strategy fails. The kernel then invokes the Out of Memory killer (oom-killer), which selects one process to sacrifice and sends it the KILL signal. This can be seen in the kernel messages in /var/log/messages, as below:

  • Jun 13 12:56:41 Dec1 kernel: NwWarehouseConn invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0
    Jun 13 12:56:41 Dec1 kernel: Out of memory: Kill process 14010 (NwDecoder) score 326 or sacrifice child
    Jun 13 12:56:41 Dec1 kernel: Killed process 14010, UID 0, (NwDecoder) total-vm:101376128kB, anon-rss:23774892kB, file-rss:616kB
    Jun 13 12:56:41 Dec1 collectd[28205]: NgNativeReader_NwWarehouseConnector-FastUpdate: nwsdk failure: NwSendMessage returned 0; code 109; error: 60 second timeout reached waiting for server response
    Jun 13 12:56:41 Dec1 init: nwdecoder main process (14010) killed by KILL signal
    Jun 13 12:56:41 Dec1 init: nwdecoder main process ended, respawning


    From the logs, you can see that the oom-killer was invoked (triggered by an allocation from NwWarehouseConn), that it selected NwDecoder (PID 14010) as the process to sacrifice and sent it the KILL signal, and that init then respawned nwdecoder, which gives it a new PID.
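To check whether a node has been hitting this condition, the "Killed process" lines can be pulled out of the log. The following is a minimal sketch, assuming the log format shown in the excerpt above; the function name oom_victims is hypothetical and is not part of any RSA tooling.

```shell
#!/bin/sh
# Sketch: list processes killed by the OOM killer, as recorded in a
# syslog-style file. The function name is illustrative only.
# Usage: oom_victims [logfile]   (defaults to /var/log/messages)
oom_victims() {
    logfile="${1:-/var/log/messages}"
    # Match kernel lines such as:
    #   ... kernel: Killed process 14010, UID 0, (NwDecoder) total-vm:...
    # and print "PID NAME" for each match. Other lines are ignored.
    sed -n 's/.*kernel: Killed process \([0-9][0-9]*\).*(\([^)]*\)).*/\1 \2/p' "$logfile"
}
```

Running oom_victims with no argument reads /var/log/messages; piping its output through "sort | uniq -c" gives a count per killed process.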
Resolution
  • The solution is to set vm.overcommit_memory to '2', which ensures that the kernel never overcommits: it refuses an allocation request unless the memory can be accounted for within the commit limit, so no assumptions are made about how much of the allocation will actually be used. In other words, memory is promised only when the promise can be fulfilled.

  • [root@Dec1 ~]# echo "vm.overcommit_memory=2" >> /etc/sysctl.conf ;sysctl -p 2> /dev/null

  • The virtual address space used for the overcommit calculation (the commit limit) is swap size + 0.5 * RAM size, where the factor 0.5 comes from the default vm.overcommit_ratio of 50. It is therefore recommended to configure a swap size at least equal to the RAM size, and up to double the RAM size. Setting vm.overcommit_memory to '2' requires the swap size to be larger than the RAM, as per https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Performance_Tuning_Guide/s-memory-captun.html
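The commit limit arithmetic above can be sketched as a small shell function. The function name commit_limit_kb and its arguments are hypothetical, the default ratio of 50 corresponds to the 0.5 factor, and the sketch ignores huge pages, which the kernel also excludes from the RAM term.

```shell
#!/bin/sh
# Sketch: estimate the commit limit used when vm.overcommit_memory=2.
# Usage: commit_limit_kb SWAP_KB RAM_KB [RATIO]
#   SWAP_KB  swap size in kB
#   RAM_KB   RAM size in kB
#   RATIO    percentage of RAM counted (assumed default: 50)
commit_limit_kb() {
    swap_kb=$1
    ram_kb=$2
    ratio=${3:-50}
    # limit = swap + RAM * ratio / 100 (integer arithmetic, result in kB)
    echo $(( swap_kb + ram_kb * ratio / 100 ))
}
```

For a node with 32 GiB of RAM and 32 GiB of swap, "commit_limit_kb 33554432 33554432" gives the estimate in kB. On a running system the result can be compared with the CommitLimit and Committed_AS fields in /proc/meminfo.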
Workaround
  • A possible workaround is to write -17 to the file /proc/$PID/oom_adj, which excludes that specific PID from being killed by the OOM killer.
  • For example, to exclude the nwconcentrator service from being killed, add the line below to the crontab. Because the setting applies to a single PID and is lost when the service restarts, the cron entry reapplies it every minute.

  • * * * * * PID=$(cat /var/run/nwconcentrator.pid) ; echo -17 > /proc/$PID/oom_adj

  • The /proc/$PID/oom_adj file always contains the OOM adjustment value that the kernel assigned to the PID by default, but you can set it manually. It accepts values from -16 (avoid killing this PID as much as possible) up to +15 (prefer killing this PID as much as possible). The special value -17 means never kill this PID.
  • Although this guarantees that the protected process is never killed by the OOM killer, other processes can still be killed.
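The crontab one-liner above writes to /proc/$PID/oom_adj even when the PID file is missing or stale. A slightly more defensive version could look like the sketch below; the function name protect_from_oom and its optional second argument (an alternative proc root, included only so the function can be tried out safely) are assumptions and not part of the product.

```shell
#!/bin/sh
# Sketch: exempt the process named in a PID file from the OOM killer.
# Usage: protect_from_oom PIDFILE [PROC_ROOT]
#   PROC_ROOT defaults to /proc; it is a hypothetical parameter for testing.
protect_from_oom() {
    pidfile=$1
    proc_root=${2:-/proc}
    # Bail out if the service has no readable PID file.
    [ -r "$pidfile" ] || { echo "cannot read $pidfile" >&2; return 1; }
    pid=$(cat "$pidfile")
    # Reject empty or non-numeric content.
    case "$pid" in
        ''|*[!0-9]*) echo "invalid PID in $pidfile" >&2; return 1 ;;
    esac
    # A stale PID file points at a process that no longer exists.
    [ -w "$proc_root/$pid/oom_adj" ] || { echo "no writable oom_adj for PID $pid" >&2; return 1; }
    # -17 is the special value that disables OOM killing for this PID.
    printf '%s\n' -17 > "$proc_root/$pid/oom_adj"
}
```

Saved as a script that ends with a call such as "protect_from_oom /var/run/nwconcentrator.pid", this could replace the inline commands in the cron entry.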
Notes
If you are having the same problem but are unsure of these steps, please contact support@rsa.com
