Resolution: Done
We are running our JVMs with: -XX:OnOutOfMemoryError="kill -9 %p"
We have been experiencing OOMs fairly often, and they always happen in the same thread:
Object / Stack Frame | Name | Shallow Heap | Retained Heap | Context Class Loader | Is Daemon
--------------------------------------------------------------------------------
java.lang.Thread @ 0x81bdf838 | Connection.Receiver [ -],sis-cluster.service,prodpmwsv5-6461 | 120 | 456 | sun.misc.Launcher$AppClassLoader @ 0x800175a8 | false
|- at java.lang.OutOfMemoryError.<init>()V (OutOfMemoryError.java:48)
|- at org.jgroups.blocks.cs.TcpConnection$Receiver.run()V (TcpConnection.java:310)
|- at java.lang.Thread.run()V (Thread.java:745)
--------------------------------------------------------------------------------
The code where it happens is in TcpConnection.java:
    while(canRun()) {
        try {
            int len=in.readInt();
            if(buffer == null || buffer.length < len)
                buffer=new byte[len];
            in.readFully(buffer, 0, len);
            updateLastAccessed();
            server.receive(peer_addr, buffer, 0, len);
        }
        catch(OutOfMemoryError mem_ex) {
            t=mem_ex;
            break; // continue;
        }
        catch(IOException io_ex) {
            t=io_ex;
            break;
        }
        catch(Throwable e) {
        }
    }
The OOM is thrown by the allocation: buffer=new byte[len];
It looks to me as though an invalid, very large length value is received, and the process OOMs when it tries to allocate a huge byte array for it.
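To illustrate how a bogus length could arise, here is a minimal sketch. It is purely hypothetical: nothing in the heap dump shows what the peer actually sent. It assumes some non-JGroups client (an HTTP probe is used as the example) connects to the port, so that readInt() interprets the first four bytes of unrelated data as the message length. The class name LengthPrefixDemo and the helper firstInt are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class LengthPrefixDemo {
    // Interprets the first four bytes of the input as a big-endian int,
    // which is what DataInputStream.readInt() does in the receiver loop.
    public static int firstInt(byte[] data) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            return in.readInt();
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumption: a stray HTTP request reaches the cluster port.
        byte[] stray = "GET / HTTP/1.1\r\n".getBytes(StandardCharsets.US_ASCII);
        int len = firstInt(stray);
        // "GET " is 0x47455420, i.e. 1195725856, roughly 1.1 GiB.
        System.out.println("length prefix read: " + len);
        System.out.println("exceeds a 1 GiB heap: " + (len > 1024L * 1024 * 1024));
    }
}
```

Any four bytes whose first byte is an ordinary ASCII letter decode to a value in the hundreds of megabytes or more, which would be enough to fail against a 1 GB heap.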
Running the JVMs without kill on OOM would make this issue "disappear", in the sense that the error is swallowed by:
    catch(OutOfMemoryError mem_ex) {
        t=mem_ex;
        break; // continue;
    }
Handling OutOfMemoryError is a strange implementation choice.
Instead, a size limit should be enforced to protect against receiving invalid sizes.
My heap limit is 1 GB and my heap dumps are 50 MB, so the attempted allocation must have been huge.
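The size limit suggested above could look roughly like the sketch below. This is not the JGroups code or a proposed patch; BoundedFrameReader, readFrame and the 64 MB cap are hypothetical names and values chosen for illustration. The idea is simply to validate the length prefix before allocating, and to treat an out-of-range value as an I/O error on that connection rather than letting it become an OutOfMemoryError.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

public class BoundedFrameReader {
    // Hypothetical cap for illustration; a real limit would be configurable.
    public static final int MAX_LENGTH = 64 * 1024 * 1024;

    // Reads one length-prefixed frame, rejecting the length before allocating.
    public static byte[] readFrame(DataInputStream in, int maxLength) throws IOException {
        int len = in.readInt();
        if (len < 0 || len > maxLength)
            throw new IOException("invalid frame length " + len + " (max " + maxLength + ")");
        byte[] buffer = new byte[len];
        in.readFully(buffer, 0, len);
        return buffer;
    }

    public static void main(String[] args) throws IOException {
        // A well-formed frame: length 3 followed by three payload bytes.
        byte[] valid = {0, 0, 0, 3, 'a', 'b', 'c'};
        byte[] payload = readFrame(new DataInputStream(new ByteArrayInputStream(valid)), MAX_LENGTH);
        System.out.println("payload bytes: " + payload.length);

        // A corrupt frame: the length prefix is Integer.MAX_VALUE.
        byte[] bogus = {0x7f, (byte) 0xff, (byte) 0xff, (byte) 0xff};
        try {
            readFrame(new DataInputStream(new ByteArrayInputStream(bogus)), MAX_LENGTH);
        } catch (IOException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

With a check like this, a corrupt length closes the offending connection through the existing IOException path, and the -XX:OnOutOfMemoryError handler is never triggered by it.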