  1. JGroups
  2. JGRP-1618

missingMessageReceived() never called in NAKACK resulting in memory leak

    • Type: Bug
    • Resolution: Won't Do
    • Priority: Major
    • Fix Version/s: 3.2.9, 3.3
    • Affects Version/s: 2.8.1, 2.12.2
    • None

      Run ehcache 2.4 + JGroups under continuous load and monitor heap memory usage with JConsole. Select the Memory tab and watch "Memory Pool CMS Old Gen". Memory usage grows until ConcurrentMarkSweep GC starts running regularly but fails to free up memory.

      We are using JGroups 2.8.1 and encountered a memory leak that eventually exhausted CMS Old Gen memory. A heap dump showed that the growth was in the xmit_stats ConcurrentHashMap of org.jgroups.protocols.pbcast.NAKACK.

      After much analysis, here is what we found. When the system is under load, messages can start arriving out of order. When the receiver sees a higher sequence number than expected, it asks the sender to retransmit the missing messages with the lower sequence numbers. The sender resends them. However, because of a bug in NakReceiverWindow, missingMessageReceived() is never invoked, so the retransmitted message is never purged from the Map that tracks missing messages (xmit_stats). Over time this Map grows and uses up CMS Old Gen. It only shrank when a server left the cluster, at which point the missing-message entries for that server were purged.
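The mechanism described above can be illustrated with a toy model. This is a minimal sketch, not the JGroups source: the classes `Tracker` and `Window`, the method `gapDetected`, and the flag `notifyOnFill` are hypothetical names invented for illustration. Only `missingMessageReceived` and the idea of an xmit_stats-style map come from the report.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy model of the leak; NOT the real NAKACK / NakReceiverWindow code.
class XmitStatsLeakSketch {

    // Hypothetical stand-in for the protocol side that owns the xmit_stats-like map.
    static final class Tracker {
        // seqno -> time at which the gap was noticed
        final Map<Long, Long> stats = new ConcurrentHashMap<>();

        void gapDetected(long from, long to) {
            for (long s = from; s <= to; s++) {
                stats.put(s, System.nanoTime());
            }
        }

        // The callback the report says was never invoked in 2.8.1 / 2.12.2.
        void missingMessageReceived(long seqno) {
            stats.remove(seqno);
        }
    }

    // Hypothetical stand-in for the receiver window.
    static final class Window {
        private final Tracker tracker;
        private final boolean notifyOnFill; // false models the reported bug
        private long highest = 0;

        Window(Tracker tracker, boolean notifyOnFill) {
            this.tracker = tracker;
            this.notifyOnFill = notifyOnFill;
        }

        void add(long seqno) {
            if (seqno > highest) {
                if (seqno > highest + 1) {
                    // A message arrived early: record every skipped seqno as missing.
                    tracker.gapDetected(highest + 1, seqno - 1);
                }
                highest = seqno;
            } else if (notifyOnFill) {
                // A retransmitted message filled a gap: tell the tracker to purge it.
                tracker.missingMessageReceived(seqno);
            }
        }
    }

    // Simulates 'rounds' out-of-order pairs and returns how many entries remain tracked.
    static int run(boolean notifyOnFill, int rounds) {
        Tracker tracker = new Tracker();
        Window window = new Window(tracker, notifyOnFill);
        for (int i = 0; i < rounds; i++) {
            long base = 2L * i;
            window.add(base + 2); // arrives first, so base + 1 is recorded as missing
            window.add(base + 1); // the retransmission arrives
        }
        return tracker.stats.size();
    }

    public static void main(String[] args) {
        System.out.println("entries left without callback: " + run(false, 1000));
        System.out.println("entries left with callback:    " + run(true, 1000));
    }
}
```

In this model the map keeps one entry per out-of-order pair when the callback is skipped and ends up empty when it is made, which mirrors the unbounded growth seen in the heap dump.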

      In JMX, the MissingMsgsReceived attribute of jgroups:cluster=*,protocol=NAKACK,type=protocol was always zero, confirming that it never purged any received "missing messages".
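The same check can be scripted instead of read off JConsole. The following is a rough sketch using the standard javax.management API. It assumes the JGroups MBeans are registered with the platform MBeanServer of the JVM the code runs in; for a remote JVM a JMX connector would be needed instead. The class and method names (`NakackJmxCheck`, `readAttribute`) are made up for this example; the ObjectName pattern and attribute name are the ones quoted above.

```java
import java.lang.management.ManagementFactory;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

// Hypothetical helper, not part of JGroups.
class NakackJmxCheck {

    // Reads one attribute from every MBean whose name matches the given pattern.
    static Map<ObjectName, Object> readAttribute(MBeanServerConnection conn,
                                                 String pattern,
                                                 String attribute) throws Exception {
        Map<ObjectName, Object> values = new LinkedHashMap<>();
        for (ObjectName name : conn.queryNames(new ObjectName(pattern), null)) {
            values.put(name, conn.getAttribute(name, attribute));
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        // Assumption: JGroups runs in this JVM and exposes its protocols via JMX.
        Map<ObjectName, Object> values = readAttribute(
                ManagementFactory.getPlatformMBeanServer(),
                "jgroups:cluster=*,protocol=NAKACK,type=protocol",
                "MissingMsgsReceived");
        if (values.isEmpty()) {
            System.out.println("no NAKACK MBeans found in this JVM");
        }
        values.forEach((name, value) -> System.out.println(name + " -> " + value));
    }
}
```

A value that stays at zero while retransmissions are clearly happening would be consistent with the behaviour reported here.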

      When I looked at NakReceiverWindow.java in the most recent GA version, 3.2.8, it has the corrected logic that ensures missingMessageReceived() is called. I checked the most recent 2.x release, 2.12.2, and found it has the same bug in its logic as 2.8.1. The bug may also apply to other 2.x releases, but I did not check them.

      Attached is the fixed NakReceiverWindow.java for 2.8.1.

      After applying the patch, the memory leak went away.

        1. 1618-jgrp-patch.jar
          6 kB
        2. NakReceiverWindow.java
          22 kB
        3. screenshot-1.png
          63 kB

              rhn-engineering-bban Bela Ban
              hmark_jira Harry Mark (Inactive)
              Votes: 0
              Watchers: 4
