OpenJDK / OPENJDK-3679

Crash due to G1 heap regions not always retired when using multiple NUMA nodes


    • Type: Bug
    • Resolution: Done
    • Priority: Blocker
    • 21.0.6 GA

      This method call replaces the heap region we allocate from (the "alloc region") with a free one: https://github.com/openjdk/jdk21u/blob/jdk-21.0.6%2B6/src/hotspot/share/gc/g1/g1CollectedHeap.cpp#L441

      When an alloc region is replaced, it's crucial to add the old one to the set of regions that will be processed during the next garbage collection cycle (the "collection set").

      Adding a region to the collection set is called "retirement".

      The call at line 441 doesn't retire the old region itself. Instead, it relies on running after, and only after, the call at line 430 (the previous allocation attempt), which does.

      This construction is solid until we introduce NUMA support. For the code discussed here, NUMA support means that each NUMA node now has its own alloc region. Each thread is associated with a node (through G1Allocator::current_node_index), and each node is associated with an alloc region.

      This works well as long as the association between threads and nodes doesn't change, but under our load it sometimes does.

      When the association changes between lines 430 and 441, one region is never retired: line 441 doesn't retire any region, and line 430 has retired the region associated with a different node.

      A region that is not retired is a lost region that causes the JVM to crash after some time.

      This is an assert that fails when the issue happens: https://github.com/openjdk/jdk21u/blob/jdk-21.0.6%2B6/src/hotspot/share/gc/g1/g1AllocRegion.cpp#L135

      This is our naive fix:
      --- a/src/hotspot/share/gc/g1/g1AllocRegion.inline.hpp
      +++ b/src/hotspot/share/gc/g1/g1AllocRegion.inline.hpp
      @@ -117,6 +117,7 @@ inline HeapWord* G1AllocRegion::attempt_allocation_force(size_t word_size) {
         assert_alloc_region(_alloc_region != nullptr, "not initialized properly");

         trace("forcing alloc", word_size, word_size);
      +  retire(true);
         HeapWord* result = new_alloc_region_and_allocate(word_size, true /* force */);
         if (result != nullptr) {
           trace("alloc forced", word_size, word_size, word_size, result);
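
The sequence described above, and the effect of retiring inside the forced path, can be illustrated with a small toy model. This is only a sketch under stated assumptions, not HotSpot code: ToyHeap, attempt_locked, attempt_force, and lost_regions are hypothetical names, regions are plain integers, and the "thread changes node" event is simulated by switching an index between the two calls.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model only; none of these names exist in HotSpot.
constexpr int kNoRegion = -1;  // marks "this node has no live alloc region"

struct ToyHeap {
    int next_free = 0;                // id of the next free region to hand out
    int lost = 0;                     // regions replaced without being retired
    std::vector<int> alloc_region;    // current alloc region per NUMA node
    std::vector<int> collection_set;  // regions that have been retired

    explicit ToyHeap(std::size_t nodes) : alloc_region(nodes, kNoRegion) {
        for (int& r : alloc_region) r = next_free++;
    }

    // Retire the node's alloc region, if it still has one.
    void retire(std::size_t node) {
        assert(node < alloc_region.size());
        if (alloc_region[node] == kNoRegion) return;
        collection_set.push_back(alloc_region[node]);
        alloc_region[node] = kNoRegion;
    }

    // Stand-in for the attempt at line 430, assumed to fail:
    // the old region is retired and no replacement is installed.
    void attempt_locked(std::size_t node) { retire(node); }

    // Stand-in for the forced attempt at line 441: installs a fresh region.
    // With retire_first == false it trusts that the caller already retired.
    void attempt_force(std::size_t node, bool retire_first) {
        if (retire_first) retire(node);
        if (alloc_region[node] != kNoRegion) ++lost;  // overwritten, never retired
        alloc_region[node] = next_free++;
    }
};

// Runs the two calls back to back, optionally moving the thread to another
// node in between, and returns how many regions were lost.
int lost_regions(bool migrate, bool retire_first) {
    ToyHeap heap(2);
    std::size_t node = 0;
    heap.attempt_locked(node);
    if (migrate) node = 1;  // thread-to-node association changes here
    heap.attempt_force(node, retire_first);
    return heap.lost;
}
```

In this model, lost_regions(true, false) reports one lost region, while the same migration with retire_first set to true, or any run without migration, reports none. The guard in retire is a property of the toy model; whether the real retire(true) call is safe when the region was already retired is not shown here.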

              rh-ee-tstuefe Thomas Stuefe
              rhn-support-mmillson Michael Millson