JGroups / JGRP-2892

FILE_PING causes infinite retry loop when async_discovery=true and coordinator restarts during slow discovery


    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • Fix Version: 5.5.0
    • Affects Version: 5.4.8

      We encountered an issue using FILE_PING with async_discovery=true. When the coordinator node restarts during discovery (especially when discovery is slow, e.g. due to file read latency), the coordinator enters an infinite loop, repeatedly calling findMembers() and never making progress.

      To help reproduce the issue, we created a custom FilePing class extending FILE_PING and added logging:

       

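      import java.util.List;

      import org.jgroups.Address;
      import org.jgroups.protocols.FILE_PING;
      import org.jgroups.util.Responses;
      import org.slf4j.Logger;
      import org.slf4j.LoggerFactory;

      // Diagnostic subclass: logs every discovery call together with the responses it returned.
      public class FilePing extends FILE_PING {
          private static final Logger LOGGER = LoggerFactory.getLogger(FilePing.class);

          @Override
          public Responses findMembers(List<Address> members, boolean initial_discovery, boolean async, long timeout) {
              Responses out = super.findMembers(members, initial_discovery, async, timeout);
              LOGGER.info("FilePing.findMembers: members={}, initial_discovery={}, async={}, timeout={}, resp={}",
                  members, initial_discovery, async, timeout, out);
              return out;
          }
      }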

      Observed behavior:

      When restarting the coordinator while discovery is slow, we observe repeated log output from findMembers() with no progress:

       

      FilePing.findMembers: members=null, initial_discovery=true, async=false, timeout=6000, resp=0 rsps (0 coords) [pending]
      FilePing.findMembers: members=null, initial_discovery=true, async=false, timeout=6000, resp=0 rsps (0 coords) [pending]
      ... (repeats indefinitely)
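To make a stall like the one above easier to spot than by reading repeated log lines, a small counter could flag a run of consecutive empty, still-pending results. The following is only a sketch: `StallDetector`, its `record` method, and the threshold are hypothetical names and are not part of JGroups. It assumes the caller can obtain the response count and completion flag from the value returned by findMembers().

```java
/** Hypothetical helper (not part of JGroups): flags repeated empty, pending discovery results. */
final class StallDetector {
    private final int threshold;   // consecutive empty results that count as a stall
    private int consecutiveEmpty;  // current run of empty, still-pending results

    StallDetector(int threshold) {
        if (threshold < 1) {
            throw new IllegalArgumentException("threshold must be >= 1");
        }
        this.threshold = threshold;
    }

    /**
     * Records one discovery result.
     *
     * @param responseCount number of responses seen (the "0 rsps" part of the log line)
     * @param done          whether discovery reported completion (false for "[pending]")
     * @return true once the run of empty, pending results has reached the threshold
     */
    boolean record(int responseCount, boolean done) {
        if (responseCount == 0 && !done) {
            consecutiveEmpty++;
        } else {
            consecutiveEmpty = 0;  // any progress resets the run
        }
        return consecutiveEmpty >= threshold;
    }
}
```

In the logging subclass, a `true` return value could trigger a single warning instead of an unbounded stream of identical info lines.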


      Stack configuration:

       

      <config xmlns="urn:org:jgroups"  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
              xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/jgroups.xsd">
      
          <UDP ip_mcast="false"
               diag.enabled="false"
               ucast_send_buf_size="1M"
               ucast_recv_buf_size="6M"
               mcast_send_buf_size="1M"
               mcast_recv_buf_size="6M"
               thread_pool.enabled="true"
               logical_addr_cache_max_size="10000"
               logical_addr_cache_expiration="3600s"
               bind_addr="${jgroups.udp.ucast.addr}"
               bind_port="${jgroups.udp.ucast.port}"
               port_range="${jgroups.udp.ucast.port.range}"
               thread_pool.min_threads="${jgroups.threads.min}"
               thread_pool.max_threads="${jgroups.threads.max}"
               thread_pool.keep_alive_time="${jgroups.threads.ttl}"/>
      
          <cn.nextop.gadget.etcd.jgroups.FilePing
              location="/app/erebor/jgroups"
              async_discovery="true" />
      
          <MERGE3 min_interval="10s" max_interval="30s" />
          <FD_ALL3 interval="6000" timeout="16000" />
          <VERIFY_SUSPECT2 timeout="3s" num_msgs="1" />
          <NAKACK4 xmit_interval="0.3s" capacity="8192"/>
          <UNICAST4 xmit_interval="0.3s" capacity="2048"/>
          <pbcast.GMS join_timeout="6.0s" max_join_attempts="0"/>
          <FRAG2 frag_size="60K"/>
      </config>
      

      Additional info:

      • When we set async_discovery=false, the issue does not occur.
      • It seems that when async discovery takes too long to return responses, the restarted coordinator falls into an infinite retry loop in ClientGmsImpl.joinInternal().
      • We suspect the problem is due to an interaction between async_discovery=true and the discovery result not being available fast enough during initial_discovery.
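The suspicion above can be illustrated with a toy model. This is not JGroups code and not a confirmed root cause: it assumes that every join attempt starts a fresh discovery and waits at most the join timeout for responses, and that max_join_attempts="0" (as in the stack configuration) means no limit on attempts. `JoinRetrySketch`, `attemptsUntilJoin`, and `simulationCap` are hypothetical names introduced for the example.

```java
/** Toy model of a join retry loop; an illustration only, not the JGroups implementation. */
final class JoinRetrySketch {

    /**
     * Simulates join attempts where each attempt restarts discovery from scratch.
     *
     * @param discoveryLatencyMs time one discovery needs before responses are available
     * @param joinTimeoutMs      time one join attempt waits for responses
     * @param maxJoinAttempts    attempt limit; 0 is modelled as "no limit" (assumption)
     * @param simulationCap      safety bound so the simulation itself terminates
     * @return the 1-based attempt that obtained responses, or -1 if none did
     */
    static int attemptsUntilJoin(long discoveryLatencyMs, long joinTimeoutMs,
                                 int maxJoinAttempts, int simulationCap) {
        int limit = maxJoinAttempts > 0
                ? Math.min(maxJoinAttempts, simulationCap)
                : simulationCap;
        for (int attempt = 1; attempt <= limit; attempt++) {
            // Progress from earlier attempts is discarded in this model, so the
            // outcome of every attempt depends only on latency versus timeout.
            if (discoveryLatencyMs <= joinTimeoutMs) {
                return attempt;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Fast discovery: the first attempt gets responses.
        System.out.println(attemptsUntilJoin(2_000, 6_000, 0, 1_000));
        // Slow discovery with unlimited attempts: no attempt ever gets responses.
        System.out.println(attemptsUntilJoin(8_000, 6_000, 0, 1_000));
    }
}
```

In this model, a discovery latency above the join timeout never yields responses, no matter how many attempts are made, which matches the repeated "0 rsps (0 coords) [pending]" output. Whether the real code path behaves this way is exactly what the report asks to have investigated.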

      Environment:

      • JGroups version: 5.4.8.Final
      • Java version: Java 21

       

        1. jgroups-custom.xml (2 kB, Bela Ban)
        2. MainInProcess.java (3 kB, Bela Ban)

              rhn-engineering-bban Bela Ban
              leon_a chen baoyi