Type: Bug
Resolution: Done
Priority: Major
Fix Version/s: 5.5.0
Affects Version/s: 5.4.8

We encountered an issue using FILE_PING with async_discovery=true. When the coordinator node restarts during discovery (especially when discovery is slow, e.g. due to file read latency), the coordinator enters an infinite loop, repeatedly calling findMembers() and never making progress.

To help reproduce the issue, we created a custom FilePing class extending FILE_PING and added logging:

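import java.util.List;
import org.jgroups.Address;
import org.jgroups.protocols.FILE_PING;
import org.jgroups.util.Responses;
import org.slf4j.Logger;          // SLF4J assumed from the LoggerFactory / {} placeholder usage
import org.slf4j.LoggerFactory;

public class FilePing extends FILE_PING {
    private static final Logger LOGGER = LoggerFactory.getLogger(FilePing.class);

    // Delegates to FILE_PING and logs every discovery call together with its result
    @Override
    public Responses findMembers(List<Address> members, boolean initial_discovery, boolean async, long timeout) {
        Responses out = super.findMembers(members, initial_discovery, async, timeout);
        LOGGER.info("FilePing.findMembers: members={}, initial_discovery={}, async={}, timeout={}, resp={}",
            members, initial_discovery, async, timeout, out);
        return out;
    }
}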

Observed behavior:

When restarting the coordinator while discovery is slow, we observe repeated log output from findMembers() with no progress:

FilePing.findMembers: members=null, initial_discovery=true, async=false, timeout=6000, resp=0 rsps (0 coords) [pending]
FilePing.findMembers: members=null, initial_discovery=true, async=false, timeout=6000, resp=0 rsps (0 coords) [pending]
... (repeats indefinitely)

Stack configuration:

<config xmlns="urn:org:jgroups"  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/jgroups.xsd">

    <UDP ip_mcast="false"
         diag.enabled="false"
         ucast_send_buf_size="1M"
         ucast_recv_buf_size="6M"
         mcast_send_buf_size="1M"
         mcast_recv_buf_size="6M"
         thread_pool.enabled="true"
         logical_addr_cache_max_size="10000"
         logical_addr_cache_expiration="3600s"
         bind_addr="${jgroups.udp.ucast.addr}"
         bind_port="${jgroups.udp.ucast.port}"
         port_range="${jgroups.udp.ucast.port.range}"
         thread_pool.min_threads="${jgroups.threads.min}"
         thread_pool.max_threads="${jgroups.threads.max}"
         thread_pool.keep_alive_time="${jgroups.threads.ttl}"/>

    <cn.nextop.gadget.etcd.jgroups.FilePing
        location="/app/erebor/jgroups"
        async_discovery="true" />

    <MERGE3 min_interval="10s" max_interval="30s" />
    <FD_ALL3 interval="6000" timeout="16000" />
    <VERIFY_SUSPECT2 timeout="3s" num_msgs="1" />
    <NAKACK4 xmit_interval="0.3s" capacity="8192"/>
    <UNICAST4 xmit_interval="0.3s" capacity="2048"/>
    <pbcast.GMS join_timeout="6.0s" max_join_attempts="0"/>
    <FRAG2 frag_size="60K"/>
</config>

Additional info:

When we set async_discovery=false, the issue does not occur.
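
For reference, the workaround described above amounts to a one-attribute change to the discovery element of the stack shown earlier. This is only a sketch of the configuration we tested with, not a fix for the underlying problem; all other protocols in the stack are assumed to stay unchanged.

```xml
<!-- Workaround sketch: same discovery protocol and location as in the stack above,
     with asynchronous discovery turned off -->
<cn.nextop.gadget.etcd.jgroups.FilePing
    location="/app/erebor/jgroups"
    async_discovery="false" />
```

With this setting the repeated findMembers() log lines no longer appear in our environment.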

It seems that when the async discovery takes too long to return responses, the coordinator (in ClientGmsImpl.joinInternal()) falls into an infinite retry loop.

We suspect the problem is due to an interaction between async_discovery=true and the discovery result not being available fast enough during initial_discovery.

Environment:

JGroups version: 5.4.8.Final
Java version: 21

Attachments:

jgroups-custom.xml, 2 kB, 2025/07/17 12:32 PM, Bela Ban
MainInProcess.java, 3 kB, 2025/07/17 12:32 PM, Bela Ban

Assignee: Bela Ban
Reporter: chen baoyi
Votes: 1
Watchers: 2
Created: 2025/06/05 4:35 AM
Updated: 2025/07/28 11:39 AM
Resolved: 2025/07/21 12:07 PM
