JGroups / JGRP-2162

Failed to send broadcast when opening the connection


    • Type: Bug
    • Resolution: Won't Do
    • Priority: Major
    • 4.0.4
    • None
    • None

      IRC discussion:

      <rvansa> bela_: Hi Bela, I have a weird failure in one test that seems to be rooted in JGroups. TCP_NIO2 is in use, and there's a broadcast message to all nodes, but it seems it's not received on the other side.
      <bela_> rvansa: reproducible?
      <rvansa> bela_: it happens when the connection to a node is just being opened: I have added some trace logs and just a moment before writing to the NioConnection.send_buf it was in state "connection pending"
      <rvansa> bela_: sort of, after tens of runs of that test (on my machine) - and I've seen it first time in CI, so it could be
      <bela_> rvansa: NioConnection buffers writes up to a certain extent, then discards anything over the buffer limit
      <bela_> rvansa: max_send_buffers (default: 10). But retransmission should fix this, unless you don’t wait long enough
      <rvansa> bela_: I don't think it should go over the limit
      <rvansa> bela_: the test is not doing anything else, just sending CommitCommand (that should be couple hundred bytes at most) and then waiting
      <rvansa> bela_: according to the traces I've added, Buffers.write returned false when writing the local address, and then true when writing the actual message
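The buffering behaviour Bela describes above (writes are queued up to a limit, anything beyond it is discarded) can be illustrated with a minimal sketch. This is not the actual NioConnection or Buffers code: the class `BoundedSendBuffer` and its methods are hypothetical names invented for illustration, and only the limit of 10 is taken from the discussion.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical stand-in for a connection's pending-write queue; NOT JGroups code. */
final class BoundedSendBuffer {
    private final Deque<ByteBuffer> pending = new ArrayDeque<>();
    private final int maxBuffers; // plays the role of max_send_buffers (assumption)
    private int dropped;

    BoundedSendBuffer(int maxBuffers) {
        if (maxBuffers <= 0)
            throw new IllegalArgumentException("maxBuffers must be > 0");
        this.maxBuffers = maxBuffers;
    }

    /** Queues data while the connection is not yet writable; returns false if discarded. */
    boolean offer(byte[] data) {
        if (pending.size() >= maxBuffers) {
            dropped++; // over the limit: the write is silently dropped
            return false;
        }
        pending.addLast(ByteBuffer.wrap(data.clone()));
        return true;
    }

    /** Hands out queued buffers in FIFO order, e.g. once the connection is established. */
    ByteBuffer poll() {
        return pending.pollFirst();
    }

    int size() {
        return pending.size();
    }

    int dropped() {
        return dropped;
    }

    public static void main(String[] args) {
        BoundedSendBuffer buf = new BoundedSendBuffer(10);
        for (int i = 0; i < 12; i++)
            buf.offer(new byte[] {(byte) i});
        // prints "queued=10 dropped=2"
        System.out.println("queued=" + buf.size() + " dropped=" + buf.dropped());
    }
}
```

Under this model, the two writes seen in the traces (the local address, then the message) stay well below a limit of 10, which is consistent with rvansa's objection that the limit should not be the cause here.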

      I have been trying to write a reproducer and found that the problem is related to the failing test using a custom (fake) discovery protocol that doesn't open the connection during startup. In my ~reproducer I had to modify tcp-nio.xml to use TCPPING with only the first node in the hosts list (localhost[7800]):

      <TCPPING async_discovery="true" initial_hosts="${jgroups.tcpping.initial_hosts:localhost[7800]}" port_range="0"/>
      

      As a result, the physical connection is not opened by discovery. However, the reproducer has an (always reproducible) flaw: it does not send the message to the third node at all, and therefore the test fails.
      Note that increasing the timeout in request options does not help.

              Assignee: Bela Ban (rhn-engineering-bban)
              Reporter: Radim Vansa (rvansa1@redhat.com, Inactive)