-
Bug
-
Resolution: Won't Do
-
Major
-
None
-
None
IRC discussion:
<rvansa> bela_: Hi Bela, I have a weird failure in one test that seems to be rooted in JGroups. TCP_NIO2 is in charge, and there's a broadcast message to all nodes, but it seems it's not received on the other side.
<bela_> rvansa: reproducible?
<rvansa> bela_: it happens when the connection to a node is just being opened: I have added some trace logs and just a moment before writing to the NioConnection.send_buf it was in state "connection pending"
<rvansa> bela_: sort of, after tens of runs of that test (on my machine) - and I've seen it first time in CI, so it could be
<bela_> rvansa: NioConnection buffers writes up to a certain extent, then discards anything over the buffer limit
<bela_> rvansa: max_send_buffers (default: 10). But retransmission should fix this, unless you don’t wait long enough
<rvansa> bela_: I don't think it should go over the limit
<rvansa> bela_: the test is not doing anything else, just sending CommitCommand (that should be couple hundred bytes at most) and then waiting
<rvansa> bela_: according to the traces I've added, Buffers.write returned false when writing the local address, and then true when writing the actual message
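The buffering behaviour described in the discussion above (writes are queued while the connection is pending, and anything over the limit is discarded) can be illustrated with a minimal sketch. This is not the JGroups `NioConnection` or `Buffers` code; the class `BoundedSendQueue` and its methods are hypothetical names made up for illustration, and only the limit of 10 is taken from the `max_send_buffers` default mentioned above.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical illustration only; NOT the actual JGroups implementation. */
public class BoundedSendQueue {
    private final int maxBuffers;                 // plays the role of max_send_buffers
    private final Deque<byte[]> pending = new ArrayDeque<>();
    private int dropped;                          // messages discarded over the limit

    public BoundedSendQueue(int maxBuffers) {
        this.maxBuffers = maxBuffers;
    }

    /** Queues a message while the connection is pending; returns false if it was discarded. */
    public boolean offer(byte[] msg) {
        if (pending.size() >= maxBuffers) {
            dropped++;                            // over the limit: silently dropped
            return false;
        }
        pending.addLast(msg);
        return true;
    }

    /** Simulates the connection completing: empties the queue and returns how many messages were flushed. */
    public int flush() {
        int flushed = pending.size();
        pending.clear();
        return flushed;
    }

    public int pending() { return pending.size(); }

    public int dropped() { return dropped; }
}
```

In this model, a test that queues only two small buffers (the local address and one CommitCommand) stays well under a limit of 10, so nothing would be discarded. That is consistent with the doubt raised above that the buffer limit explains the lost message.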
I have been trying to write a reproducer, and found that the failure is related to the failing test using a custom (fake) discovery protocol that doesn't open the connection during startup. In my (approximate) reproducer I had to modify tcp-nio.xml to use TCPPING with only the first node in the hosts list (localhost[7800]):
<TCPPING async_discovery="true" initial_hosts="${jgroups.tcpping.initial_hosts:localhost[7800]}" port_range="0"/>
As a result, discovery does not open the physical connection. However, the reproducer has an (always reproducible) flaw: it does not send the message to the third node at all, so the test fails.
Note that increasing the timeout in request options does not help.