Bug · Resolution: Won't Do · rhel-8.4.0 · rhel-ha
This is intended to track possible ways to deal with "Retransmit List" messages logged by corosync while a cluster is forming. We're occasionally seeing the following on larger clusters with 8-16 nodes:
Aug 17 09:54:42 [12155] east-09.lab.bos.redhat.com corosync notice [TOTEM ] Retransmit List: 1
Aug 17 09:54:42 [12899] east-10.lab.bos.redhat.com corosync notice [TOTEM ] Failed to receive the leave message. failed: 1
Aug 17 09:54:42 [17878] east-11.lab.bos.redhat.com corosync notice [TOTEM ] Retransmit List: 1
Aug 17 09:54:42 [17878] east-11.lab.bos.redhat.com corosync notice [TOTEM ] Failed to receive the leave message. failed: 1
Aug 17 09:54:42 [12346] east-13.lab.bos.redhat.com corosync notice [TOTEM ] Retransmit List: 1
Aug 17 09:54:42 [12346] east-13.lab.bos.redhat.com corosync notice [TOTEM ] Failed to receive the leave message. failed: 1
Aug 17 09:54:42 [11647] east-14.lab.bos.redhat.com corosync notice [TOTEM ] Retransmit List: 1
Aug 17 09:54:42 [20347] east-15.lab.bos.redhat.com corosync notice [TOTEM ] Retransmit List: 1
Aug 17 09:54:42 [10881] east-16.lab.bos.redhat.com corosync notice [TOTEM ] Retransmit List: 1
Eventually the communication degrades completely and the retransmit list keeps growing:
Aug 13 11:41:23 [1979] host-027.virt.lab.msp.redhat.com corosync notice [TOTEM ] Retransmit List: 7 8 9 a b c d e f 10 11 12 13 14 15 16 17 18 19 1a 1b 1c 1d 1e 1f 20 21 22 23 24
Aug 13 11:41:23 [1979] host-027.virt.lab.msp.redhat.com corosync notice [TOTEM ] Retransmit List: 25 26 27 28 29 2a 2b 2c 2d 2e 2f 30 31 32
I have been able to reproduce this on 16-node clusters with KVM-based virtual machines in multiple labs we have in BRQ and MSP, but also on a 7-node physical cluster in the BOS lab.
The RHEL base is 7.6, with the following package versions currently in use:
corosync-2.4.3-4.el7.x86_64
pacemaker-1.1.19-6.el7.x86_64
pcs-0.9.165-3.el7.x86_64
By experimenting with various totem options suggested to me by Jan Friesse, I've found the 'send_join' option to be the most influential in mitigating this behaviour. On my 16-node virtual-machine clusters the issue reproduces very reliably, and running as many as 500 startup iterations shows a threshold around send_join: 50, above which retransmits become scarce (~15 occurrences out of 500 runs). Setting send_join to 100 seems to be enough to eliminate the retransmits altogether (for now; my tests are still running).
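For reference, this is roughly how I check the effective setting and the failure rate on a node; the totem.send_join cmap key is my assumption based on corosync.conf totem values being mirrored into cmap when set:
# effective send_join on the running node (key name assumed, only present when configured)
corosync-cmapctl -g totem.send_join
# how often the two messages showed up in the current log
grep -c "Retransmit List" /var/log/cluster/corosync.log
grep -c "Failed to receive the leave message" /var/log/cluster/corosync.log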
The corosync.conf manual page states the following for send_join: "For configurations with less than 32 nodes, this parameter is not necessary". If that advice does not hold in practice, I am wondering whether we could:
- reconsider the defaults
- update documentation to suggest tuning certain totem options
- let pcs determine optimal values somehow
That said, since the root cause of the retransmits is still eluding me, I am not inclined toward any specific solution right now.
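Just to illustrate the "let pcs determine optimal values" idea, a toy sketch that derives a send_join suggestion from the node count; the scaling rule below is purely hypothetical and is not something pcs does today:
# count nodes in corosync.conf and print a hypothetical send_join suggestion
# (the "more than 8 nodes, ~6 ms per node" rule is made up for illustration;
# with 16 nodes it lands close to the value 100 that worked for me)
nodes=$(grep -c "ring0_addr" /etc/corosync/corosync.conf)
suggested=$(( nodes > 8 ? nodes * 6 : 0 ))
echo "send_join: ${suggested}"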
To avoid performance issues with pcs starting the whole cluster from a single node (which itself seems to cause other issues, such as the "Process pause detected for..." messages), I've been running 'pcs cluster start' on all nodes in parallel.
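The loop I've been using looks roughly like this; a minimal sketch assuming ssh access from a control host, placeholder hostnames and an arbitrary settle time:
# restart the cluster repeatedly, starting corosync on all nodes in parallel,
# and report any iteration that produced retransmits
NODES="node-01 node-02 node-03"        # placeholders; the real list has 16 nodes
for i in $(seq 1 500); do
    for n in $NODES; do ssh "$n" 'pcs cluster stop' & done; wait
    # corosync is stopped now, so the logs can be safely truncated
    for n in $NODES; do ssh "$n" ': > /var/log/cluster/corosync.log' & done; wait
    for n in $NODES; do ssh "$n" 'pcs cluster start' & done; wait
    sleep 60                           # arbitrary settle time
    for n in $NODES; do
        if ssh "$n" 'grep -q "Retransmit List" /var/log/cluster/corosync.log'; then
            echo "iteration $i: retransmits seen on $n"
        fi
    done
done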
My corosync.conf looks like this (tested with both UDP and UDPU with no noticeable difference):
totem {
    version: 2
    cluster_name: STSRHTS8926
    secauth: off
    transport: udp
    send_join: 100
}
nodelist {
    node
    ....
    node
}
quorum {
    provider: corosync_votequorum
}
logging {
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
    timestamp: on
}
(Since this issue is not necessarily a bug in corosync itself, we can change the assigned component as needed.)