Bug · Resolution: Won't Do · rhel-8.4.0 · rhel-ha
This is intended to track possible ways to deal with "Retransmit List" messages logged by corosync while a cluster is forming. We're occasionally seeing the following on larger clusters with 8-16 nodes:
Aug 17 09:54:42 [12155] east-09.lab.bos.redhat.com corosync notice [TOTEM ] Retransmit List: 1
Aug 17 09:54:42 [12899] east-10.lab.bos.redhat.com corosync notice [TOTEM ] Failed to receive the leave message. failed: 1
Aug 17 09:54:42 [17878] east-11.lab.bos.redhat.com corosync notice [TOTEM ] Retransmit List: 1
Aug 17 09:54:42 [17878] east-11.lab.bos.redhat.com corosync notice [TOTEM ] Failed to receive the leave message. failed: 1
Aug 17 09:54:42 [12346] east-13.lab.bos.redhat.com corosync notice [TOTEM ] Retransmit List: 1
Aug 17 09:54:42 [12346] east-13.lab.bos.redhat.com corosync notice [TOTEM ] Failed to receive the leave message. failed: 1
Aug 17 09:54:42 [11647] east-14.lab.bos.redhat.com corosync notice [TOTEM ] Retransmit List: 1
Aug 17 09:54:42 [20347] east-15.lab.bos.redhat.com corosync notice [TOTEM ] Retransmit List: 1
Aug 17 09:54:42 [10881] east-16.lab.bos.redhat.com corosync notice [TOTEM ] Retransmit List: 1
Eventually the communication degrades completely and the retransmit list keeps growing:
Aug 13 11:41:23 [1979] host-027.virt.lab.msp.redhat.com corosync notice [TOTEM ] Retransmit List: 7 8 9 a b c d e f 10 11 12 13 14 15 16 17 18 19 1a 1b 1c 1d 1e 1f 20 21 22 23 24
Aug 13 11:41:23 [1979] host-027.virt.lab.msp.redhat.com corosync notice [TOTEM ] Retransmit List: 25 26 27 28 29 2a 2b 2c 2d 2e 2f 30 31 32
I have been able to reproduce this on 16-node clusters with KVM-based virtual machines in multiple labs we have in BRQ and MSP, but also on a 7-node physical cluster in the BOS lab.
The RHEL base is 7.6, with the following package versions currently in use:
corosync-2.4.3-4.el7.x86_64
pacemaker-1.1.19-6.el7.x86_64
pcs-0.9.165-3.el7.x86_64
By experimenting with various totem options suggested to me by Jan Friesse, I've found the 'send_join' option to be the most influential in mitigating this behaviour. On my 16-node virtual-machine clusters the issue reproduces very reliably, and running as many as 500 startup iterations shows a threshold around send_join: 50, above which retransmits become scarce (~15 occurrences out of 500 runs). Setting send_join to 100 seems to be enough to eliminate the retransmits altogether (for now; my tests are still running).
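For reference, this is roughly how I check the effective setting and the failure rate on a node; the totem.send_join cmap key is my assumption based on corosync.conf totem values being mirrored into cmap when set:
# effective send_join on the running node (key name assumed, only present when configured)
corosync-cmapctl -g totem.send_join
# how often the two messages showed up in the current log
grep -c "Retransmit List" /var/log/cluster/corosync.log
grep -c "Failed to receive the leave message" /var/log/cluster/corosync.log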
The corosync.conf manual page states the following for send_join: "For configurations with less than 32 nodes, this parameter is not necessary". If that advice does not hold in practice, I am wondering whether we could:
- reconsider the defaults
- update documentation to suggest tuning certain totem options
- let pcs determine optimal values somehow
That said, since the root cause of the retransmits is still eluding me, I am not inclined toward any specific solution right now.
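Just to illustrate the "let pcs determine optimal values" idea, a toy sketch that derives a send_join suggestion from the node count; the scaling rule below is purely hypothetical and is not something pcs does today:
# count nodes in corosync.conf and print a hypothetical send_join suggestion
# (the "more than 8 nodes, ~6 ms per node" rule is made up for illustration;
# with 16 nodes it lands close to the value 100 that worked for me)
nodes=$(grep -c "ring0_addr" /etc/corosync/corosync.conf)
suggested=$(( nodes > 8 ? nodes * 6 : 0 ))
echo "send_join: ${suggested}"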
To avoid performance issues with pcs starting the whole cluster from a single node (which itself seems to cause other issues, such as the "Process pause detected for..." messages), I've been running 'pcs cluster start' on all nodes in parallel.
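The loop I've been using looks roughly like this; a minimal sketch assuming ssh access from a control host, placeholder hostnames and an arbitrary settle time:
# restart the cluster repeatedly, starting corosync on all nodes in parallel,
# and report any iteration that produced retransmits
NODES="node-01 node-02 node-03"        # placeholders; the real list has 16 nodes
for i in $(seq 1 500); do
    for n in $NODES; do ssh "$n" 'pcs cluster stop' & done; wait
    # corosync is stopped now, so the logs can be safely truncated
    for n in $NODES; do ssh "$n" ': > /var/log/cluster/corosync.log' & done; wait
    for n in $NODES; do ssh "$n" 'pcs cluster start' & done; wait
    sleep 60                           # arbitrary settle time
    for n in $NODES; do
        if ssh "$n" 'grep -q "Retransmit List" /var/log/cluster/corosync.log'; then
            echo "iteration $i: retransmits seen on $n"
        fi
    done
done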
My corosync.conf looks like this (tested with both UDP and UDPU with no noticeable difference):
totem {
    version: 2
    cluster_name: STSRHTS8926
    secauth: off
    transport: udp
    send_join: 100
}
nodelist {
    node
    ....
    node
}
quorum {
    provider: corosync_votequorum
}
logging {
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
    timestamp: on
}
(Since this issue is not necessarily a bug in corosync itself, we can change the assigned component as needed.)