Type: Bug
Resolution: Done-Errata
Priority: Normal
Fix Version/s: None
Affects Version/s: rhel-8.8.0, rhel-8.8.0.z, rhel-9.2.0, rhel-9.2.0.z
Component/s: kronosnet
Labels:
- Triaged

Fixed in Build:
kronosnet-1.28-1.el8
Regression:
None
Severity:
Moderate
Keywords:

Patch, Upstream, TestCaseProvided

AssignedTeam:
rhel-ha
Sub-System Group:

ssg_filesystems_storage_and_HA

Story Points:
None
ACKs Check:

QE ack
Blocked:
False
Ready:
False
Blocked Reason:

None
Product Documentation Required:
None
Products:

Red Hat Enterprise Linux
Sprint:
None

Preliminary Testing:
Pass
Errata Link:
https://errata.devel.redhat.com/advisory/123988
Test Coverage:
None

Experience:

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Planning:
None

What were you trying to do that didn't work?

pcs cluster start --all
on a 5-node cluster where one node is an older hardware generation than the others.

The cluster formed 2 partitions:

one with a single node (older hw, yet it started faster)
one with 4 nodes (they took longer to start corosync, but they started in sync)

The single node attempted to send membership packets to the other nodes, but those packets were rejected because the 4 nodes were still initializing their knet links. By the time the 4 nodes had completed link initialization with that node, they had decided that node1 was not part of the membership and fenced it.

Please provide the package NVR for which bug is seen:

All of them. This is a design decision in knet that has existed from the start.

How reproducible:

always

Steps to reproduce

This is a complex race condition to reproduce on normal clusters. So far I have seen this problem only on one BM cluster that is currently used for SAS workload calibration.

It is possible to reproduce it manually every time; it's just a bit inconvenient.

Create a 2-node cluster. To simulate the failure, each node needs a different corosync.conf:

node1:

totem {
    version: 2
    secauth: on
    cluster_name: demo
    crypto_cipher: aes256
    crypto_hash: sha256
    config_version: 1
}

nodelist {
    node {
        name: rhel8-node1
        ring0_addr: 192.168.9.41
        nodeid: 1
    }

    node {
        name: rhel8-node2
        ring0_addr: 192.168.9.42
        nodeid: 2
    }
}

quorum {
    provider: corosync_votequorum
    two_node: 1
}

logging {
    debug: on
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
    timestamp: on
}

This is a pretty much standard corosync.conf.

For the second node, we need to tweak token and knet_pong_count to delay the knet link initialization code.

node2:

totem {
    version: 2
    secauth: on
    cluster_name: demo
    crypto_cipher: aes256
    crypto_hash: sha256
    config_version: 1
    token: 30000
    interface {
        linknumber: 0
        knet_pong_count: 30
    }
}

nodelist {
    node {
        name: rhel8-node1
        ring0_addr: 192.168.9.41
        nodeid: 1
    }

    node {
        name: rhel8-node2
        ring0_addr: 192.168.9.42
        nodeid: 2
    }
}

quorum {
    provider: corosync_votequorum
    two_node: 1
}

logging {
    debug: on
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
    timestamp: on
}
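To get a feel for how long node2 holds the link down with these settings, here is a rough back-of-the-envelope sketch. It assumes knet_ping_interval is left unset, so that corosync derives it as token / (knet_pong_count * 2) as described in corosync.conf(5); the function name is made up for illustration, and real timing also depends on network latency and startup order.

```python
# Rough estimate of how long a node waits before marking a knet link up.
# Assumption: knet_ping_interval is not configured, so it defaults to
# token / (knet_pong_count * 2), per the corosync.conf(5) man page.

def estimated_link_up_delay_ms(token_ms: int, pong_count: int) -> float:
    ping_interval_ms = token_ms / (pong_count * 2)
    # The link is only marked up after pong_count valid pongs were received.
    return ping_interval_ms * pong_count


# Values taken from the node2 totem section above.
node2_delay_ms = estimated_link_up_delay_ms(token_ms=30000, pong_count=30)
print(f"node2 needs roughly {node2_delay_ms / 1000:.0f} s of pongs before link 0 goes up")
```

Under these assumptions the ping interval works out to 500 ms, which is at least consistent with the roughly two pongs per second visible in the debug log.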

Actual results

With the current version of knet, node2 rejects membership packets with:

Sep 19 04:59:15 debug [KNET ] rx: host: 1 link: 0 received pong: 5
Sep 19 04:59:16 debug [KNET ] rx: Source host 1 not reachable yet. Discarding packet.
Sep 19 04:59:16 debug [KNET ] rx: host: 1 link: 0 received pong: 6
Sep 19 04:59:16 debug [KNET ] rx: Source host 1 not reachable yet. Discarding packet.
Sep 19 04:59:17 debug [KNET ] rx: host: 1 link: 0 received pong: 7
Sep 19 04:59:17 debug [KNET ] rx: Source host 1 not reachable yet. Discarding packet.
Sep 19 04:59:17 debug [KNET ] rx: host: 1 link: 0 received pong: 8
Sep 19 04:59:17 debug [KNET ] rx: Source host 1 not reachable yet. Discarding packet.
Sep 19 04:59:18 debug [KNET ] rx: host: 1 link: 0 received pong: 9
Sep 19 04:59:18 debug [KNET ] rx: Source host 1 not reachable yet. Discarding packet.
Sep 19 04:59:18 debug [KNET ] rx: host: 1 link: 0 received pong: 10
Sep 19 04:59:19 debug [KNET ] rx: Source host 1 not reachable yet. Discarding packet.
Sep 19 04:59:19 debug [KNET ] rx: host: 1 link: 0 received pong: 11
Sep 19 04:59:19 debug [KNET ] rx: Source host 1 not reachable yet. Discarding packet.

causing the creation of the 2 memberships described above.

The new code deals with this situation better. It immediately brings the link up and forms membership:
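When checking whether a cluster is hitting this, it can help to count the discarded packets per source host. The following is a minimal sketch that assumes the debug log format shown above; the regular expressions and the summarize() helper are illustrative and not part of corosync or knet.

```python
import re
from collections import Counter

# Patterns modelled on the debug lines quoted in this issue.
DISCARD_RE = re.compile(
    r"\[KNET\s*\] rx: Source host (\d+) not reachable yet\. Discarding packet\."
)
PONG_RE = re.compile(r"\[KNET\s*\] rx: host: (\d+) link: (\d+) received pong: (\d+)")


def summarize(lines):
    """Return (discards per host, last pong counter seen per (host, link))."""
    discards = Counter()
    last_pong = {}
    for line in lines:
        m = DISCARD_RE.search(line)
        if m:
            discards[int(m.group(1))] += 1
            continue
        m = PONG_RE.search(line)
        if m:
            host, link, count = (int(g) for g in m.groups())
            last_pong[(host, link)] = count
    return discards, last_pong


sample = """\
Sep 19 04:59:15 debug [KNET ] rx: host: 1 link: 0 received pong: 5
Sep 19 04:59:16 debug [KNET ] rx: Source host 1 not reachable yet. Discarding packet.
Sep 19 04:59:16 debug [KNET ] rx: host: 1 link: 0 received pong: 6
Sep 19 04:59:16 debug [KNET ] rx: Source host 1 not reachable yet. Discarding packet.
"""

discards, last_pong = summarize(sample.splitlines())
for host, count in discards.items():
    print(f"host {host}: {count} packets discarded while link was still down")
```

To run it against a real node, pass the lines of /var/log/cluster/corosync.log (the logfile configured above) instead of the sample.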

Sep 19 05:00:58 debug [KNET ] rx: host: 1 link: 0 received pong: 1
Sep 19 05:00:58 debug [KNET ] rx: host: 1 link: 0 received pong: 2
Sep 19 05:00:59 debug [KNET ] rx: host: 1 link: 0 received pong: 3
Sep 19 05:00:59 debug [TOTEM ] Knet pMTU change: 421
Sep 19 05:00:59 debug [KNET ] rx: host: 1 link: 0 received data during valid ping/pong activity. Force link up.
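The difference between the two behaviours can be illustrated with a toy model. This is not the knet implementation: the Link class, its fields, and the "at least one pong received" threshold for valid ping/pong activity are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Link:
    pong_count: int          # pongs required before the link is marked up
    force_up_on_data: bool   # True models the fixed behaviour (assumed semantics)
    received_pongs: int = 0
    up: bool = False

    def on_pong(self) -> None:
        self.received_pongs += 1
        if self.received_pongs >= self.pong_count:
            self.up = True

    def on_data(self) -> bool:
        """Return True if the data packet is accepted, False if discarded."""
        if self.up:
            return True
        # Fixed behaviour: data arriving while pongs are flowing forces the link up.
        if self.force_up_on_data and self.received_pongs > 0:
            self.up = True
            return True
        return False


def run(force_up_on_data: bool) -> list:
    link = Link(pong_count=30, force_up_on_data=force_up_on_data)
    results = []
    for _ in range(5):
        link.on_pong()
        results.append(link.on_data())  # peer sends a membership packet
    return results


old_behaviour = run(force_up_on_data=False)
new_behaviour = run(force_up_on_data=True)
print("old:", old_behaviour)
print("new:", new_behaviour)
```

In this model the old behaviour discards every membership packet until all 30 pongs have arrived, while the new behaviour accepts the first data packet that shows up once pongs are being exchanged.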

links to

RHBA-2023:123988 kronosnet bug fix and enhancement update

Assignee: Christine Caulfield

Reporter: Fabio Massimo Di Nitto

Contributors: Barry Marson, Christine Caulfield, Jan Friesse, Patrik Hagara

Developer: Christine Caulfield

QA Contact: Patrik Hagara

Votes: 0

Watchers: 5

Created: 2023/09/19 9:02 AM

Updated: 2024/09/23 9:13 PM

Resolved: 2024/04/30 11:02 AM

Release Date: 2024/04/30
