
Type: Bug
Resolution: Can't Do
Priority: Major
Fix Version/s: rhel-9.7
Affects Version/s: rhel-9.4, rhel-9.5
Component/s: NetworkManager
Labels:
- HSR
- Networking
- PRP
- edge
- highavailability
- industrial
- nmstate

Regression:
No
Severity:
Important

AssignedTeam:
rhel-net-mgmt
Sub-System Group:

ssg_networking

Story Points:
5
Blocked:
False
Ready:
False
Blocked Reason:

None
Product Documentation Required:
None
Products:

Red Hat Enterprise Linux
Sprint:
None

Acceptance Criteria:

Definition of Done:

Please mark each item below with ( / ) if completed or ( x ) if incomplete:

( ) The acceptance criteria defined below are met.

Given two physical interfaces (port1 and port2) being configured for PRP through nmstate,

When the system administrator applies a valid configuration ensuring both interfaces share the same MAC address,

Then PRP should consistently drop duplicate packets and achieve near zero packet loss during failover tests under typical load conditions.

Definition of Done:

The implementation meets the acceptance criteria

Integration tests are written and pass

The official Red Hat documentation and the upstream nmstate documentation are updated to clarify that both ports must have the same MAC address. If `supervision-address` is required for fine-tuning, it is made configurable or documented.

( ) Code changes are included in a downstream build attached to an errata.

( ) All required testing (manual and/or automated) passes successfully.

( ) Related documentation updates (if applicable) have been completed.
Preliminary Testing:
None
Test Coverage:
None

Experience:

PX Impact Score:
SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Planning:
None

What were you trying to do that didn't work?

On RHEL 9.4, observing and managing HSR/PRP interfaces with nmstate, based on the sample manifest available upstream https://github.com/nmstate/nmstate/pull/2469#issue-2011996438 , does not work as expected/defined by the PRP protocol. Everything works as expected, however, when the HSR interface is managed with ifconfig ( https://lwn.net/Articles/826386/ ), or when port1 and port2 are given the same MAC address with nmstate before the HSR interface is declared.

When PRP is configured through nmstate without setting port1 and port2 to the same MAC address, the PRP interface receives on both ports but intermittently fails to drop the duplicate packets [1], so duplicate messages reach the application. A wireshark/tcpdump inspection showed that the redundant messages (which are supposed to be identical) carried different MAC addresses.
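A toy model can illustrate why differing source MACs would defeat duplicate discard. It assumes that the receiver keys its duplicate table on the pair (source MAC, sequence number). This is a simplification for illustration only, not the kernel hsr implementation, and the class and function names are hypothetical.

```python
from typing import Iterable, Set, Tuple


class DuplicateDiscard:
    """Toy duplicate-discard table keyed on (source MAC, sequence number)."""

    def __init__(self) -> None:
        self._seen: Set[Tuple[str, int]] = set()

    def accept(self, src_mac: str, seq: int) -> bool:
        """Return True to deliver the frame, False if it is a known duplicate."""
        key = (src_mac.lower(), seq)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


def delivered(frames: Iterable[Tuple[str, int]]) -> int:
    """Count how many (src_mac, seq) frames would reach the application."""
    table = DuplicateDiscard()
    return sum(1 for mac, seq in frames if table.accept(mac, seq))


# Both ports transmit with the same MAC: the second copy is discarded.
same_mac = [("52:54:00:18:6d:48", 1), ("52:54:00:18:6d:48", 1)]
# Each port keeps its own MAC: the copies look like two different senders.
diff_mac = [("52:54:00:18:6d:48", 1), ("52:54:00:1e:ac:12", 1)]

print(delivered(same_mac))
print(delivered(diff_mac))
```

Under this assumption the second case delivers both copies, which matches the duplicates seen by the application.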

One of the steps when setting up PRP with standard commands is to set the MAC address of both interfaces to the same value. nmstate does not apply this configuration to HSR interfaces by default, so defining the MAC address of the ports explicitly appears to be a required step.
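The manual sequence can be sketched as a small command builder. It only prints the iproute2 commands and does not run them. The helper name is hypothetical, the down/address/up order follows the example in the LWN article linked above, and treating the `supervision` byte as the equivalent of nmstate's `multicast-spec` is an assumption.

```python
from typing import List


def prp_setup_commands(port1: str, port2: str, mac: str,
                       name: str = "hsr0",
                       supervision: int = 40) -> List[List[str]]:
    """Build (but do not run) iproute2 commands for a manual PRP setup."""
    cmds: List[List[str]] = []
    for port in (port1, port2):
        # Give both ports the same MAC address before creating the PRP device.
        cmds.append(["ip", "link", "set", "dev", port, "down"])
        cmds.append(["ip", "link", "set", "dev", port, "address", mac])
        cmds.append(["ip", "link", "set", "dev", port, "up"])
    # "proto 1" selects PRP on the hsr link type; "proto 0" would be HSR.
    cmds.append(["ip", "link", "add", "name", name, "type", "hsr",
                 "slave1", port1, "slave2", port2,
                 "supervision", str(supervision), "proto", "1"])
    return cmds


for argv in prp_setup_commands("enp7s0", "enp8s0", "52:54:00:18:6d:48"):
    print(" ".join(argv))
```

Returning argument lists rather than one shell string keeps the sketch easy to review before anything is executed as root.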

The following options were then evaluated:

The "supervision-address" field, which is still read-only at this time: https://github.com/nmstate/nmstate/commit/b23da648e49593c9919e11dcd9a65d3d423fe868#diff-d74df62b06b50e06e830190f130b2cd29f8336dae26d668ffd54edff8aaff512R57

[root@rhel94-local-prp1 ~]# head hsr0.yaml
---
interfaces:
  - name: hsr0
    type: hsr
    state: up
    hsr:
      port1: enp7s0
      port2: enp8s0
      supervision-address: 52:54:00:73:72:76
      multicast-spec: 40
[root@rhel94-local-prp1 ~]# nmstatectl apply hsr0.yaml
(..)
[2025-01-17T15:19:43Z WARN  nmstate::ifaces::hsr] The supervision-address is read-only, ignoring it on desired state.

Setting the port1 and port2 MAC addresses with nmstate seems to be a valid solution, but it is not explicitly documented upstream or downstream within nmstate for HSR/PRP. The remaining problem is that ~0.0909091% packet loss is still observed during failover when the nodes are under a high network bandwidth workload. We are not sure whether this is still a problem, given the statements of "zero packet loss" about the HSR/PRP protocols.

[root@rhel94-local-prp1 ~]# cat hsr0.yaml
---
interfaces:
  - name: enp7s0
    type: ethernet
    state: up
    mac-address: 52:54:00:18:6d:48
  - name: enp8s0
    type: ethernet
    state: up
    mac-address: 52:54:00:73:72:76
  - name: hsr0
    type: hsr
    state: up
    hsr:
      port1: enp7s0
      port2: enp8s0
      multicast-spec: 40
      protocol: prp
    ipv4:
      enabled: true
      dhcp: false
      address:
      - ip: 192.168.200.10
        prefix-length: 24
      auto-dns: false
      auto-gateway: false
      auto-routes: false
[root@rhel94-local-prp1 ~]# ip a l
(..) 
3: enp7s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:18:6d:48 brd ff:ff:ff:ff:ff:ff
4: enp8s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:18:6d:48 brd ff:ff:ff:ff:ff:ff permaddr 52:54:00:1e:ac:12
5: hsr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1494 qdisc noqueue state UP group default qlen 1000
    link/ether 52:54:00:18:6d:48 brd ff:ff:ff:ff:ff:ff
    inet 192.168.200.20/24 brd 192.168.200.255 scope global noprefixroute hsr0
       valid_lft forever preferred_lft forever
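Since the same-MAC requirement is easy to miss, a pre-flight check on the desired state may help before running `nmstatectl apply`. The sketch below assumes the manifest has already been parsed into a Python dict with the same keys as the YAML above. The function names are hypothetical and this is not part of nmstate.

```python
from typing import Any, Dict, Optional


def port_macs(state: Dict[str, Any], hsr_name: str) -> Dict[str, Optional[str]]:
    """Map each port of the given hsr interface to its desired MAC, or None."""
    ifaces = {iface["name"]: iface for iface in state.get("interfaces", [])}
    hsr_cfg = ifaces[hsr_name]["hsr"]
    macs: Dict[str, Optional[str]] = {}
    for key in ("port1", "port2"):
        port = hsr_cfg[key]
        mac = ifaces.get(port, {}).get("mac-address")
        macs[port] = mac.lower() if mac else None
    return macs


def ports_share_mac(state: Dict[str, Any], hsr_name: str) -> bool:
    """True only if both ports declare a MAC address and the two are equal."""
    macs = list(port_macs(state, hsr_name).values())
    return None not in macs and len(set(macs)) == 1


desired = {
    "interfaces": [
        {"name": "enp7s0", "type": "ethernet", "state": "up",
         "mac-address": "52:54:00:18:6d:48"},
        {"name": "enp8s0", "type": "ethernet", "state": "up",
         "mac-address": "52:54:00:18:6D:48"},
        {"name": "hsr0", "type": "hsr", "state": "up",
         "hsr": {"port1": "enp7s0", "port2": "enp8s0",
                 "multicast-spec": 40, "protocol": "prp"}},
    ]
}

print(port_macs(desired, "hsr0"))
print(ports_share_mac(desired, "hsr0"))
```

A manifest that omits `mac-address` on either port, or sets two different values, makes `ports_share_mac` return False.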

ICMP and iperf statistics from the failover tests (VMs with 2 vCPUs and 4 GB):

--- 192.168.200.10 ping statistics ---
6600 packets transmitted, 6594 received, 0.0909091% packet loss, time 6757422ms
rtt min/avg/max/mdev = 0.064/0.458/4.251/0.134 ms

Accepted connection from 192.168.200.20, port 47564
[  5] local 192.168.200.10 port 5201 connected to 192.168.200.20 port 47570
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-5.00   sec   180 MBytes   302 Mbits/sec                  
[  5]   5.00-10.00  sec   195 MBytes   327 Mbits/sec                  
[  5]  10.00-15.00  sec   188 MBytes   315 Mbits/sec                  
[  5]  15.00-20.00  sec   183 MBytes   307 Mbits/sec                  
[  5]  20.00-25.00  sec   195 MBytes   328 Mbits/sec                  
[  5]  25.00-30.00  sec   194 MBytes   325 Mbits/sec                  
[  5]  30.00-35.00  sec   177 MBytes   298 Mbits/sec                  
[  5]  35.00-40.00  sec   486 MBytes   815 Mbits/sec                  
[  5]  40.00-45.00  sec   691 MBytes  1.16 Gbits/sec                  
[  5]  45.00-50.00  sec   700 MBytes  1.17 Gbits/sec                  
[  5]  50.00-55.01  sec   655 MBytes  1.10 Gbits/sec                  
[  5]  55.01-60.00  sec   636 MBytes  1.07 Gbits/sec                  
[  5]  60.00-60.04  sec  4.00 MBytes   789 Mbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-60.04  sec  4.38 GBytes   626 Mbits/sec                  receiver
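The loss figure in the ping summary above can be recomputed from the raw counters, which also gives the absolute number of lost probes. This is a minimal sketch with a hypothetical helper name. It only parses the "transmitted/received" part of the summary line.

```python
import re
from typing import Tuple


def ping_loss(summary: str) -> Tuple[int, float]:
    """Return (lost packets, loss percentage) from a ping summary line."""
    match = re.search(r"(\d+) packets transmitted, (\d+) received", summary)
    if match is None:
        raise ValueError("not a ping summary line")
    sent, received = int(match.group(1)), int(match.group(2))
    lost = sent - received
    return lost, 100.0 * lost / sent


line = ("6600 packets transmitted, 6594 received, "
        "0.0909091% packet loss, time 6757422ms")
lost, pct = ping_loss(line)
print(f"{lost} lost, {pct:.7f}% loss")  # prints "6 lost, 0.0909091% loss"
```

The reported percentage therefore corresponds to 6 lost echo replies out of 6600 over the whole test run.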

[1] https://en.wikipedia.org/wiki/Parallel_Redundancy_Protocol & https://wiki.wireshark.org/PRP

What is the impact of this issue to you?

PRP is designed to provide zero-time recovery and allows the redundancy to be checked continuously to detect lurking failures.
At this moment HSR/PRP is still Technology Preview (TP) in RHEL 9.4, and we are looking for it to become GA for production-grade deployments.

Setting port1 and port2 to the same MAC address through nmstate seems to be a valid solution, but it is not documented upstream or downstream within nmstate. A supportability review is therefore needed to guide us on best practices.

Finally, we would like to understand why "supervision-address" is currently a read-only field and whether that affects the way PRP works.

Please provide the package NVR for which the bug is seen:

[root@rhel94-local-prp1 ~]# cat /etc/redhat-release 
Red Hat Enterprise Linux release 9.4 (Plow)
[root@rhel94-local-prp1 ~]# uname -a
Linux rhel94-local-prp1 5.14.0-427.42.1.el9_4.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Oct 18 14:35:40 EDT 2024 x86_64 x86_64 x86_64 GNU/Linux
[root@rhel94-local-prp1 ~]# lsmod |grep hsr
hsr                    57344  0
[root@rhel94-local-prp1 ~]# dnf info nmstate
Updating Subscription Management repositories.
Last metadata expiration check: 1:18:33 ago on Wed 22 Jan 2025 12:08:00 PM WET.
Installed Packages
Name         : nmstate
Version      : 2.2.39
Release      : 1.el9_5
Architecture : x86_64
Size         : 10 M
Source       : nmstate-2.2.39-1.el9_5.src.rpm
Repository   : @System
From repo    : rhel-9-for-x86_64-appstream-rpms
Summary      : Declarative network manager API
URL          : https://github.com/nmstate/nmstate

How reproducible is this bug?:

Always

Steps to reproduce

Apply the sample manifest available upstream: https://github.com/nmstate/nmstate/pull/2469#issue-2011996438
Compare with the working manifest in the KB article: https://access.redhat.com/solutions/7103424

Expected results

PRP provides zero-time recovery and allows the redundancy to be checked continuously to detect lurking failures.

Actual results

With the current upstream sample manifests, nmstate does not seem able to deliver the level of availability expected/defined by the PRP protocol.
As suggested by https://access.redhat.com/solutions/7103424, we make sure that port1 and port2 are configured with the same MAC address before declaring the hsr interface. This address is typically inherited from port1. See more at https://lwn.net/Articles/826386/ . Even with this second configuration, ~0.0909091% packet loss is still observed during failover when the nodes are under a high network bandwidth workload. We are not sure whether this is still a problem, given the statements of "zero packet loss" about the HSR/PRP protocols.

is related to

RFE-4762 Support for Parallel Redundancy Protocol (PRP) and High-availability Seamless Redundancy (HSR)

Closed
