OpenShift Bugs / OCPBUGS-58182

OVN-K live migration fails at node drain due to oauth/apiserver failures


    • Type: Bug
    • Resolution: Duplicate
    • Priority: Undefined
    • Affects Version/s: 4.16.z
    • Sprint: CORENET Sprint 273

      Description of problem:

      During a live CNI migration from OpenShiftSDN to OVN-Kubernetes, the cluster enters a deadlocked state, and the migration stalls indefinitely. The root cause is an incorrect OVN Source NAT (SNAT) rule applied to traffic from openshift-apiserver pods to the etcd members. This SNAT rule breaks the required mutual TLS (mTLS) authentication, causing the API server to lose connection to etcd. This failure cascades, destabilizing the authentication operator and OpenShift Data Foundation (Ceph). The unhealthy Ceph OSDs enter a crash loop, which in turn blocks worker nodes from being drained, halting the CNI migration.
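      The stalled, deadlocked state is visible from standard status commands; a minimal check (a sketch, assuming the migration was started through the documented network-type-migration flow):

      # Current networkType and overall operator/pool health
      oc get network.config.openshift.io cluster -o jsonpath='{.status.networkType}{"\n"}'
      oc get co etcd kube-apiserver openshift-apiserver authentication network machine-config
      oc get mcp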

      Version-Release number of selected component (if applicable):

      4.16.38
      

      How reproducible:

      Frequent

      Seen during a live CNI migration on an internal test cluster with ODF/Ceph installed.

      Steps to Reproduce:

      1. Initiate an OpenShiftSDN to OVN-Kubernetes live migration on a cluster with OpenShift Data Foundation (Ceph) installed (see the example patch command after this list).
      2. Allow the migration to proceed to the point where it begins draining worker nodes.
      3. Observe that the migration stalls and the MachineConfigPool for workers enters a degraded state.
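      For step 1, the limited live migration in 4.16 is triggered by patching the cluster Network config; a sketch of the commands (per the documented procedure, exact details may vary by release):

      # Request live migration from OpenShiftSDN to OVN-Kubernetes
      oc patch Network.config.openshift.io cluster --type='merge' \
        --patch '{"metadata":{"annotations":{"network.openshift.io/network-type-migration":""}},"spec":{"networkType":"OVNKubernetes"}}'

      # Follow the rollout; in this bug the worker pool degrades instead of completing
      oc get mcp worker -w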

      Actual results:

      A cascading failure occurs across the cluster (example verification commands follow the list):

      • The CNI migration is stalled and worker MachineConfigPools are degraded.
      • Node drains fail with a timeout error because they are unable to evict rook-ceph-osd pods.
      • rook-ceph-osd pods on other nodes are in a CrashLoopBackOff state because they cannot authenticate with the Ceph monitors.
      • The authentication cluster operator is degraded with an EOF error when trying to access the OAuth server's healthz endpoint.
      • The openshift-apiserver pods are failing to connect to etcd with "connection refused" and "i/o timeout" errors.
      • The etcd pod logs show it is rejecting connections because the client "didn't provide a certificate".
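      Each of these symptoms can be confirmed with standard status and log checks; a sketch, with pod and node names as placeholders:

      # Degraded worker pool / drain timeout
      oc get mcp worker -o yaml | grep -A5 'type: Degraded'

      # Crash-looping Ceph OSDs
      oc get pods -n openshift-storage -l app=rook-ceph-osd

      # authentication operator degraded on the OAuth healthz probe
      oc get co authentication -o yaml | grep -A5 'type: Degraded'

      # openshift-apiserver -> etcd connection errors
      oc logs -n openshift-apiserver deployment/apiserver --all-containers | grep -E 'connection refused|i/o timeout'

      # etcd rejecting connections that arrive without a client certificate
      oc logs -n openshift-etcd etcd-<node-name> -c etcd | grep "didn't provide a certificate"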

      Expected results:

      The CNI live migration should proceed and complete successfully. Core cluster components like the API server, etcd, and authentication should remain stable and functional throughout the process. Ceph OSDs should remain healthy enough to allow nodes to drain sequentially.

      Additional info:

      The root cause was traced to an incorrect OVN logical NAT rule that was implemented as an OpenFlow rule on the OVS bridge. This rule rewrites the source IP of traffic from the openshift-apiserver pod to the IP of its host node, which breaks the etcd mTLS handshake.
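      The evidence below can be collected with commands along these lines (the pod, container, and node names are assumptions for a standard OVN-Kubernetes deployment; the IPs are the ones observed in this report):

      # SNAT entry in the node-local OVN Northbound DB (ovnkube-node pod on the affected node)
      oc -n openshift-ovn-kubernetes exec <ovnkube-node-pod> -c nbdb -- \
        ovn-nbctl find NAT type=snat logical_ip=10.129.0.9

      # Corresponding OpenFlow rule on the node's integration bridge
      oc debug node/<node-name> -- chroot /host \
        ovs-ofctl -O OpenFlow15 dump-flows br-int | grep 'nat(src=10.6.158.12)'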

      Evidence of SNAT rule in OVN Northbound Database:

      _uuid               : 479ae961-3f0d-4652-a5c9-b20a5c10a4ce
      ...
      external_ip         : "10.6.158.12"
      ...
      logical_ip          : "10.129.0.9"
      ...
      type                : snat
      

      Corresponding OpenFlow rule implementing the SNAT:

      ... table=45, ... nw_src=10.129.0.9 actions=ct(commit,table=46,zone=...,nat(src=10.6.158.12))
      

      Affected Platforms:

      Bare metal

              Peng Liu (pliurh)
              Ross Brattain (rbrattai@redhat.com)
              Anurag Saxena