Bug
Resolution: Duplicate
4.16.z
Quality / Stability / Reliability
Critical
CORENET Sprint 273
Description of problem:
During a live CNI migration from OpenShiftSDN to OVN-Kubernetes, the cluster enters a deadlocked state, and the migration stalls indefinitely. The root cause is an incorrect OVN Source NAT (SNAT) rule applied to traffic from openshift-apiserver pods to the etcd members. This SNAT rule breaks the required mutual TLS (mTLS) authentication, causing the API server to lose connection to etcd. This failure cascades, destabilizing the authentication operator and OpenShift Data Foundation (Ceph). The unhealthy Ceph OSDs enter a crash loop, which in turn blocks worker nodes from being drained, halting the CNI migration.
Version-Release number of selected component (if applicable):
4.16.38
How reproducible:
Frequent
Seen during a live CNI migration on an internal test cluster with ODF/Ceph installed.
Steps to Reproduce:
- Initiate an OpenShiftSDN to OVN-Kubernetes live migration on a cluster with OpenShift Data Foundation (Ceph).
- Allow the migration to proceed to the point where it begins draining worker nodes.
- Observe that the migration stalls and the MachineConfigPool for workers enters a degraded state.
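The degraded state in the last step can be checked programmatically. The following is a minimal sketch, assuming the pool was exported with `oc get mcp worker -o json` and that the object carries the usual `status.conditions` list. The function name and the sample payload are hypothetical and were not captured from the affected cluster.

```python
import json


def degraded_conditions(mcp_json: str) -> list[dict]:
    """Return the conditions that report a degraded pool.

    A condition counts as degraded when its type ends in "Degraded"
    (for example "Degraded" or "NodeDegraded") and its status is "True".
    """
    pool = json.loads(mcp_json)
    conditions = pool.get("status", {}).get("conditions", [])
    return [
        c
        for c in conditions
        if c.get("type", "").endswith("Degraded") and c.get("status") == "True"
    ]


# Hypothetical sample for illustration only; not output from the reported cluster.
SAMPLE_MCP = json.dumps(
    {
        "metadata": {"name": "worker"},
        "status": {
            "conditions": [
                {"type": "Updated", "status": "False"},
                {"type": "Degraded", "status": "True"},
                {"type": "NodeDegraded", "status": "False"},
            ]
        },
    }
)

if __name__ == "__main__":
    for cond in degraded_conditions(SAMPLE_MCP):
        print(cond["type"])
```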
Actual results:
A cascading failure occurs across the cluster:
- The CNI migration is stalled and worker MachineConfigPools are degraded.
- Node drains fail with a timeout error because they are unable to evict rook-ceph-osd pods.
- rook-ceph-osd pods on other nodes are in a CrashLoopBackOff state because they cannot authenticate with the Ceph monitors.
- The authentication cluster operator is degraded with an EOF error when trying to access the OAuth server's healthz endpoint.
- The openshift-apiserver pods are failing to connect to etcd with "connection refused" and "i/o timeout" errors.
- The etcd pod logs show it is rejecting connections because the client "didn't provide a certificate".
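To triage collected pod logs against the symptoms listed above, a small matcher can count how often each quoted error string appears. This is a sketch only: the three substrings come from this report, while the function name, the symptom labels, and the sample lines are hypothetical.

```python
from collections import Counter

# Error substrings quoted in this report, mapped to the symptom they indicate.
SYMPTOMS = {
    "connection refused": "apiserver cannot reach etcd",
    "i/o timeout": "apiserver cannot reach etcd",
    "didn't provide a certificate": "etcd rejected a client without a certificate",
}


def triage(log_lines: list[str]) -> Counter:
    """Count log lines per symptom; a line may match more than one substring."""
    counts: Counter = Counter()
    for line in log_lines:
        for needle, symptom in SYMPTOMS.items():
            if needle in line:
                counts[symptom] += 1
    return counts


# Hypothetical log lines for illustration; not copied from the affected cluster.
SAMPLE_LINES = [
    "dial tcp: i/o timeout",
    "dial tcp: connection refused",
    "rejected connection: client didn't provide a certificate",
    "healthy",
]

if __name__ == "__main__":
    for symptom, count in triage(SAMPLE_LINES).items():
        print(f"{count} x {symptom}")
```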
Expected results:
The CNI live migration should proceed and complete successfully. Core cluster components like the API server, etcd, and authentication should remain stable and functional throughout the process. Ceph OSDs should remain healthy enough to allow nodes to drain sequentially.
Additional info:
The root cause was traced to an incorrect OVN logical NAT rule that was implemented as an OpenFlow rule on the OVS bridge. This rule rewrites the source IP of traffic from the openshift-apiserver pod to the IP of its host node, which breaks the etcd mTLS handshake.
Evidence of the SNAT rule in the OVN Northbound Database:
_uuid       : 479ae961-3f0d-4652-a5c9-b20a5c10a4ce
...
external_ip : "10.6.158.12"
...
logical_ip  : "10.129.0.9"
...
type        : snat
Evidence of the implementing OpenFlow rule:
... table=45, ... nw_src=10.129.0.9 actions=ct(commit,table=46,zone=...,nat(src=10.6.158.12))
Affected Platforms:
Bare metal
- duplicates OCPBUGS-57484 "after OVN-K live migration br0 is still present" (Verified)