OpenShift Bugs / OCPBUGS-58182

OVN-K live migration fails at node drain due to oauth/apiserver failures


    • Type: Bug
    • Resolution: Duplicate
    • Priority: Undefined
    • Affects Version/s: 4.16.z
    • Sprint: CORENET Sprint 273

      Description of problem:

      During a live CNI migration from OpenShiftSDN to OVN-Kubernetes, the cluster enters a deadlocked state, and the migration stalls indefinitely. The root cause is an incorrect OVN Source NAT (SNAT) rule applied to traffic from openshift-apiserver pods to the etcd members. This SNAT rule breaks the required mutual TLS (mTLS) authentication, causing the API server to lose connection to etcd. This failure cascades, destabilizing the authentication operator and OpenShift Data Foundation (Ceph). The unhealthy Ceph OSDs enter a crash loop, which in turn blocks worker nodes from being drained, halting the CNI migration.
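      The stalled, deadlocked state is visible from standard status commands; a minimal check (a sketch, assuming the migration was started through the documented network-type-migration flow):

      # Current networkType and overall operator/pool health
      oc get network.config.openshift.io cluster -o jsonpath='{.status.networkType}{"\n"}'
      oc get co etcd kube-apiserver openshift-apiserver authentication network machine-config
      oc get mcp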

      Version-Release number of selected component (if applicable):

      4.16.38
      

      How reproducible:

      Frequent

      Seen during a live CNI migration on an internal test cluster with ODF/Ceph installed.

      Steps to Reproduce:

      1. Initiate an OpenShiftSDN to OVN-Kubernetes live migration on a cluster with OpenShift Data Foundation (Ceph) installed (see the example patch command after this list).
      2. Allow the migration to proceed to the point where it begins draining worker nodes.
      3. Observe that the migration stalls and the MachineConfigPool for workers enters a degraded state.
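      For step 1, the limited live migration in 4.16 is triggered by patching the cluster Network config; a sketch of the commands (per the documented procedure, exact details may vary by release):

      # Request live migration from OpenShiftSDN to OVN-Kubernetes
      oc patch Network.config.openshift.io cluster --type='merge' \
        --patch '{"metadata":{"annotations":{"network.openshift.io/network-type-migration":""}},"spec":{"networkType":"OVNKubernetes"}}'

      # Follow the rollout; in this bug the worker pool degrades instead of completing
      oc get mcp worker -w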

      Actual results:

      A cascading failure occurs across the cluster (example verification commands follow the list):

      • The CNI migration is stalled and worker MachineConfigPools are degraded.
      • Node drains fail with a timeout error because they are unable to evict rook-ceph-osd pods.
      • rook-ceph-osd pods on other nodes are in a CrashLoopBackOff state because they cannot authenticate with the Ceph monitors.
      • The authentication cluster operator is degraded with an EOF error when trying to access the OAuth server's healthz endpoint.
      • The openshift-apiserver pods are failing to connect to etcd with "connection refused" and "i/o timeout" errors.
      • The etcd pod logs show it is rejecting connections because the client "didn't provide a certificate".
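      Each of these symptoms can be confirmed with standard status and log checks; a sketch, with pod and node names as placeholders:

      # Degraded worker pool / drain timeout
      oc get mcp worker -o yaml | grep -A5 'type: Degraded'

      # Crash-looping Ceph OSDs
      oc get pods -n openshift-storage -l app=rook-ceph-osd

      # authentication operator degraded on the OAuth healthz probe
      oc get co authentication -o yaml | grep -A5 'type: Degraded'

      # openshift-apiserver -> etcd connection errors
      oc logs -n openshift-apiserver deployment/apiserver --all-containers | grep -E 'connection refused|i/o timeout'

      # etcd rejecting connections that arrive without a client certificate
      oc logs -n openshift-etcd etcd-<node-name> -c etcd | grep "didn't provide a certificate"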

      Expected results:

      The CNI live migration should proceed and complete successfully. Core cluster components like the API server, etcd, and authentication should remain stable and functional throughout the process. Ceph OSDs should remain healthy enough to allow nodes to drain sequentially.

      Additional info:

      The root cause was traced to an incorrect OVN logical NAT rule that was implemented as an OpenFlow rule on the OVS bridge. This rule rewrites the source IP of traffic from the openshift-apiserver pod to the IP of its host node, which breaks the etcd mTLS handshake.
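      The evidence below can be collected with commands along these lines (the pod, container, and node names are assumptions for a standard OVN-Kubernetes deployment; the IPs are the ones observed in this report):

      # SNAT entry in the node-local OVN Northbound DB (ovnkube-node pod on the affected node)
      oc -n openshift-ovn-kubernetes exec <ovnkube-node-pod> -c nbdb -- \
        ovn-nbctl find NAT type=snat logical_ip=10.129.0.9

      # Corresponding OpenFlow rule on the node's integration bridge
      oc debug node/<node-name> -- chroot /host \
        ovs-ofctl -O OpenFlow15 dump-flows br-int | grep 'nat(src=10.6.158.12)'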

      Evidence of SNAT rule in OVN Northbound Database:

      _uuid               : 479ae961-3f0d-4652-a5c9-b20a5c10a4ce
      ...
      external_ip         : "10.6.158.12"
      ...
      logical_ip          : "10.129.0.9"
      ...
      type                : snat
      

      Corresponding OpenFlow rule implementing the SNAT:

      ... table=45, ... nw_src=10.129.0.9 actions=ct(commit,table=46,zone=...,nat(src=10.6.158.12))
      

      Affected Platforms:

      Bare metal

              Peng Liu (pliurh)
              Ross Brattain (rbrattai@redhat.com)
              Anurag Saxena