OpenShift Bugs / OCPBUGS-55282

OpenShift CNI Live Migration blocked due to Pod Disruption Budget and network bridge failure between SDN and OVN



      Environment: OpenShift 4.16.32 (production cluster, AWS, OVN limited live migration)

      Description

      During a Limited Live Migration from OpenShift SDN to OVN-Kubernetes, the process stalled. Nodes remained in mixed CNI states (SDN + OVN), leading to:

      • Application unavailability
      • DNS resolution failures
      • ImagePullBackOff errors
      • Broken service-to-service communication across the cluster

      This behavior appears to contradict the expected automatic and non-disruptive migration flow defined in the official documentation.
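
      For context, the following is a minimal, read-only sketch of how the relevant migration state can be inspected, using the Python kubernetes client. It assumes the documented limited live migration trigger (the network.openshift.io/network-type-migration annotation together with spec.networkType on Network.config.openshift.io/cluster) and a kubeconfig with cluster-read access; it is illustrative only, not the exact steps used in this environment.

        from kubernetes import client, config

        # Load credentials from the local kubeconfig.
        config.load_kube_config()
        custom = client.CustomObjectsApi()

        # Read the cluster network configuration that drives the migration.
        net = custom.get_cluster_custom_object(
            group="config.openshift.io", version="v1",
            plural="networks", name="cluster",
        )
        print("spec.networkType:   ", net["spec"].get("networkType"))
        print("status.networkType: ", net.get("status", {}).get("networkType"))
        print("migration annotation:",
              net["metadata"].get("annotations", {}).get(
                  "network.openshift.io/network-type-migration"))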

      Issue Summary

      • Migration was blocked due to MachineConfigPool (MCP) not draining nodes.
      • Cause: Pod Disruption Budgets (PDBs) prevented eviction during the migration (see the PDB check sketch after this list).
      • The network bridge between SDN and OVN failed mid-rollout when new worker nodes joined during the second phase.
      • All master nodes have completed the second reboot cycle and are now on the OVN-Kubernetes network.
      • 6 out of 8 worker nodes remained on SDN; only 2 completed the OVN configuration.
      • DNS queries sent to CoreDNS from a cluster-image-registry pod running on a master node were visible in the packet capture (PCAP), but no responses were received (see the DNS resolution sketch after this list).
      • Connections to CoreDNS and to the default Kubernetes service from any pod running on a node that had completed the second reboot failed.
      • Conntrack entries remained UNREPLIED; ICMP responses showed port unreachable (see the conntrack sketch after this list).
      • A detailed analysis is given in the first comment of this bug.
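
      PDB check sketch referenced above: a minimal, read-only example (assuming the Python kubernetes client and a kubeconfig with cluster-read access; not taken from the case data) that lists the PodDisruptionBudgets currently allowing zero disruptions, which is the condition that blocks pod eviction and therefore node drain during the migration reboots.

        from kubernetes import client, config

        config.load_kube_config()
        policy = client.PolicyV1Api()

        # A PDB whose status reports zero allowed disruptions will block the
        # eviction API, and with it the MCP-driven node drain.
        for pdb in policy.list_pod_disruption_budget_for_all_namespaces().items:
            st = pdb.status
            if st and st.disruptions_allowed == 0:
                print(f"{pdb.metadata.namespace}/{pdb.metadata.name}: "
                      f"disruptionsAllowed=0, currentHealthy={st.current_healthy}, "
                      f"desiredHealthy={st.desired_healthy}")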
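
      DNS resolution sketch referenced above: a small reproduction that can be run from a Python shell inside any pod on a node that has completed the second reboot. The service names are the standard in-cluster ones (an assumption, not copied from the case); a resolution failure here matches the "query sent, no response" symptom seen in the PCAP.

        import socket

        # Resolution goes through the pod's /etc/resolv.conf, i.e. CoreDNS.
        for name in ("kubernetes.default.svc.cluster.local",
                     "image-registry.openshift-image-registry.svc.cluster.local"):
            try:
                addrs = sorted({ai[4][0] for ai in socket.getaddrinfo(name, None)})
                print(f"{name} -> {addrs}")
            except socket.gaierror as err:
                print(f"{name} -> FAILED ({err})")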
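
      Conntrack sketch referenced above: a minimal check, run as root directly on an affected node, that prints DNS conntrack entries still marked UNREPLIED. It assumes the /proc/net/nf_conntrack interface is available on the node (not confirmed from the case data).

        # Print conntrack entries for DNS traffic that never received a reply.
        with open("/proc/net/nf_conntrack") as f:
            for line in f:
                if "UNREPLIED" in line and "dport=53" in line:
                    print(line.strip())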

      Customer Impact

      • The production application went down
      • The customer could not proceed with further migrations
      • Significant delay in infrastructure expansion plans
      • Risk of repeated failure in upcoming cluster migrations (April 28th)

      Business value: USD 5M account

      Expected Behavior

      • SDN/OVN bridge should persist and allow traffic between nodes in the transitional state.

      Actual Behavior

      • Nodes stay in inconsistent SDN/OVN states, breaking internal networking and service discovery (see the node-to-CNI mapping sketch below).
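
      Node-to-CNI mapping sketch referenced above: a read-only example (assuming the Python kubernetes client) that maps each node to the CNI pods running on it, which makes the inconsistent SDN/OVN split visible during the transitional state.

        from collections import defaultdict
        from kubernetes import client, config

        config.load_kube_config()
        core = client.CoreV1Api()

        # During the limited live migration both CNI namespaces run DaemonSet
        # pods; grouping them per node shows which nodes are still on SDN.
        cni_by_node = defaultdict(set)
        for ns, label in (("openshift-sdn", "SDN"),
                          ("openshift-ovn-kubernetes", "OVN")):
            for pod in core.list_namespaced_pod(ns).items:
                if pod.spec.node_name:
                    cni_by_node[pod.spec.node_name].add(label)

        for node, cnis in sorted(cni_by_node.items()):
            print(f"{node}: {'+'.join(sorted(cnis))}")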

      References

      Attachments (in case 04111495)

      • SOS reports from Master Nodes:

      https://attachments.access.redhat.com/hydra/rest/cases/04111495/attachments/e935f622-b4cf-4a96-b5c3-41dd0953a04c?usePresignedUrl=true

      https://attachments.access.redhat.com/hydra/rest/cases/04111495/attachments/c20a9abf-73c5-4b99-aa2f-5b631c74ef20?usePresignedUrl=true

      https://attachments.access.redhat.com/hydra/rest/cases/04111495/attachments/5736ddc3-933a-4c20-82ef-c2d01d9ebb8d?usePresignedUrl=true
