OpenShift Bugs / OCPBUGS-44673

After upgrade from 4.16 to 4.17, network cluster operator is degraded and ovnkube-control-plane pods stuck in CrashLoopBackOff

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • None
    • 4.17.z
    • None
    • Important
    • None
    • False

      Description of problem: In two separate labs, after upgrading from 4.16 to 4.17.3, the network ClusterOperator never becomes healthy. The ovnkube-control-plane-* pods are crash looping, and their logs repeat these errors:

      I1118 16:20:31.898589       1 log.go:245] http: proxy error: dial tcp 127.0.0.1:29108: connect: connection refused
      I1118 16:20:36.066389       1 log.go:245] http: proxy error: dial tcp 127.0.0.1:29108: connect: connection refused
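
      To see which local endpoint is refusing connections and how often, the log lines above can be tallied with a short script. This is a minimal sketch, not part of the original report: the function name `refused_targets` is hypothetical, and the sample input is just the two lines quoted above.

```python
import re
from collections import Counter

# klog-style header: severity letter, MMDD, time, PID, "file:line]", then the message.
KLOG_RE = re.compile(
    r"^(?P<sev>[IWEF])(?P<mmdd>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<pid>\d+) (?P<src>[^\]]+)\] (?P<msg>.*)$"
)
# The specific failure reported in this bug: a refused TCP dial.
REFUSED_RE = re.compile(
    r"dial tcp (?P<addr>[\d.]+:\d+): connect: connection refused"
)


def refused_targets(lines):
    """Count 'connection refused' errors per target address (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        header = KLOG_RE.match(line.strip())
        if not header:
            continue  # skip anything that is not a klog-formatted line
        hit = REFUSED_RE.search(header.group("msg"))
        if hit:
            counts[hit.group("addr")] += 1
    return counts


# The two log lines quoted in the description.
SAMPLE = [
    "I1118 16:20:31.898589       1 log.go:245] http: proxy error: "
    "dial tcp 127.0.0.1:29108: connect: connection refused",
    "I1118 16:20:36.066389       1 log.go:245] http: proxy error: "
    "dial tcp 127.0.0.1:29108: connect: connection refused",
]

if __name__ == "__main__":
    for addr, count in refused_targets(SAMPLE).items():
        print(f"{addr}: {count} refused connection(s)")
```

      Every error in the sample points at the same loopback address, 127.0.0.1:29108, which suggests that whatever should be listening on that port inside the pod is not running.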

      Version-Release number of selected component (if applicable): 4.17.3

      How reproducible: Seen in both labs when upgrading from OCP 4.16 to OCP 4.17.3

      Steps to Reproduce:

      1. Start with a cluster on 4.16.x (the original environment was at least 4.16.15, possibly higher)

      2. Upgrade to 4.17.3 via the web console. The upgrade works fine, except that the network operator never becomes healthy

      Actual results: The network ClusterOperator remains degraded

      Expected results: The network ClusterOperator should become healthy after the upgrade
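
      The difference between the actual and expected results can be checked from the operator's status conditions. The sketch below assumes the ClusterOperator object has been saved as JSON (for example with `oc get clusteroperator network -o json`); the helper name `operator_health`, the sample documents, and the choice to define "healthy" as Available and not Degraded are all assumptions made for illustration.

```python
import json


def operator_health(cluster_operator):
    """Summarize a ClusterOperator's conditions (hypothetical helper).

    `cluster_operator` is the parsed JSON of a ClusterOperator object.
    Condition statuses are the strings "True", "False", or "Unknown".
    """
    conditions = {
        c.get("type"): c.get("status")
        for c in cluster_operator.get("status", {}).get("conditions", [])
    }
    available = conditions.get("Available") == "True"
    degraded = conditions.get("Degraded") == "True"
    progressing = conditions.get("Progressing") == "True"
    return {
        "available": available,
        "degraded": degraded,
        "progressing": progressing,
        # Assumption: "healthy" here means Available and not Degraded.
        "healthy": available and not degraded,
    }


# Hypothetical sample resembling the state described in this report.
DEGRADED_SAMPLE = json.loads("""
{
  "kind": "ClusterOperator",
  "metadata": {"name": "network"},
  "status": {
    "conditions": [
      {"type": "Available", "status": "True"},
      {"type": "Progressing", "status": "True"},
      {"type": "Degraded", "status": "True"}
    ]
  }
}
""")

# Hypothetical sample of the expected state after a successful upgrade.
HEALTHY_SAMPLE = json.loads("""
{
  "kind": "ClusterOperator",
  "metadata": {"name": "network"},
  "status": {
    "conditions": [
      {"type": "Available", "status": "True"},
      {"type": "Progressing", "status": "False"},
      {"type": "Degraded", "status": "False"}
    ]
  }
}
""")

if __name__ == "__main__":
    print("degraded sample:", operator_health(DEGRADED_SAMPLE))
    print("healthy sample:", operator_health(HEALTHY_SAMPLE))
```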

      Additional info: Not sure if it's related, but both clusters have the NMState operator installed, with bridges defined on the workers on a dedicated NIC that is on the same network as br-ex. In one cluster, I removed the bridge and even removed the secondary NICs from the worker nodes entirely, but the problem persisted.

      Furthermore, a third cluster, which is a fresh OCP 4.17 installation (not an upgrade), had the same problem. There, removing the NMState bridges corrected the issue (the bridges weren't actually necessary in that cluster anyway). In the upgraded clusters, removing the bridges does not seem to resolve the issue.

      Affected Platforms:

      Is it an

      1. Internal RedHat testing failure

              bbennett@redhat.com Ben Bennett
              rhn-support-msecaur Matthew Secaur
              Anurag Saxena Anurag Saxena