  1. OpenShift Bugs
  2. OCPBUGS-45045

cpou upgrade with infra mcp paused failed as mco expects a newer version of infra mc


    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Major
    • 4.16, 4.18
    • Quality / Stability / Reliability
    • Rejected

      Description of problem:

      In a cpou upgrade scenario (from 4.16 to 4.18) with a paused infra mcp, the mco becomes degraded because it expects the controller version in the infra mc to be the 4.17 version.

      cpou upgrade

      Version-Release number of selected component (if applicable):

      4.16 to 4.18 cpou upgrade

      How reproducible:

      Every time

      Steps to Reproduce:

          1. Install a 4.16 cluster (in my test it is Azure with IPSEC)
          2. Create infra machinesets with 3 infra nodes and move some infra components (monitoring, ingress, registry) to the infra nodes
          3. Do a cpou upgrade from 4.16 to 4.18
            3.1 Pause the worker and infra mcps
            3.2 Start the upgrade to 4.17
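Step 3.1 can be sketched as follows. The report does not state the exact commands that were run; this sketch assumes the usual approach of setting `spec.paused` on each pool with `oc patch`. The helper name `pause_pool_command` is hypothetical, and the function only builds the command line, it does not run it.

```python
import json
import shlex
from typing import List


def pause_pool_command(pool: str, paused: bool = True) -> List[str]:
    """Build (but do not run) an `oc patch` command that sets spec.paused on a pool.

    Assumption: the pool is paused with a JSON merge patch on its
    MachineConfigPool object.
    """
    patch = json.dumps({"spec": {"paused": paused}}, separators=(",", ":"))
    return ["oc", "patch", f"mcp/{pool}", "--type", "merge", "--patch", patch]


# Print the commands for the two pools paused in step 3.1.
for pool_name in ("worker", "infra"):
    print(shlex.join(pause_pool_command(pool_name)))
```

Passing `paused=False` builds the matching unpause command for the same pool.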
          

      Actual results:

          The upgrade to 4.17 failed because the mco is degraded.

      Expected results:

          The master nodes are upgraded to 4.17. The worker and infra nodes should stay on 4.16 because their pools are paused.

      Additional info:

          In a test without infra mcp, the cpou upgrade works well.
          OTA functional QE has a test case with a custom mcp that has the worker label; in that case the cpou upgrade works well.
          In my failed test, the infra mcp does not have the worker label; it only has an infra label.

       

      Failed test job

      oc adm upgrade status showed that one operator is degraded

      = Control Plane =
      Assessment:      Stalled
      Target Version:  4.17.0-0.nightly-2024-11-21-052346 (from 4.16.23)
      Completion:      97% (32 operators updated, 1 updating, 0 waiting)
      Duration:        4h4m (Est. Time Remaining: N/A; estimate duration was 1h35m)
      Operator Status: 32 Healthy, 1 Available but degraded

      The degraded operator is mco. It is stuck because it expects the 4.17 version of the infra mc. However, the infra mcp is paused and thus will not upgrade to 4.17.

      = Update Health =
      Message: Cluster Operator machine-config is degraded (RequiredPoolsFailed)
        Since:       58m9s
        Level:       Error
        Impact:      API Availability
        Reference:   https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/ClusterOperatorDegraded.md
        Resources:
          clusteroperators.config.openshift.io: machine-config
        Description: Unable to apply 4.17.0-0.nightly-2024-11-21-052346: error during syncRequiredMachineConfigPools: [context deadline exceeded, MachineConfigPool infra has not progressed to latest configuration: controller version mismatch for rendered-infra-6c171d9d397c09f3d4b0b81d46df2c05 expected 39e1cd3c3b04229c48988be1fb7f99b95856aff3 has 4bb3364914c4dbcdfcc08b0914f402cdd38f014f: <unknown>, retrying]
      Message: Cluster Version version is failing to proceed with the update (ClusterOperatorDegraded)
        Since:       3m58s
        Level:       Warning
        Impact:      Update Stalled
        Reference:   https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/ClusterOperatorDegraded.md
        Resources:
          clusterversions.config.openshift.io: version
        Description: Cluster operator machine-config is degraded
      Message: Outdated nodes in a paused pool 'infra' will not be updated
        Since:       -
        Level:       Warning
        Impact:      Update Stalled
        Reference:   https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-operator-issues.html#troubleshooting-disabling-autoreboot-mco_troubleshooting-operator-issues
        Resources:
          machineconfigpools.machineconfiguration.openshift.io: infra
        Description: Pool is paused, which stops all changes to the nodes in the pool, including updates. The nodes will not be updated until the pool is unpaused by the administrator.
      Message: Outdated nodes in a paused pool 'worker' will not be updated
        Since:       -
        Level:       Warning
        Impact:      Update Stalled
        Reference:   https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-operator-issues.html#troubleshooting-disabling-autoreboot-mco_troubleshooting-operator-issues
        Resources:
          machineconfigpools.machineconfiguration.openshift.io: worker
        Description: Pool is paused, which stops all changes to the nodes in the pool, including updates. The nodes will not be updated until the pool is unpaused by the administrator. 
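The RequiredPoolsFailed description above packs the pool name, the rendered config and both controller versions into one line. A minimal sketch for pulling those fields apart, assuming the message keeps the wording seen in this report (the regular expression and the names `Mismatch` and `parse_mismatch` are illustrative, not part of any OpenShift tooling):

```python
import re
from typing import NamedTuple, Optional


class Mismatch(NamedTuple):
    pool: str
    rendered_config: str
    expected: str
    actual: str


# Pattern derived from the message text in this report; other releases may
# word the message differently.
_PATTERN = re.compile(
    r"MachineConfigPool (?P<pool>\S+) has not progressed to latest configuration: "
    r"controller version mismatch for (?P<rendered>\S+) "
    r"expected (?P<expected>[0-9a-f]+) has (?P<actual>[0-9a-f]+)"
)


def parse_mismatch(description: str) -> Optional[Mismatch]:
    """Return the mismatch details, or None if the text does not match."""
    m = _PATTERN.search(description)
    if m is None:
        return None
    return Mismatch(m["pool"], m["rendered"], m["expected"], m["actual"])
```

Applied to the description in this report, the parsed fields would be pool `infra`, rendered config `rendered-infra-6c171d9d397c09f3d4b0b81d46df2c05`, expected version `39e1cd3c3b04229c48988be1fb7f99b95856aff3` and actual version `4bb3364914c4dbcdfcc08b0914f402cdd38f014f`.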

      The infra mc

      rendered-infra-37d5ea50ae2274a6829c836c74ef0ca7    39e1cd3c3b04229c48988be1fb7f99b95856aff3   3.4.0             3h15m
      rendered-infra-6c171d9d397c09f3d4b0b81d46df2c05    4bb3364914c4dbcdfcc08b0914f402cdd38f014f   3.4.0             5h2m
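Read together with the error message, the two rows above suggest that the config named in the error (`rendered-infra-6c17...`) was generated by the older controller, while a newer rendered config already carries the expected controller version. A small sketch of that cross-check, assuming the rows are whitespace-separated with the config name in the first column and the generating controller commit in the second (the function names are hypothetical):

```python
from typing import Dict, List


def controller_by_config(listing: str) -> Dict[str, str]:
    """Map each rendered config name to the controller commit in column two."""
    result: Dict[str, str] = {}
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0].startswith("rendered-"):
            result[fields[0]] = fields[1]
    return result


def configs_generated_by(listing: str, commit: str) -> List[str]:
    """List the rendered configs whose controller commit equals `commit`."""
    return [name for name, c in controller_by_config(listing).items() if c == commit]
```

Calling `configs_generated_by` with the expected commit from the error message shows which rendered config the mco is waiting for the pool to reach.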

       

      A possible workaround for customers doing the cpou upgrade with an infra mcp:

      I ran a test with only the worker mcp paused and the infra mcp NOT paused. The infra mcp was then upgraded together with the master mcp, and the cpou upgrade job succeeded.

              jerzhang@redhat.com Yu Qi Zhang
              rhn-support-qili Qiujie Li
              Sergio Regidor de la Rosa