Uploaded image for project: 'OpenShift Bugs'
  1. OpenShift Bugs
  2. OCPBUGS-47802

Multiple reboots during EUS upgrade on Control Plane nodes

XMLWordPrintable

    • Icon: Bug Bug
    • Resolution: Unresolved
    • Icon: Normal Normal
    • None
    • 4.14.z, 4.15.z, 4.16.z
    • Moderate
    • None
    • False
    • Hide

      None

      Show
      None
    • N/A
    • Release Note Not Required
    • Done

      This is a clone of issue OCPBUGS-46460. The following is the description of the original issue:

      This is a clone of issue OCPBUGS-42636. The following is the description of the original issue:

      Description of problem:

          During the EUS to EUS upgrade of a MNO cluster from 4.14.16 to 4.16.11 on baremetal, we have seen that depending on the custom configuration, like performance profile or container runtime config, one or more control plane nodes are rebooted multiple times. 
      
      Seems that this is a race condition. When the first MachineConfig rendered is generated, the first Control Plane node start the reboot(the maxUnavailable is set to 1 on the master MCP), and at this moment a new MachineConfig render is generated, what means a second reboot. Once this first node is rebooted the second time, the rest of the Control Plane nodes are rebooted just once, because no more new MachineConfig renders are generated.

      Version-Release number of selected component (if applicable):

          OCP 4.14.16 > 4.15.31  > 4.16.11

      How reproducible:

          Perform the upgrade of a Multi Node OCP with a custom configuration like a performance profile or container runtime configuration (like force cgroups v1, or update runc to crun)

      Steps to Reproduce:

          1. Deploy on baremetal a MNO OCP 4.14 with a custom manifest, like the below:
      
      ---
      apiVersion: config.openshift.io/v1
      kind: Node
      metadata:
        name: cluster
      spec:
        cgroupMode: v1
      
          2. Upgrade the cluster to the next minor version available, for instance 4.15.31, make a partial upgrade pausing the worker Machine Config Pool.
      
          3. Monitoring the upgrade process (cluster operators, Machine Configs, Machine Config Pools and nodes)
          

      Actual results:

          You will see that once almost all the Cluster Operators are in the 4.15.31 version, except the Machine Config Operator, at this moment review the MachineConfig reders that are generated for the master Machine Config Pool, and also monitor the nodes, to see that new MachineConfig render is generated once the first Control Plane node has been rebooted.

      Expected results:

        What is expected is that in a upgrade only one Machine Config Render is generated per Machine Config Pool, and only one reboot per node to finish the upgrade.  

      Additional info:

          

              team-mco Team MCO
              openshift-crt-jira-prow OpenShift Prow Bot
              Sergio Regidor de la Rosa Sergio Regidor de la Rosa
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

                Created:
                Updated: