OpenShift Bugs / OCPBUGS-11184

Machine Config Pools not updating correctly for nodes with more than one custom role



      Description of problem:

      When a node has two or more custom roles, as can happen during a cluster update that uses the canary rollout strategy, it is not counted in the pool for any of those roles. As a result, machineconfigpool reports show fewer nodes than expected and suggest, incorrectly, that no nodes match the ancillary pool (for example, the canary pool).

      Version-Release number of selected component (if applicable):

      This behaviour was observed in both 4.8 and 4.11 and is probably also present in other 4.x versions.

      How reproducible:

      Easily reproducible.

      Steps to Reproduce:

      1. Create a 4.8 cluster with 3+ workers.
      2. Customise the role of some workers to be something other than "worker", for example, as is done when some nodes are configured with larger PID limits.
      3. Create additional worker pools, as described in the official documentation on the canary rollout strategy for upgrades: https://docs.openshift.com/container-platform/4.8/updating/update-using-custom-machine-config-pools.html
      4. Label some nodes with the new role without removing their original role.
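
The node state that step 4 produces can be sketched as below. This is only an illustration: the role names `worker-pid` and `worker-canary` and the helper `custom_roles` are hypothetical, chosen to mirror the PID-limit and canary examples in the steps. The only real identifier is the `node-role.kubernetes.io/` label prefix.

```python
# Label prefix that Kubernetes uses to mark node roles.
ROLE_PREFIX = "node-role.kubernetes.io/"
# Roles that are not "custom" for the purpose of this sketch.
BUILTIN_ROLES = {"master", "worker"}


def custom_roles(labels):
    """Return the sorted custom roles found in a node's labels."""
    roles = {
        key[len(ROLE_PREFIX):]
        for key in labels
        if key.startswith(ROLE_PREFIX)
    }
    return sorted(roles - BUILTIN_ROLES)


# Hypothetical labels on a node after step 4: the original custom role
# (worker-pid) is kept and the new one (worker-canary) is added.
node_labels = {
    "kubernetes.io/hostname": "worker1.example.com",
    ROLE_PREFIX + "worker": "",
    ROLE_PREFIX + "worker-pid": "",
    ROLE_PREFIX + "worker-canary": "",
}

print(custom_roles(node_labels))
```

A node in this state carries two custom roles at once, which is the condition this report is about.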

      Actual results:

      The machine counts in the machine configuration pool for the original role are decremented, but the new machine configuration pool still shows 0 machines.

      Expected results:

      The machine counts should properly reflect the roles of the nodes present in the cluster.

      Additional info:

      The machine-config-controller logs show the following message for nodes with multiple custom roles:
      
      ~~~
      2023-03-23T17:53:18.030255756Z W0323 17:53:18.030197       1 node_controller.go:798] can't get pool for node "worker1.example.com": node worker1.example.com belongs to 2 custom roles, cannot proceed with this Node
      2023-03-23T17:58:05.259428149Z E0323 17:58:05.259321       1 node_controller.go:441] error finding pool for node: node worker1.example.com belongs to 2 custom roles, cannot proceed with this Node
      ~~~
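
To make the link between these messages and the pool counts concrete, here is a simplified model of the behaviour described in this report. It is a sketch under stated assumptions, not the controller's actual code: the functions `pool_for_node` and `machine_counts`, the node names other than `worker1.example.com`, and the role names `worker-pid` and `worker-canary` are all hypothetical. The error text is copied from the log above.

```python
ROLE_PREFIX = "node-role.kubernetes.io/"


def pool_for_node(name, labels):
    """Pick one pool for a node; refuse nodes with several custom roles."""
    roles = {k[len(ROLE_PREFIX):] for k in labels if k.startswith(ROLE_PREFIX)}
    custom = sorted(roles - {"master", "worker"})
    if len(custom) > 1:
        # Same wording as the message seen in the controller logs.
        raise ValueError(
            f"node {name} belongs to {len(custom)} custom roles, "
            "cannot proceed with this Node"
        )
    if custom:
        return custom[0]  # assumption: a single custom role wins over "worker"
    if "master" in roles:
        return "master"
    if "worker" in roles:
        return "worker"
    return None


def machine_counts(nodes, pools):
    """Count nodes per pool, skipping nodes whose pool cannot be resolved."""
    counts = {pool: 0 for pool in pools}
    skipped = []
    for name, labels in nodes.items():
        try:
            pool = pool_for_node(name, labels)
        except ValueError as err:
            skipped.append(str(err))  # the node is counted in no pool at all
            continue
        if pool in counts:
            counts[pool] += 1
    return counts, skipped


def role_labels(*roles):
    """Build a label dict from role names (illustrative helper)."""
    return {ROLE_PREFIX + role: "" for role in roles}


nodes = {
    "worker1.example.com": role_labels("worker", "worker-pid", "worker-canary"),
    "worker2.example.com": role_labels("worker", "worker-pid"),
    "worker3.example.com": role_labels("worker"),
}
counts, skipped = machine_counts(nodes, ["worker", "worker-pid", "worker-canary"])
print(counts)
print(skipped)
```

In this model, `worker1.example.com` drops out of the `worker-pid` count and never appears in `worker-canary`, which matches the actual results reported above.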
      

       

            jerzhang@redhat.com Yu Qi Zhang
            rhn-support-pauwebst Paul Webster
            Rio Liu Rio Liu
            Paul Webster