OpenShift Request For Enhancement / RFE-5403

Alternate profiles for ordering the worker nodes for cluster updates


    • Type: Feature Request
    • Resolution: Unresolved
    • Priority: Critical
    • Components: Hosted Control Planes, MCO

      1. Proposed title of this feature request

      Alternate profiles for ordering the worker nodes for cluster updates

      2. What is the nature and description of the request?

      First, there is a certain order [1] in which worker nodes are picked by the Machine Config Operator for cluster updates. Second, there is a MaxUnavailable setting [2] in the MCO that allows customers to update multiple worker nodes within an MCP simultaneously. Third, draining of nodes is subject [3] to PDBs and the Eviction API.
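
      As an illustration of the second mechanism, below is a minimal sketch (the value is hypothetical) of raising maxUnavailable on the worker MachineConfigPool so that the MCO cordons and drains more than one node at a time:

        apiVersion: machineconfiguration.openshift.io/v1
        kind: MachineConfigPool
        metadata:
          name: worker
        spec:
          # Hypothetical value: let the MCO cordon/drain up to 2 worker nodes
          # of this pool at the same time (the default is 1).
          maxUnavailable: 2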

      Customers want alternate profiles that allow them to complete cluster updates more quickly in cases where the combination of the above three mechanisms leaves the cluster update "stuck". By "stuck", we mean the update stalls because the MCO either picks node(s) that cannot successfully drain or picks an incorrect number of nodes for simultaneous draining.

      3. Why does the customer need this? (List the business requirements here)

      Customers need this in ROSA HCP, where a) it is already possible to upgrade machine pools both independently and simultaneously, b) there is a separate machine config for each node pool, c) MaxUnavailable is available in the HyperShift API [4], and d) it is possible to create multiple node pools.
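
      For reference, the following is a hedged sketch of how MaxUnavailable can be expressed on a HyperShift NodePool for in-place updates (names and values here are placeholders, and unrelated required fields such as platform and release are omitted); see [4]:

        apiVersion: hypershift.openshift.io/v1beta1
        kind: NodePool
        metadata:
          name: workers-a            # placeholder node pool name
          namespace: clusters
        spec:
          clusterName: example-hcp   # placeholder hosted cluster name
          replicas: 3
          management:
            upgradeType: InPlace
            inPlace:
              # Allow up to 2 nodes of this pool to be unavailable during an update.
              maxUnavailable: 2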

      The workloads in the cluster are such that there is no guarantee on PDBs. That is, a) workloads may not have PDBs at all, b) workloads may have PDBs that do not allow any disruptions, and c) workloads may have PDBs that allow disruptions. The administrator who performs the cluster update will not know beforehand which node pools or workloads fall into any of the above three patterns.
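
      To illustrate pattern b), a PDB like the sketch below (the workload name is hypothetical) never permits a voluntary eviction while the workload is running, so draining any node that hosts its pods hangs indefinitely:

        apiVersion: policy/v1
        kind: PodDisruptionBudget
        metadata:
          name: example-app-pdb        # hypothetical name
        spec:
          # maxUnavailable: 0 makes the Eviction API reject every voluntary
          # eviction, so a drain of nodes running these pods never completes.
          maxUnavailable: 0
          selector:
            matchLabels:
              app: example-app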

      When MaxUnavailable is set such that the chosen number of nodes cannot drain, the cluster update will not progress for hours, if not days. On clusters that can have 500 nodes, the cluster update can take days. It takes trial and error to pick the right MaxUnavailable for every cluster update or for day-2 configuration tasks such as changing KubeletConfig.
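
      As an example of such a day-2 task in a standalone-MCO pool, a KubeletConfig change like the sketch below (the object name and maxPods value are hypothetical) causes the MCO to render a new MachineConfig and roll it out node by node, so the same drain-ordering and MaxUnavailable concerns apply outside of version updates as well:

        apiVersion: machineconfiguration.openshift.io/v1
        kind: KubeletConfig
        metadata:
          name: worker-max-pods        # hypothetical name
        spec:
          machineConfigPoolSelector:
            matchLabels:
              # Assumes the default label the MCO places on the worker pool.
              pools.operator.machineconfiguration.openshift.io/worker: ""
          kubeletConfig:
            maxPods: 500               # hypothetical value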

      4. List any affected packages or components.

      HyperShift Operator, Machine Config Controller

       

      [1] https://docs.openshift.com/container-platform/4.15/post_installation_configuration/machine-configuration-tasks.html#machine-config-overview-post-install-machine-configuration-tasks

      [2] https://docs.openshift.com/container-platform/4.15/updating/understanding_updates/how-updates-work.html#mco-update-process_how-updates-work 

      [3] https://docs.openshift.com/container-platform/4.15/nodes/nodes/nodes-nodes-rebooting.html#nodes-nodes-rebooting-gracefully_nodes-nodes-rebooting

      [4] https://hypershift-docs.netlify.app/how-to/automated-machine-management/configure-machines/ 

              azaalouk Adel Zaalouk
              rh-ee-bchandra Balachandran Chandrasekaran
              Votes: 0
              Watchers: 7
