Red Hat Advanced Cluster Management / ACM-11454

Self-managed HCP pods are scheduled on single mgmt cluster node when no zones are in use


      Create an informative issue (See each section, incomplete templates/issues won't be triaged)

      Using the current documentation as a model, please complete the issue template. 

      Note: Doc team updates the current version and the two previous versions (n-2). For earlier versions, we will address only high-priority, customer-reported issues for releases in support.

      Prerequisite: Start with what we have

      Always look at the current documentation to describe the change that is needed. Use the source or portal link for Step 4:

       - Use the Customer Portal: https://access.redhat.com/documentation/en-us/red_hat_advanced_cluster_management_for_kubernetes

       - Use the GitHub link to find the staged docs in the repository: https://github.com/stolostron/rhacm-docs 

      Describe the changes in the doc and link to your dev story

      Provide info for the following steps:

      1. - [x] Mandatory Add the required version to the Fix version/s field.

      2. - [x] Mandatory Choose the type of documentation change.

            - [ ] New topic in an existing section or new section
            - [x] Update to an existing topic

      3. - [x] Mandatory for GA content:
                  
             - [x] Add steps and/or other important conceptual information here: 
             (see section below)
                  
             - [x] Add Required access level for the user to complete the task here: Same as in already existing topic (see below)
             

             - [x] Add verification at the end of the task, how does the user verify success (a command to run or a result to see?) (See below)
          
             - [x] Add link to dev story here: No dev story

      4. - [x] Mandatory for bugs: What is the diff? Clearly define what the problem is, what the change is, and link to the current documentation:

      https://github.com/stolostron/rhacm-docs/blob/2.11_stage/clusters/hosted_control_planes/bm_intro.adoc

      Problem Statement: In self-managed HA HCP clusters, a critical scheduling issue occurs when the management cluster consists of baremetal nodes that lack the `topology.kubernetes.io/zone` label. Notwithstanding the presence of anti-affinity rules defined as `requiredDuringSchedulingIgnoredDuringExecution`, all HCP pods are scheduled on a single node. This contradicts the expected behavior of a high-availability system and stems from the node labels lacking the topology key that the rules require.

      Impact: Critical components such as etcd and the API servers are concentrated on a single node, which compromises the intended high availability of the HCP cluster.
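
      The behavior described above can be illustrated with a deliberately simplified model of a required pod anti-affinity check. This is a sketch, not the real scheduler: the function name `feasible_nodes` and the node names are hypothetical, and the model assumes that a node without the topology key belongs to no topology domain, so the rule can never exclude it.

```python
ZONE_KEY = "topology.kubernetes.io/zone"


def feasible_nodes(nodes, placed, topology_key=ZONE_KEY):
    """Return the nodes that pass a simplified required anti-affinity check.

    nodes:  dict mapping node name -> dict of node labels
    placed: names of nodes that already run a pod matching the rule
    """
    # Topology domains already occupied by a matching pod.
    # Nodes without the label contribute no domain at all.
    occupied = {
        nodes[name][topology_key]
        for name in placed
        if topology_key in nodes[name]
    }
    result = []
    for name, labels in nodes.items():
        domain = labels.get(topology_key)
        # An unlabeled node is in no domain, so it is never rejected.
        if domain is None or domain not in occupied:
            result.append(name)
    return result


unlabeled = {"node-a": {}, "node-b": {}, "node-c": {}}
labeled = {
    "node-a": {ZONE_KEY: "zone-1"},
    "node-b": {ZONE_KEY: "zone-2"},
    "node-c": {ZONE_KEY: "zone-3"},
}

# Without zone labels, node-a stays feasible even though it already hosts a pod.
print(feasible_nodes(unlabeled, placed=["node-a"]))
# With zone labels, node-a is filtered out for the next replica.
print(feasible_nodes(labeled, placed=["node-a"]))
```

      In this model the rule only removes nodes from consideration when they carry the label, which is consistent with the reported outcome that every replica can end up on the same unlabeled node.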

      Version-Release Number: 4.14

      Reproducibility: Consistently reproducible (100%) under specified conditions.

      Steps to Reproduce:

      1. Initiate a self-managed HA HCP cluster setup.
      2. Use a management cluster with baremetal nodes, specifically those lacking the "topology.kubernetes.io/zone" label.

      Actual Results: All HCP pods are erroneously scheduled on a single node, presenting a risk of a single point of failure.

      Expected Results: HCP pods are evenly distributed across multiple nodes to ensure the robustness of the high-availability setup.
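
      One way to compare actual and expected results is to count the control plane pods per node. The following is a minimal sketch that assumes the pod list was exported as JSON (for example with `oc get pods -n <hosted-control-plane-namespace> -o json`); the helper name `pods_per_node` and the sample data are hypothetical.

```python
import json
from collections import Counter


def pods_per_node(pod_list_json):
    """Count scheduled pods per node from a Kubernetes pod list in JSON form."""
    items = json.loads(pod_list_json)["items"]
    # spec.nodeName is only set once a pod has been scheduled.
    return Counter(
        pod["spec"]["nodeName"]
        for pod in items
        if pod.get("spec", {}).get("nodeName")
    )


# Hypothetical sample that mirrors the reported symptom: everything on one node.
sample = json.dumps({"items": [
    {"metadata": {"name": "etcd-0"}, "spec": {"nodeName": "node-a"}},
    {"metadata": {"name": "etcd-1"}, "spec": {"nodeName": "node-a"}},
    {"metadata": {"name": "etcd-2"}, "spec": {"nodeName": "node-a"}},
    {"metadata": {"name": "pending-pod"}, "spec": {}},
]})

counts = pods_per_node(sample)
print(dict(counts))
if len(counts) == 1:
    print("All scheduled pods run on a single node")
```

      A result with a single key reproduces the problem; a healthy HA deployment would show the replicas spread over several nodes.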

      Immediate Workaround/Solution:

      • Documentation Update: The immediate solution is to update the documentation, highlighting this critical issue. The documentation should clearly state the importance of having the "topology.kubernetes.io/zone" label on baremetal nodes within the management cluster to facilitate proper scheduling in HA HCP clusters.
      • Guidance for Users: Provide explicit instructions and best practices for labeling baremetal nodes in the management cluster, ensuring compliance with HA HCP cluster requirements.
      • Deployment Validation: Introduce a pre-deployment validation step that checks for the presence of essential labels on management cluster nodes. This check will alert users to any missing labels and advise on necessary adjustments before proceeding with the HCP cluster deployment.
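
      The pre-deployment validation mentioned above could look roughly like the following sketch. It assumes the node list of the management cluster was exported as JSON (for example with `oc get nodes -o json`); the helper name `nodes_missing_zone` and the sample nodes are hypothetical and are not part of any existing tool.

```python
import json

ZONE_LABEL = "topology.kubernetes.io/zone"


def nodes_missing_zone(node_list_json):
    """Return the names of nodes that do not carry the zone label."""
    items = json.loads(node_list_json)["items"]
    missing = []
    for node in items:
        # metadata.labels can be absent or null on a node object.
        labels = node["metadata"].get("labels") or {}
        if ZONE_LABEL not in labels:
            missing.append(node["metadata"]["name"])
    return missing


# Hypothetical sample: one labeled node and two unlabeled nodes.
sample = json.dumps({"items": [
    {"metadata": {"name": "node-a", "labels": {ZONE_LABEL: "zone-1"}}},
    {"metadata": {"name": "node-b", "labels": {"kubernetes.io/hostname": "node-b"}}},
    {"metadata": {"name": "node-c"}},
]})

missing = nodes_missing_zone(sample)
for name in missing:
    print(f"{name}: missing {ZONE_LABEL}")
```

      Each node that is reported can then be labeled before the HCP cluster is deployed, for example with `oc label node <node-name> topology.kubernetes.io/zone=<zone>`, where the node name and zone value are chosen by the administrator.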

      Long-term Solution:

      • Enhanced Scheduling Logic: Develop and implement improved pod scheduling logic that accounts for the absence of specific labels and automatically adapts the affinity/anti-affinity rules to ensure high availability, even in environments where certain labels may be missing.
      • User Alert Mechanisms: Integrate mechanisms in the HCP cluster setup process to alert users of potential high-availability risks due to node label configurations, providing recommendations for label adjustments where necessary.
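
      As an illustration of the enhanced scheduling logic, the sketch below picks the topology key for the anti-affinity rule based on the labels that are actually present. It is a hypothetical design, not the behavior of any current release: the function name `pick_topology_key`, the `replicas` parameter, and the fallback to `kubernetes.io/hostname` are assumptions made for this example.

```python
ZONE_KEY = "topology.kubernetes.io/zone"
HOSTNAME_KEY = "kubernetes.io/hostname"


def pick_topology_key(node_labels, replicas=3):
    """Choose a topology key that can actually spread `replicas` pods.

    node_labels: list of label dicts, one per management cluster node
    """
    zones = {labels.get(ZONE_KEY) for labels in node_labels}
    # Use zones only if every node is labeled and there are enough distinct zones.
    if None not in zones and len(zones) >= replicas:
        return ZONE_KEY
    # Otherwise fall back to spreading across individual nodes.
    return HOSTNAME_KEY


zoned = [{ZONE_KEY: "zone-1"}, {ZONE_KEY: "zone-2"}, {ZONE_KEY: "zone-3"}]
partially_labeled = [{ZONE_KEY: "zone-1"}, {}, {}]

print(pick_topology_key(zoned))
print(pick_topology_key(partially_labeled))
```

      Falling back to the hostname key trades zone-level resilience for node-level resilience, which still avoids the single point of failure described in this issue. The same check could drive a user alert instead of changing the rule automatically.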

            sdudhgao@redhat.com Servesha Dudhgaonkar
            azaalouk Adel Zaalouk
            David Huynh David Huynh