Type: Task
Resolution: Done
Priority: Critical
Fix Version/s: MCE 2.5.0, MCE 2.6.0
Affects Version/s: ACM 2.10.0, ACM 2.11.0
Component/s: Documentation, HyperShift
Labels:
- doc-required
- self-managed

Blocked: False
Blocked Reason: None
Ready: False
Regression: No

Create an informative issue (See each section, incomplete templates/issues won't be triaged)

Using the current documentation as a model, please complete the issue template.

Note: Doc team updates the current version and the two previous versions (n-2). For earlier versions, we will address only high-priority, customer-reported issues for releases in support.

Prerequisite: Start with what we have

Always look at the current documentation to describe the change that is needed. Use the source or portal link for Step 4:

- Use the Customer Portal: https://access.redhat.com/documentation/en-us/red_hat_advanced_cluster_management_for_kubernetes

- Use the GitHub link to find the staged docs in the repository: https://github.com/stolostron/rhacm-docs

Describe the changes in the doc and link to your dev story

Provide info for the following steps:

1. - [x] Mandatory Add the required version to the Fix version/s field.

2. - [x] Mandatory Choose the type of documentation change.

- [ ] New topic in an existing section or new section
- [x] Update to an existing topic

3. - [x] Mandatory for GA content:

- [x] Add steps and/or other important conceptual information here:
(see section below)

- [x] Add Required access level for the user to complete the task here: Same as in already existing topic (see below)

- [x] Add verification at the end of the task, how does the user verify success (a command to run or a result to see?) (See below)

- [x] Add link to dev story here: No dev story

4. - [x] Mandatory for bugs: What is the diff? Clearly define what the problem is, what the change is, and link to the current documentation:

https://github.com/stolostron/rhacm-docs/blob/2.11_stage/clusters/hosted_control_planes/bm_intro.adoc

Problem Statement: In self-managed HA HCP clusters, a critical scheduling issue occurs when the management cluster consists of baremetal nodes that lack the `topology.kubernetes.io/zone` label. Although the HCP pods carry `requiredDuringSchedulingIgnoredDuringExecution` anti-affinity rules, all HCP pods end up scheduled on a single node. This contradicts the expected behavior of a high-availability system and stems from the missing topology key in the node labels.
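The effect described above can be illustrated with a simplified model. This is a sketch for discussion, not the Kubernetes scheduler's code: it assumes that a node belongs to a topology domain only when it carries the topology key, so a node without the key is never excluded by the anti-affinity term. The function name and node names are hypothetical.

```python
ZONE_KEY = "topology.kubernetes.io/zone"


def blocked_by_anti_affinity(candidate_labels, occupied_nodes_labels, topology_key=ZONE_KEY):
    """Return True if placing another replica on the candidate node is forbidden.

    Simplified assumption: only nodes that carry the topology key belong to a
    topology domain; nodes without the key belong to no domain at all.
    """
    occupied_domains = {
        labels[topology_key]
        for labels in occupied_nodes_labels
        if topology_key in labels
    }
    domain = candidate_labels.get(topology_key)
    return domain is not None and domain in occupied_domains


# A node without the zone label that already hosts one replica (hypothetical name).
unlabeled = {"kubernetes.io/hostname": "worker-0"}
print(blocked_by_anti_affinity(unlabeled, [unlabeled]))  # False: the same node is still allowed

# The same node once it carries a zone label.
labeled = {"kubernetes.io/hostname": "worker-0", ZONE_KEY: "zone-a"}
print(blocked_by_anti_affinity(labeled, [labeled]))  # True: a second replica is rejected
```

Under this model, an unlabeled management cluster never triggers the rule, which matches the behavior reported in this issue.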

Impact: Critical components such as etcd and the API servers are concentrated on a single node, which compromises the high availability that the HCP cluster is intended to provide.

Version-Release Number: 4.14

Reproducibility: Consistently reproducible (100%) under specified conditions.

Steps to Reproduce:

1. Initiate a self-managed HA HCP cluster setup.
2. Use a management cluster with baremetal nodes, specifically those lacking the "topology.kubernetes.io/zone" label.

Actual Results: All HCP pods are erroneously scheduled on a single node, presenting a risk of a single point of failure.

Expected Results: Ideally, HCP pods should be evenly distributed across multiple nodes to ensure the robustness of the high-availability setup.
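One possible way to verify the actual versus expected results is to count the HCP pods per node. The sketch below assumes the input is the parsed JSON from `oc get pods -n <hosted-control-plane-namespace> -o json`. The helper names and the sample pod and node names are hypothetical.

```python
from collections import Counter


def pods_per_node(pod_list):
    """Count scheduled pods per node from a parsed pod list document."""
    return Counter(
        item["spec"]["nodeName"]
        for item in pod_list.get("items", [])
        if item.get("spec", {}).get("nodeName")
    )


def all_on_one_node(pod_list):
    """Return True when more than one pod is scheduled and all share one node."""
    counts = pods_per_node(pod_list)
    return len(counts) == 1 and sum(counts.values()) > 1


# Hypothetical sample mirroring the reported failure: three replicas, one node.
sample = {
    "items": [
        {"metadata": {"name": "etcd-0"}, "spec": {"nodeName": "worker-0"}},
        {"metadata": {"name": "etcd-1"}, "spec": {"nodeName": "worker-0"}},
        {"metadata": {"name": "etcd-2"}, "spec": {"nodeName": "worker-0"}},
    ]
}
print(dict(pods_per_node(sample)))
print(all_on_one_node(sample))
```

In practice the dictionary would come from `json.load(sys.stdin)` with the `oc` output piped in. A result spread over several nodes indicates the expected distribution.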

Immediate Workaround/Solution:

Documentation Update: The immediate solution is to update the documentation, highlighting this critical issue. The documentation should clearly state the importance of having the "topology.kubernetes.io/zone" label on baremetal nodes within the management cluster to facilitate proper scheduling in HA HCP clusters.

Guidance for Users: Provide explicit instructions and best practices for labeling baremetal nodes in the management cluster, ensuring compliance with HA HCP cluster requirements.

Deployment Validation: Introduce a pre-deployment validation step that checks for the presence of essential labels on management cluster nodes. This check will alert users to any missing labels and advise on necessary adjustments before proceeding with the HCP cluster deployment.
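The labeling guidance and the pre-deployment check could be sketched as follows. This is a minimal illustration, assuming the input is the parsed JSON from `oc get nodes -o json`. The helper names are hypothetical, and `<zone>` is a placeholder that the administrator must replace with a value that reflects the real failure domains.

```python
ZONE_LABEL = "topology.kubernetes.io/zone"


def nodes_missing_zone_label(node_list, label=ZONE_LABEL):
    """Return the names of nodes that do not carry the given label."""
    missing = []
    for item in node_list.get("items", []):
        labels = item.get("metadata", {}).get("labels") or {}
        if label not in labels:
            missing.append(item["metadata"]["name"])
    return missing


def suggested_label_commands(node_names, label=ZONE_LABEL):
    """Build `oc label node` commands; the zone value is left as a placeholder."""
    return [f"oc label node {name} {label}=<zone>" for name in node_names]


# Hypothetical node list: one labeled node and one unlabeled node.
nodes = {
    "items": [
        {"metadata": {"name": "worker-0", "labels": {ZONE_LABEL: "zone-a"}}},
        {"metadata": {"name": "worker-1", "labels": {}}},
    ]
}
missing = nodes_missing_zone_label(nodes)
if missing:
    print("Nodes missing the zone label:", ", ".join(missing))
    for command in suggested_label_commands(missing):
        print(command)
```

An empty result means every management cluster node carries the label, and the deployment can proceed.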

Long-term Solution:

Enhanced Scheduling Logic: Develop and implement improved pod scheduling logic that accounts for the absence of specific labels and automatically adapts the affinity/anti-affinity rules to ensure high availability, even in environments where certain labels may be missing.
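One way such adaptive logic might choose a topology key is sketched below. This is an assumption for illustration only, not the HyperShift implementation: it uses the zone key when every node carries it and otherwise falls back to the well-known `kubernetes.io/hostname` label, so replicas at least land on different nodes. The function name is hypothetical.

```python
ZONE_KEY = "topology.kubernetes.io/zone"
HOSTNAME_KEY = "kubernetes.io/hostname"


def choose_topology_key(nodes_labels):
    """Pick the anti-affinity topology key for a list of node label dictionaries.

    Use the zone key only when every node is labeled with it; otherwise fall
    back to the hostname key so replicas are spread across nodes.
    """
    if nodes_labels and all(ZONE_KEY in labels for labels in nodes_labels):
        return ZONE_KEY
    return HOSTNAME_KEY


# Hypothetical clusters: one fully labeled, one with an unlabeled node.
zoned = [{ZONE_KEY: "zone-a"}, {ZONE_KEY: "zone-b"}]
partial = [{ZONE_KEY: "zone-a"}, {HOSTNAME_KEY: "worker-1"}]
print(choose_topology_key(zoned))
print(choose_topology_key(partial))
```

Requiring the label on every node, rather than on some, avoids the inconsistent-labeling case in which a subset of nodes silently escapes the rule.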

User Alert Mechanisms: Integrate mechanisms in the HCP cluster setup process to alert users of potential high-availability risks due to node label configurations, providing recommendations for label adjustments where necessary.

Clones: OCPBUGS-22899 Self-managed HCP pods are scheduled on single mgmt cluster node when no zones are in use (Closed)

Assignee: Servesha Dudhgaonkar
Reporter: Adel Zaalouk
QA Contact: David Huynh
Votes: 0
Watchers: 3
Created: 2024/01/25 10:03 AM
Updated: 2024/05/10 3:44 PM
Resolved: 2024/05/10 3:44 PM
