- Bug
- Resolution: Done-Errata
- Critical
- 4.15.0
- Critical
- No
- Rejected
- False
Description of problem:
While running a PerfScale test on staging sectors, the script creates one HC per minute to load a Management Cluster to its maximum capacity (64 HCs). Two clusters tried to use the same serving node pair and got into a deadlock.

# oc get nodes -l osd-fleet-manager.openshift.io/paired-nodes=serving-12
NAME                                        STATUS   ROLES    AGE   VERSION
ip-10-0-4-127.us-east-2.compute.internal    Ready    worker   34m   v1.27.11+d8e449a
ip-10-0-84-196.us-east-2.compute.internal   Ready    worker   34m   v1.27.11+d8e449a

Each node in the pair was assigned to a different hosted cluster:

# oc get nodes -l hypershift.openshift.io/cluster=ocm-staging-2bcimf68iudmq2pctkj11os571ahutr1-mukri-dysn-0017
NAME                                       STATUS   ROLES    AGE   VERSION
ip-10-0-4-127.us-east-2.compute.internal   Ready    worker   33m   v1.27.11+d8e449a

# oc get nodes -l hypershift.openshift.io/cluster=ocm-staging-2bcind28698qgrugl87laqerhhb0u2c2-mukri-dysn-0019
NAME                                        STATUS   ROLES    AGE   VERSION
ip-10-0-84-196.us-east-2.compute.internal   Ready    worker   36m   v1.27.11+d8e449a

Taints were missing on those nodes, so metrics-forwarder pods from other hosted clusters got scheduled onto the serving nodes:

# oc get pods -A -o wide | grep ip-10-0-84-196.us-east-2.compute.internal
ocm-staging-2bcind28698qgrugl87laqerhhb0u2c2-mukri-dysn-0019   kube-apiserver-86d4866654-brfkb      5/5   Running   0   40m   10.128.48.6   ip-10-0-84-196.us-east-2.compute.internal   <none>   <none>
ocm-staging-2bcins06s2acm59sp85g4qd43g9hq42g-mukri-dysn-0020   metrics-forwarder-6d787d5874-69bv7   1/1   Running   0   40m   10.128.48.7   ip-10-0-84-196.us-east-2.compute.internal   <none>   <none>

...and a few more.
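For reference, a check along these lines shows whether the serving pair carries any taints at all; the jsonpath query is illustrative, and empty output for a node means it is untainted:

# List each node in the pair together with its taints
oc get nodes -l osd-fleet-manager.openshift.io/paired-nodes=serving-12 \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints}{"\n"}{end}'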
Version-Release number of selected component (if applicable):
MC version: 4.14.17
HC version: 4.15.10
HO version: quay.io/acm-d/rhtap-hypershift-operator:c698d1da049c86c2cfb4c0f61ca052a0654e2fb9
How reproducible:
Not always.
Steps to Reproduce:
1. Create an MC with the prod config (non-dynamic serving nodes).
2. Create HCs on it at a rate of one HCP per minute (a load-loop sketch follows below).
3. Observe that some clusters remain stuck in the installing state for more than 30 minutes.
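A minimal sketch of the load loop in step 2, assuming the upstream hypershift CLI and placeholder names/paths (the actual test provisions clusters through OCM staging, so the exact invocation differs):

# Request one hosted control plane per minute, up to the 64-HC capacity
for i in $(seq -w 1 64); do
  hypershift create cluster aws \
    --name "perfscale-hc-${i}" \
    --pull-secret ./pull-secret.json \
    --aws-creds ./aws-credentials \
    --region us-east-2
  sleep 60
done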
Actual results:
Only one replica of the kube-apiserver pods was up; the second was stuck in the Pending state. On inspection, the machine API had scaled up both nodes in that machineset (serving-12), but only one was assigned (labelled). Further checking showed that the node from one zone (serving-12a) was assigned to one hosted cluster (0017), while the other (serving-12b) was assigned to a different hosted cluster (0019).
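The split assignment can be seen in one view by printing the cluster label as a column for the whole pair (illustrative use of label columns; both nodes should show the same value, but here they differ):

# Show which hosted cluster each node of serving-12 was labelled for
oc get nodes -l osd-fleet-manager.openshift.io/paired-nodes=serving-12 \
  -L hypershift.openshift.io/cluster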
Expected results:
Both kube-apiserver replicas should be scheduled on nodes from the same serving pair's machinesets, and those nodes should be tainted.
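For illustration only, a taint of roughly this shape on both nodes would keep foreign pods such as metrics-forwarder off the pair once it is assigned; the exact taint key and value applied by the fleet tooling are an assumption here, mirroring the node label shown above:

# Assumed taint shape (mirrors the hypershift.openshift.io/cluster node label; not confirmed by this ticket)
oc adm taint nodes ip-10-0-84-196.us-east-2.compute.internal \
  hypershift.openshift.io/cluster=ocm-staging-2bcind28698qgrugl87laqerhhb0u2c2-mukri-dysn-0019:NoSchedule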
Additional info: Slack
- relates to: HOSTEDCP-1695 HyperShift 0.1.35 (Closed)
- links to: RHEA-2024:3718 OpenShift Container Platform 4.17.z bug fix update