- Bug
- Resolution: Unresolved
- Normal
- None
- 4.17
Description of problem:
When we move one node from one custom MCP to another custom MCP, the MCPs report a wrong number of nodes. For example, we reach this situation (the worker-perf MCP is not reporting the right number of nodes):

$ oc get mcp,nodes
NAME                                                                      CONFIG                                                          UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
machineconfigpool.machineconfiguration.openshift.io/master                rendered-master-c8d23b071e1ccf6cf85c7f1b31c0def6                True      False      False      3              3                   3                     0                      142m
machineconfigpool.machineconfiguration.openshift.io/worker                rendered-worker-36ee1fdc485685ac9c324769889c3348                True      False      False      1              1                   1                     0                      142m
machineconfigpool.machineconfiguration.openshift.io/worker-perf           rendered-worker-perf-6b5fbffac62c3d437e307e849c44b556           True      False      False      2              2                   2                     0                      24m
machineconfigpool.machineconfiguration.openshift.io/worker-perf-canary    rendered-worker-perf-canary-6b5fbffac62c3d437e307e849c44b556    True      False      False      1              1                   1                     0                      7m52s

NAME                                             STATUS   ROLES                       AGE    VERSION
node/ip-10-0-13-228.us-east-2.compute.internal   Ready    worker,worker-perf-canary   138m   v1.30.4
node/ip-10-0-2-250.us-east-2.compute.internal    Ready    control-plane,master        145m   v1.30.4
node/ip-10-0-34-223.us-east-2.compute.internal   Ready    control-plane,master        144m   v1.30.4
node/ip-10-0-35-61.us-east-2.compute.internal    Ready    worker,worker-perf          136m   v1.30.4
node/ip-10-0-79-232.us-east-2.compute.internal   Ready    control-plane,master        144m   v1.30.4
node/ip-10-0-86-124.us-east-2.compute.internal   Ready    worker                      139m   v1.30.4

After 20 minutes or half an hour the MCPs start reporting the right number of nodes.
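A quick way to see the discrepancy while it lasts is to compare how many nodes actually carry the pool's node-role label with the machineCount the pool reports in its status. The commands below are an illustrative check, not part of the original report; they only use standard oc flags and the MachineConfigPool status fields:

# Nodes that currently match the worker-perf node selector
$ oc get nodes -l node-role.kubernetes.io/worker-perf --no-headers | wc -l
# Machine count the worker-perf pool reports in its status
$ oc get mcp worker-perf -o jsonpath='{.status.machineCount}{"\n"}'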
Version-Release number of selected component (if applicable):
IPI on AWS version:
$ oc get clusterversion
NAME      VERSION                               AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.17.0-0.nightly-2024-09-13-040101   True        False         124m    Cluster version is 4.17.0-0.nightly-2024-09-13-040101
How reproducible:
Always
Steps to Reproduce:
1. Create an MCP

oc create -f - << EOF
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker-perf
spec:
  machineConfigSelector:
    matchExpressions:
      - { key: machineconfiguration.openshift.io/role, operator: In, values: [worker,worker-perf] }
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker-perf: ""
EOF

2. Add 2 nodes to the MCP

$ oc label node $(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath="{.items[0].metadata.name}") node-role.kubernetes.io/worker-perf=
$ oc label node $(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath="{.items[1].metadata.name}") node-role.kubernetes.io/worker-perf=

3. Create another MCP

oc create -f - << EOF
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker-perf-canary
spec:
  machineConfigSelector:
    matchExpressions:
      - { key: machineconfiguration.openshift.io/role, operator: In, values: [worker,worker-perf,worker-perf-canary] }
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker-perf-canary: ""
EOF

4. Move one node from the MCP created in step 1 to the MCP created in step 3 (the second command removes the worker-perf label from the node)

$ oc label node $(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath="{.items[0].metadata.name}") node-role.kubernetes.io/worker-perf-canary=
$ oc label node $(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath="{.items[0].metadata.name}") node-role.kubernetes.io/worker-perf-
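Note: to observe the stale count right after step 4, one option (not part of the original reproducer, just an illustrative sketch using standard oc flags) is to watch both custom pools and the custom node-role labels while the node moves:

# Watch the machine counts of both custom pools
$ oc get mcp worker-perf worker-perf-canary -w
# Show which nodes carry which custom node-role labels
$ oc get nodes -L node-role.kubernetes.io/worker-perf -L node-role.kubernetes.io/worker-perf-canary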
Actual results:
The worker-perf pool is not reporting the right number of nodes. It continues reporting 2 nodes even though one of them was moved to the worker-perf-canary MCP.

$ oc get mcp,nodes
NAME                                                                      CONFIG                                                          UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
machineconfigpool.machineconfiguration.openshift.io/master                rendered-master-c8d23b071e1ccf6cf85c7f1b31c0def6                True      False      False      3              3                   3                     0                      142m
machineconfigpool.machineconfiguration.openshift.io/worker                rendered-worker-36ee1fdc485685ac9c324769889c3348                True      False      False      1              1                   1                     0                      142m
machineconfigpool.machineconfiguration.openshift.io/worker-perf           rendered-worker-perf-6b5fbffac62c3d437e307e849c44b556           True      False      False      2              2                   2                     0                      24m
machineconfigpool.machineconfiguration.openshift.io/worker-perf-canary    rendered-worker-perf-canary-6b5fbffac62c3d437e307e849c44b556    True      False      False      1              1                   1                     0                      7m52s

NAME                                             STATUS   ROLES                       AGE    VERSION
node/ip-10-0-13-228.us-east-2.compute.internal   Ready    worker,worker-perf-canary   138m   v1.30.4
node/ip-10-0-2-250.us-east-2.compute.internal    Ready    control-plane,master        145m   v1.30.4
node/ip-10-0-34-223.us-east-2.compute.internal   Ready    control-plane,master        144m   v1.30.4
node/ip-10-0-35-61.us-east-2.compute.internal    Ready    worker,worker-perf          136m   v1.30.4
node/ip-10-0-79-232.us-east-2.compute.internal   Ready    control-plane,master        144m   v1.30.4
node/ip-10-0-86-124.us-east-2.compute.internal   Ready    worker                      139m   v1.30.4
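The worker-perf-canary pool already counts the moved node, so the node itself appears to have switched pools and only the worker-perf pool status lags behind. One way to double-check which rendered config the moved node is actually tracking (this check is not part of the original report; it uses the standard MCO node annotations and the node name from the output above) is:

# Rendered config the node is currently running
$ oc get node ip-10-0-13-228.us-east-2.compute.internal -o jsonpath='{.metadata.annotations.machineconfiguration\.openshift\.io/currentConfig}{"\n"}'
# Rendered config the node is expected to converge to
$ oc get node ip-10-0-13-228.us-east-2.compute.internal -o jsonpath='{.metadata.annotations.machineconfiguration\.openshift\.io/desiredConfig}{"\n"}'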
Expected results:
MCPs should always report the right number of nodes
Additional info:
It is very similar to this other issue: https://bugzilla.redhat.com/show_bug.cgi?id=2090436, which was discussed in this Slack conversation: https://redhat-internal.slack.com/archives/C02CZNQHGN8/p1653479831004619
- blocks: OCPBUGS-42200 MCPs report wrong number of nodes when we move nodes from one custom MCP to another custom MCP (Closed)
- is cloned by: OCPBUGS-42200 MCPs report wrong number of nodes when we move nodes from one custom MCP to another custom MCP (Closed)
- links to: RHEA-2024:6122 OpenShift Container Platform 4.18.z bug fix update