Bug
Resolution: Won't Do
Normal
None
4.14, 4.15, 4.16, 4.17, 4.18
Moderate
Yes
False
This is a clone of issue OCPBUGS-17199. The following is the description of the original issue:
—
This is case 2 from OCPBUGS-14673.
Description of problem:
MHC for the control plane does not work correctly in some cases. Case 2: after stopping the kubelet service on a master node, the new master reaches Running, but the old one is stuck in Deleting and many cluster operators are degraded. This is a regression: when I tested this on 4.12 around September 2022, cases 2 and 3 worked correctly. https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-54326
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-06-05-112833
4.13.0-0.nightly-2023-06-06-194351
4.12.0-0.nightly-2023-06-07-005319
How reproducible:
Always
Steps to Reproduce:
1. Create an MHC for the control plane:

apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: control-plane-health
  namespace: openshift-machine-api
spec:
  maxUnhealthy: 1
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-type: master
  unhealthyConditions:
  - status: "False"
    timeout: 300s
    type: Ready
  - status: "Unknown"
    timeout: 300s
    type: Ready

liuhuali@Lius-MacBook-Pro huali-test % oc create -f mhc-master3.yaml
machinehealthcheck.machine.openshift.io/control-plane-health created
liuhuali@Lius-MacBook-Pro huali-test % oc get mhc
NAME                              MAXUNHEALTHY   EXPECTEDMACHINES   CURRENTHEALTHY
control-plane-health              1              3                  3
machine-api-termination-handler   100%           0                  0

2. Stop the kubelet service on the master node; the new master gets Running, the old one is stuck in Deleting, and many cluster operators degrade.

liuhuali@Lius-MacBook-Pro huali-test % oc debug node/huliu-az7c-svq9q-master-1
Starting pod/huliu-az7c-svq9q-master-1-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.0.6
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-5.1# systemctl stop kubelet

Removing debug pod ...
liuhuali@Lius-MacBook-Pro huali-test % oc get node
NAME                                   STATUS   ROLES                  AGE   VERSION
huliu-az7c-svq9q-master-1              Ready    control-plane,master   95m   v1.26.5+7a891f0
huliu-az7c-svq9q-master-2              Ready    control-plane,master   95m   v1.26.5+7a891f0
huliu-az7c-svq9q-master-c96k8-0        Ready    control-plane,master   19m   v1.26.5+7a891f0
huliu-az7c-svq9q-worker-westus-5r8jf   Ready    worker                 34m   v1.26.5+7a891f0
huliu-az7c-svq9q-worker-westus-k747l   Ready    worker                 47m   v1.26.5+7a891f0
huliu-az7c-svq9q-worker-westus-r2vdn   Ready    worker                 83m   v1.26.5+7a891f0
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                   PHASE     TYPE              REGION   ZONE   AGE
huliu-az7c-svq9q-master-1              Running   Standard_D8s_v3   westus          97m
huliu-az7c-svq9q-master-2              Running   Standard_D8s_v3   westus          97m
huliu-az7c-svq9q-master-c96k8-0        Running   Standard_D8s_v3   westus          23m
huliu-az7c-svq9q-worker-westus-5r8jf   Running   Standard_D4s_v3   westus          39m
huliu-az7c-svq9q-worker-westus-k747l   Running   Standard_D4s_v3   westus          53m
huliu-az7c-svq9q-worker-westus-r2vdn   Running   Standard_D4s_v3   westus          91m
liuhuali@Lius-MacBook-Pro huali-test % oc get node
NAME                                   STATUS     ROLES                  AGE     VERSION
huliu-az7c-svq9q-master-1              NotReady   control-plane,master   107m    v1.26.5+7a891f0
huliu-az7c-svq9q-master-2              Ready      control-plane,master   107m    v1.26.5+7a891f0
huliu-az7c-svq9q-master-c96k8-0        Ready      control-plane,master   32m     v1.26.5+7a891f0
huliu-az7c-svq9q-master-jdhgg-1        Ready      control-plane,master   2m10s   v1.26.5+7a891f0
huliu-az7c-svq9q-worker-westus-5r8jf   Ready      worker                 46m     v1.26.5+7a891f0
huliu-az7c-svq9q-worker-westus-k747l   Ready      worker                 59m     v1.26.5+7a891f0
huliu-az7c-svq9q-worker-westus-r2vdn   Ready      worker                 95m     v1.26.5+7a891f0
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                   PHASE      TYPE              REGION   ZONE   AGE
huliu-az7c-svq9q-master-1              Deleting   Standard_D8s_v3   westus          110m
huliu-az7c-svq9q-master-2              Running    Standard_D8s_v3   westus          110m
huliu-az7c-svq9q-master-c96k8-0        Running    Standard_D8s_v3   westus          36m
huliu-az7c-svq9q-master-jdhgg-1        Running    Standard_D8s_v3   westus          5m55s
huliu-az7c-svq9q-worker-westus-5r8jf   Running    Standard_D4s_v3   westus          52m
huliu-az7c-svq9q-worker-westus-k747l   Running    Standard_D4s_v3   westus          65m
huliu-az7c-svq9q-worker-westus-r2vdn   Running    Standard_D4s_v3   westus          103m
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                   PHASE      TYPE              REGION   ZONE   AGE
huliu-az7c-svq9q-master-1              Deleting   Standard_D8s_v3   westus          3h
huliu-az7c-svq9q-master-2              Running    Standard_D8s_v3   westus          3h
huliu-az7c-svq9q-master-c96k8-0        Running    Standard_D8s_v3   westus          105m
huliu-az7c-svq9q-master-jdhgg-1        Running    Standard_D8s_v3   westus          75m
huliu-az7c-svq9q-worker-westus-5r8jf   Running    Standard_D4s_v3   westus          122m
huliu-az7c-svq9q-worker-westus-k747l   Running    Standard_D4s_v3   westus          135m
huliu-az7c-svq9q-worker-westus-r2vdn   Running    Standard_D4s_v3   westus          173m
liuhuali@Lius-MacBook-Pro huali-test % oc get node
NAME                                   STATUS     ROLES                  AGE    VERSION
huliu-az7c-svq9q-master-1              NotReady   control-plane,master   178m   v1.26.5+7a891f0
huliu-az7c-svq9q-master-2              Ready      control-plane,master   178m   v1.26.5+7a891f0
huliu-az7c-svq9q-master-c96k8-0        Ready      control-plane,master   102m   v1.26.5+7a891f0
huliu-az7c-svq9q-master-jdhgg-1        Ready      control-plane,master   72m    v1.26.5+7a891f0
huliu-az7c-svq9q-worker-westus-5r8jf   Ready      worker                 116m   v1.26.5+7a891f0
huliu-az7c-svq9q-worker-westus-k747l   Ready      worker                 129m   v1.26.5+7a891f0
huliu-az7c-svq9q-worker-westus-r2vdn   Ready      worker                 165m   v1.26.5+7a891f0
liuhuali@Lius-MacBook-Pro huali-test % oc get co
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.13.0-0.nightly-2023-06-06-194351   True        True          True       107m    APIServerDeploymentDegraded: 1 of 4 requested instances are unavailable for apiserver.openshift-oauth-apiserver ()...
baremetal                                  4.13.0-0.nightly-2023-06-06-194351   True        False         False      174m
cloud-controller-manager                   4.13.0-0.nightly-2023-06-06-194351   True        False         False      176m
cloud-credential                           4.13.0-0.nightly-2023-06-06-194351   True        False         False      3h
cluster-autoscaler                         4.13.0-0.nightly-2023-06-06-194351   True        False         False      173m
config-operator                            4.13.0-0.nightly-2023-06-06-194351   True        False         False      175m
console                                    4.13.0-0.nightly-2023-06-06-194351   True        False         False      136m
control-plane-machine-set                  4.13.0-0.nightly-2023-06-06-194351   True        False         False      71m
csi-snapshot-controller                    4.13.0-0.nightly-2023-06-06-194351   True        False         False      174m
dns                                        4.13.0-0.nightly-2023-06-06-194351   True        True          False      173m    DNS "default" reports Progressing=True: "Have 6 available node-resolver pods, want 7."
etcd                                       4.13.0-0.nightly-2023-06-06-194351   True        True          True       173m    NodeControllerDegraded: The master nodes not ready: node "huliu-az7c-svq9q-master-1" not ready since 2023-06-07 08:47:34 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
image-registry                             4.13.0-0.nightly-2023-06-06-194351   True        True          False      165m    Progressing: The registry is ready...
ingress                                    4.13.0-0.nightly-2023-06-06-194351   True        False         False      165m
insights                                   4.13.0-0.nightly-2023-06-06-194351   True        False         False      168m
kube-apiserver                             4.13.0-0.nightly-2023-06-06-194351   True        True          True       171m    NodeControllerDegraded: The master nodes not ready: node "huliu-az7c-svq9q-master-1" not ready since 2023-06-07 08:47:34 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
kube-controller-manager                    4.13.0-0.nightly-2023-06-06-194351   True        False         True       171m    NodeControllerDegraded: The master nodes not ready: node "huliu-az7c-svq9q-master-1" not ready since 2023-06-07 08:47:34 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
kube-scheduler                             4.13.0-0.nightly-2023-06-06-194351   True        False         True       171m    NodeControllerDegraded: The master nodes not ready: node "huliu-az7c-svq9q-master-1" not ready since 2023-06-07 08:47:34 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
kube-storage-version-migrator              4.13.0-0.nightly-2023-06-06-194351   True        False         False      106m
machine-api                                4.13.0-0.nightly-2023-06-06-194351   True        False         False      167m
machine-approver                           4.13.0-0.nightly-2023-06-06-194351   True        False         False      174m
machine-config                             4.13.0-0.nightly-2023-06-06-194351   False       False         True       60m     Cluster not available for [{operator 4.13.0-0.nightly-2023-06-06-194351}]: failed to apply machine config daemon manifests: error during waitForDaemonsetRollout: [timed out waiting for the condition, daemonset machine-config-daemon is not ready. status: (desired: 7, updated: 7, ready: 6, unavailable: 1)]
marketplace                                4.13.0-0.nightly-2023-06-06-194351   True        False         False      174m
monitoring                                 4.13.0-0.nightly-2023-06-06-194351   True        False         False      106m
network                                    4.13.0-0.nightly-2023-06-06-194351   True        True          False      177m    DaemonSet "/openshift-network-diagnostics/network-check-target" is not available (awaiting 1 nodes)...
node-tuning                                4.13.0-0.nightly-2023-06-06-194351   True        False         False      173m
openshift-apiserver                        4.13.0-0.nightly-2023-06-06-194351   True        True          True       107m    APIServerDeploymentDegraded: 1 of 4 requested instances are unavailable for apiserver.openshift-apiserver ()
openshift-controller-manager               4.13.0-0.nightly-2023-06-06-194351   True        False         False      170m
openshift-samples                          4.13.0-0.nightly-2023-06-06-194351   True        False         False      167m
operator-lifecycle-manager                 4.13.0-0.nightly-2023-06-06-194351   True        False         False      174m
operator-lifecycle-manager-catalog         4.13.0-0.nightly-2023-06-06-194351   True        False         False      174m
operator-lifecycle-manager-packageserver   4.13.0-0.nightly-2023-06-06-194351   True        False         False      168m
service-ca                                 4.13.0-0.nightly-2023-06-06-194351   True        False         False      175m
storage                                    4.13.0-0.nightly-2023-06-06-194351   True        True          False      174m    AzureDiskCSIDriverOperatorCRProgressing: AzureDiskDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods...
liuhuali@Lius-MacBook-Pro huali-test %

-----------------------

There might be an easier way by just rolling a revision in etcd, stopping kubelet and then observing the same issue.
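The "easier way" above can be driven from the CLI. A minimal sketch, assuming cluster-admin access; spec.forceRedeploymentReason is the standard static-pod operator field for forcing a new etcd revision, and status.nodeStatuses carries the per-node revision state:

# Trigger a new etcd static-pod revision rollout:
oc patch etcd cluster --type=merge -p '{"spec":{"forceRedeploymentReason":"repro-'"$(date +%s)"'"}}'

# While currentRevision != targetRevision for a node, stop the kubelet on
# one master (e.g. via oc debug as in the transcript above) and watch the
# replacement machine come up while the old one sticks in Deleting:
oc get etcd cluster -o jsonpath='{range .status.nodeStatuses[*]}{.nodeName}{": current="}{.currentRevision}{" target="}{.targetRevision}{"\n"}{end}'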
Actual results:
CEO's member removal controller is getting stuck on the IsBootstrapComplete check that was introduced to fix another bug: https://github.com/openshift/cluster-etcd-operator/commit/c96150992a8aba3654835787be92188e947f557c#diff-d91047e39d2c1ab6b35e69359a24e83c19ad9b3e9ad4e44f9b1ac90e50f7b650R97

It turns out IsBootstrapComplete checks whether a revision is currently rolling out (which makes sense), and the one NotReady node whose kubelet is gone still has a revision rollout pending (current revision 7, target 9). More info: https://issues.redhat.com/browse/OCPBUGS-14673?focusedId=22726712&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-22726712

This causes the etcd member to not be removed, which in turn blocks the vertical scale-down procedure from removing the pre-drain hook, as the member is still present. Effectively you end up with a cluster of four control-plane machines, one of which is stuck in the Deleting state.
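Both signals behind that check can be observed from outside. A hedged sketch (the configmap and fields are the standard bootstrap/static-pod operator ones; the etcd pod name follows the cluster in the transcript above):

# Bootstrap completion marker consulted by IsBootstrapComplete
# (expected value: complete):
oc get configmap bootstrap -n kube-system -o jsonpath='{.data.status}'

# The stale member stays in the member list because removal is gated
# behind the revision rollout finishing:
oc rsh -n openshift-etcd etcd-huliu-az7c-svq9q-master-2 etcdctl member list -w table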
Expected results:
The etcd member should be removed, and the machine/node should be deleted.
Additional info:
Removing the revision check does fix this issue reliably, but might not be desirable: https://github.com/openshift/cluster-etcd-operator/pull/1087
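For verifying any fix, the end state to look for matches the expected results above: the old machine leaves Deleting and the node-not-ready degradation clears (machine and operator names as in the reproduction):

oc get machine -n openshift-machine-api
oc get co etcd kube-apiserver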
clones: OCPBUGS-17199 CEO prevents member deletion during revision rollout (Verified)
is blocked by: OCPBUGS-17199 CEO prevents member deletion during revision rollout (Verified)