Type: Bug
Resolution: Won't Do
Priority: Normal
Affects Versions: 4.13, 4.12, 4.14
Impact: Quality / Stability / Reliability
Severity: Moderate
This is a clone of issue OCPBUGS-23044. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-21802. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-17199. The following is the description of the original issue:
—
This is case 2 from OCPBUGS-14673.
Description of problem:
MHC for the control plane does not work correctly in some cases. Case 2: stop the kubelet service on a master node; the replacement master reaches Running, but the old machine is stuck in Deleting and many cluster operators become degraded. This is a regression: the same scenario was tested on 4.12 around September 2022, and case 2 and case 3 both worked correctly. https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-54326
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-06-05-112833, 4.13.0-0.nightly-2023-06-06-194351, 4.12.0-0.nightly-2023-06-07-005319
How reproducible:
Always
Steps to Reproduce:
1. Create an MHC for the control plane:
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: control-plane-health
  namespace: openshift-machine-api
spec:
  maxUnhealthy: 1
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-type: master
  unhealthyConditions:
  - status: "False"
    timeout: 300s
    type: Ready
  - status: "Unknown"
    timeout: 300s
    type: Ready
liuhuali@Lius-MacBook-Pro huali-test % oc create -f mhc-master3.yaml
machinehealthcheck.machine.openshift.io/control-plane-health created
liuhuali@Lius-MacBook-Pro huali-test % oc get mhc
NAME MAXUNHEALTHY EXPECTEDMACHINES CURRENTHEALTHY
control-plane-health 1 3 3
machine-api-termination-handler 100% 0 0
Case 2. Stop the kubelet service on the master node; the new master gets to Running, the old one is stuck in Deleting, and many cluster operators are degraded.
liuhuali@Lius-MacBook-Pro huali-test % oc debug node/huliu-az7c-svq9q-master-1
Starting pod/huliu-az7c-svq9q-master-1-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.0.6
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-5.1# systemctl stop kubelet
Removing debug pod ...
liuhuali@Lius-MacBook-Pro huali-test % oc get node
NAME STATUS ROLES AGE VERSION
huliu-az7c-svq9q-master-1 Ready control-plane,master 95m v1.26.5+7a891f0
huliu-az7c-svq9q-master-2 Ready control-plane,master 95m v1.26.5+7a891f0
huliu-az7c-svq9q-master-c96k8-0 Ready control-plane,master 19m v1.26.5+7a891f0
huliu-az7c-svq9q-worker-westus-5r8jf Ready worker 34m v1.26.5+7a891f0
huliu-az7c-svq9q-worker-westus-k747l Ready worker 47m v1.26.5+7a891f0
huliu-az7c-svq9q-worker-westus-r2vdn Ready worker 83m v1.26.5+7a891f0
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME PHASE TYPE REGION ZONE AGE
huliu-az7c-svq9q-master-1 Running Standard_D8s_v3 westus 97m
huliu-az7c-svq9q-master-2 Running Standard_D8s_v3 westus 97m
huliu-az7c-svq9q-master-c96k8-0 Running Standard_D8s_v3 westus 23m
huliu-az7c-svq9q-worker-westus-5r8jf Running Standard_D4s_v3 westus 39m
huliu-az7c-svq9q-worker-westus-k747l Running Standard_D4s_v3 westus 53m
huliu-az7c-svq9q-worker-westus-r2vdn Running Standard_D4s_v3 westus 91m
liuhuali@Lius-MacBook-Pro huali-test % oc get node
NAME STATUS ROLES AGE VERSION
huliu-az7c-svq9q-master-1 NotReady control-plane,master 107m v1.26.5+7a891f0
huliu-az7c-svq9q-master-2 Ready control-plane,master 107m v1.26.5+7a891f0
huliu-az7c-svq9q-master-c96k8-0 Ready control-plane,master 32m v1.26.5+7a891f0
huliu-az7c-svq9q-master-jdhgg-1 Ready control-plane,master 2m10s v1.26.5+7a891f0
huliu-az7c-svq9q-worker-westus-5r8jf Ready worker 46m v1.26.5+7a891f0
huliu-az7c-svq9q-worker-westus-k747l Ready worker 59m v1.26.5+7a891f0
huliu-az7c-svq9q-worker-westus-r2vdn Ready worker 95m v1.26.5+7a891f0
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME PHASE TYPE REGION ZONE AGE
huliu-az7c-svq9q-master-1 Deleting Standard_D8s_v3 westus 110m
huliu-az7c-svq9q-master-2 Running Standard_D8s_v3 westus 110m
huliu-az7c-svq9q-master-c96k8-0 Running Standard_D8s_v3 westus 36m
huliu-az7c-svq9q-master-jdhgg-1 Running Standard_D8s_v3 westus 5m55s
huliu-az7c-svq9q-worker-westus-5r8jf Running Standard_D4s_v3 westus 52m
huliu-az7c-svq9q-worker-westus-k747l Running Standard_D4s_v3 westus 65m
huliu-az7c-svq9q-worker-westus-r2vdn Running Standard_D4s_v3 westus 103m
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME PHASE TYPE REGION ZONE AGE
huliu-az7c-svq9q-master-1 Deleting Standard_D8s_v3 westus 3h
huliu-az7c-svq9q-master-2 Running Standard_D8s_v3 westus 3h
huliu-az7c-svq9q-master-c96k8-0 Running Standard_D8s_v3 westus 105m
huliu-az7c-svq9q-master-jdhgg-1 Running Standard_D8s_v3 westus 75m
huliu-az7c-svq9q-worker-westus-5r8jf Running Standard_D4s_v3 westus 122m
huliu-az7c-svq9q-worker-westus-k747l Running Standard_D4s_v3 westus 135m
huliu-az7c-svq9q-worker-westus-r2vdn Running Standard_D4s_v3 westus 173m
liuhuali@Lius-MacBook-Pro huali-test % oc get node
NAME STATUS ROLES AGE VERSION
huliu-az7c-svq9q-master-1 NotReady control-plane,master 178m v1.26.5+7a891f0
huliu-az7c-svq9q-master-2 Ready control-plane,master 178m v1.26.5+7a891f0
huliu-az7c-svq9q-master-c96k8-0 Ready control-plane,master 102m v1.26.5+7a891f0
huliu-az7c-svq9q-master-jdhgg-1 Ready control-plane,master 72m v1.26.5+7a891f0
huliu-az7c-svq9q-worker-westus-5r8jf Ready worker 116m v1.26.5+7a891f0
huliu-az7c-svq9q-worker-westus-k747l Ready worker 129m v1.26.5+7a891f0
huliu-az7c-svq9q-worker-westus-r2vdn Ready worker 165m v1.26.5+7a891f0
liuhuali@Lius-MacBook-Pro huali-test % oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
authentication 4.13.0-0.nightly-2023-06-06-194351 True True True 107m APIServerDeploymentDegraded: 1 of 4 requested instances are unavailable for apiserver.openshift-oauth-apiserver ()...
baremetal 4.13.0-0.nightly-2023-06-06-194351 True False False 174m
cloud-controller-manager 4.13.0-0.nightly-2023-06-06-194351 True False False 176m
cloud-credential 4.13.0-0.nightly-2023-06-06-194351 True False False 3h
cluster-autoscaler 4.13.0-0.nightly-2023-06-06-194351 True False False 173m
config-operator 4.13.0-0.nightly-2023-06-06-194351 True False False 175m
console 4.13.0-0.nightly-2023-06-06-194351 True False False 136m
control-plane-machine-set 4.13.0-0.nightly-2023-06-06-194351 True False False 71m
csi-snapshot-controller 4.13.0-0.nightly-2023-06-06-194351 True False False 174m
dns 4.13.0-0.nightly-2023-06-06-194351 True True False 173m DNS "default" reports Progressing=True: "Have 6 available node-resolver pods, want 7."
etcd 4.13.0-0.nightly-2023-06-06-194351 True True True 173m NodeControllerDegraded: The master nodes not ready: node "huliu-az7c-svq9q-master-1" not ready since 2023-06-07 08:47:34 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
image-registry 4.13.0-0.nightly-2023-06-06-194351 True True False 165m Progressing: The registry is ready...
ingress 4.13.0-0.nightly-2023-06-06-194351 True False False 165m
insights 4.13.0-0.nightly-2023-06-06-194351 True False False 168m
kube-apiserver 4.13.0-0.nightly-2023-06-06-194351 True True True 171m NodeControllerDegraded: The master nodes not ready: node "huliu-az7c-svq9q-master-1" not ready since 2023-06-07 08:47:34 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
kube-controller-manager 4.13.0-0.nightly-2023-06-06-194351 True False True 171m NodeControllerDegraded: The master nodes not ready: node "huliu-az7c-svq9q-master-1" not ready since 2023-06-07 08:47:34 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
kube-scheduler 4.13.0-0.nightly-2023-06-06-194351 True False True 171m NodeControllerDegraded: The master nodes not ready: node "huliu-az7c-svq9q-master-1" not ready since 2023-06-07 08:47:34 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
kube-storage-version-migrator 4.13.0-0.nightly-2023-06-06-194351 True False False 106m
machine-api 4.13.0-0.nightly-2023-06-06-194351 True False False 167m
machine-approver 4.13.0-0.nightly-2023-06-06-194351 True False False 174m
machine-config 4.13.0-0.nightly-2023-06-06-194351 False False True 60m Cluster not available for [{operator 4.13.0-0.nightly-2023-06-06-194351}]: failed to apply machine config daemon manifests: error during waitForDaemonsetRollout: [timed out waiting for the condition, daemonset machine-config-daemon is not ready. status: (desired: 7, updated: 7, ready: 6, unavailable: 1)]
marketplace 4.13.0-0.nightly-2023-06-06-194351 True False False 174m
monitoring 4.13.0-0.nightly-2023-06-06-194351 True False False 106m
network 4.13.0-0.nightly-2023-06-06-194351 True True False 177m DaemonSet "/openshift-network-diagnostics/network-check-target" is not available (awaiting 1 nodes)...
node-tuning 4.13.0-0.nightly-2023-06-06-194351 True False False 173m
openshift-apiserver 4.13.0-0.nightly-2023-06-06-194351 True True True 107m APIServerDeploymentDegraded: 1 of 4 requested instances are unavailable for apiserver.openshift-apiserver ()
openshift-controller-manager 4.13.0-0.nightly-2023-06-06-194351 True False False 170m
openshift-samples 4.13.0-0.nightly-2023-06-06-194351 True False False 167m
operator-lifecycle-manager 4.13.0-0.nightly-2023-06-06-194351 True False False 174m
operator-lifecycle-manager-catalog 4.13.0-0.nightly-2023-06-06-194351 True False False 174m
operator-lifecycle-manager-packageserver 4.13.0-0.nightly-2023-06-06-194351 True False False 168m
service-ca 4.13.0-0.nightly-2023-06-06-194351 True False False 175m
storage 4.13.0-0.nightly-2023-06-06-194351 True True False 174m AzureDiskCSIDriverOperatorCRProgressing: AzureDiskDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods...
liuhuali@Lius-MacBook-Pro huali-test %
-----------------------
There might be an easier way to reproduce this: just roll a new revision in etcd, stop the kubelet, and observe the same issue.
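A rough sketch of that shorter reproducer, assuming the standard forceRedeploymentReason knob on the etcd operator CR is used to start the revision rollout (node names below are placeholders):

# Kick off a new etcd revision rollout; the reason string is arbitrary.
oc patch etcd cluster --type=merge \
  -p '{"spec":{"forceRedeploymentReason":"repro-'"$(date +%s)"'"}}'

# While the rollout is still in progress, stop the kubelet on one master
# (<master-node> is a placeholder for any control-plane node name).
oc debug node/<master-node> -- chroot /host systemctl stop kubelet

# If the MHC then remediates the node, the old machine should get stuck in
# Deleting, same as in the transcript above.
oc get machine -n openshift-machine-api -w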
Actual results:
CEO's member removal controller is getting stuck on the IsBootstrapComplete check that was introduced to fix another bug: https://github.com/openshift/cluster-etcd-operator/commit/c96150992a8aba3654835787be92188e947f557c#diff-d91047e39d2c1ab6b35e69359a24e83c19ad9b3e9ad4e44f9b1ac90e50f7b650R97. It turns out IsBootstrapComplete checks whether a revision is currently rolling out (which makes sense), and the one NotReady node with the kubelet stopped still has a revision in progress (revision 7, target 9). More info: https://issues.redhat.com/browse/OCPBUGS-14673?focusedId=22726712&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-22726712. This causes the etcd member to not be removed, which in turn blocks the vertical scale-down procedure from removing the pre-drain hook because the member is still present. Effectively you end up with a cluster of four control-plane machines, one of which is stuck in the Deleting state.
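For reference, the revision skew that IsBootstrapComplete trips over, and the member that never gets removed, can both be inspected from the CLI; a sketch (the etcd pod name is a placeholder):

# Per-node currentRevision/targetRevision on the etcd operator CR; a node whose
# kubelet is stopped can lag behind the target revision indefinitely.
oc get etcd cluster -o jsonpath='{range .status.nodeStatuses[*]}{.nodeName}{" current="}{.currentRevision}{" target="}{.targetRevision}{"\n"}{end}'

# List etcd members from a healthy etcd pod; the member for the NotReady node
# stays in the list instead of being removed.
oc rsh -n openshift-etcd <etcd-pod> etcdctl member list -w table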
Expected results:
The etcd member should be removed, and the machine/node should be deleted.
Additional info:
Removing the revision check does fix this issue reliably, but that might not be desirable: https://github.com/openshift/cluster-etcd-operator/pull/1087
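To confirm that it is the pre-drain hook holding the machine in Deleting, the hook can be checked on the stuck machine; a sketch, assuming the etcd operator's pre-drain lifecycle hook is still present (the machine name is a placeholder):

# The etcd operator keeps its preDrain lifecycle hook on the machine until the
# member is removed, which is what blocks the drain and the deletion.
oc get machine -n openshift-machine-api <stuck-master-machine> \
  -o jsonpath='{.spec.lifecycleHooks.preDrain}{"\n"}'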
Relates to: OCPBUGS-23044 [4.13] CEO prevents member deletion during revision rollout (Closed)