Loading...

XML

Word

Printable

Type: Bug
Resolution: Done
Priority: Normal
Fix Version/s: 4.18.0
Affects Version/s: 4.14, 4.15, 4.16, 4.17, 4.18
Component/s: Etcd
Labels:
- pre-merge-tested

Activity Type:
Quality / Stability / Reliability
Blocked:
False
Blocked Reason:
None
Story Points:
3
Severity:
Moderate
Regression:
Yes

Target Backport Versions:
None
Target Version:

4.18.0
Release Blocker:
None
Sprint:
ETCD Sprint 243, ETCD Sprint 245, ETCD Sprint 246
sprint_count:
3

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Priority Data:
PX Impact Score:

Release Note Status:
Done
Release Note Type:
Bug Fix
Release Note Text:

Hide
* Previously, when you stopped the kubelet service on a control plane node and deleted the machine, a new master machine was initiated, but the old one was stuck in the Deleting state. Deleting the machine triggered a new machine and a revision rollout. That rollout started with the etcd pod on the node where the kubelet service was stopped, which caused the rollout process to get stuck. Meanwhile, the member removal controller tried to remove the old etcd member, but the `IsBootstrapComplete` check failed due to the stalled revision rollout. That failure prevented the old member from being deleted and created a deadlock. The deletion was blocked by the revision stability check in the `IsBootstrapComplete` function.

As a consequence, the old deleted machine was stuck in the Deleting state and many cluster Operators were degraded.

With this update, the revision stability check is a false-positive check for whether the bootstrap is completed. The revision stability check is separated from the bootstrap completeness check, ensuring that only controllers that require the stability check call it.

As a result, the old machine that was stuck in the Deleting state can be successfully deleted.

Show
* Previously, when you stopped the kubelet service on a control plane node and deleted the machine, a new master machine was initiated, but the old one was stuck in the Deleting state. Deleting the machine triggered a new machine and a revision rollout. That rollout started with the etcd pod on the node where the kubelet service was stopped, which caused the rollout process to get stuck. Meanwhile, the member removal controller tried to remove the old etcd member, but the `IsBootstrapComplete` check failed due to the stalled revision rollout. That failure prevented the old member from being deleted and created a deadlock. The deletion was blocked by the revision stability check in the `IsBootstrapComplete` function. As a consequence, the old deleted machine was stuck in the Deleting state and many cluster Operators were degraded. With this update, the revision stability check is a false-positive check for whether the bootstrap is completed. The revision stability check is separated from the bootstrap completeness check, ensuring that only controllers that require the stability check call it. As a result, the old machine that was stuck in the Deleting state can be successfully deleted.

Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

this is case 2 from ~~OCPBUGS-14673~~

Description of problem:

MHC for control plane cannot work right for some cases

2.Stop the kubelet service on the master node, new master get Running, the old one stuck in Deleting, many co degraded.

This is a regression bug, because I tested this on 4.12 around September 2022, case 2 and case 3 work right.
https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-54326

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-06-05-112833
4.13.0-0.nightly-2023-06-06-194351
4.12.0-0.nightly-2023-06-07-005319

How reproducible:

Always

Steps to Reproduce:

1.Create MHC for control plane

apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: control-plane-health
  namespace: openshift-machine-api
spec:
  maxUnhealthy: 1
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-type: master
  unhealthyConditions:
  - status: "False"
    timeout: 300s
    type: Ready
  - status: "Unknown"
    timeout: 300s
    type: Ready


liuhuali@Lius-MacBook-Pro huali-test % oc create -f mhc-master3.yaml 
machinehealthcheck.machine.openshift.io/control-plane-health created
liuhuali@Lius-MacBook-Pro huali-test % oc get mhc
NAME                              MAXUNHEALTHY   EXPECTEDMACHINES   CURRENTHEALTHY
control-plane-health              1              3                  3
machine-api-termination-handler   100%           0                  0 

Case 2.Stop the kubelet service on the master node, new master get Running, the old one stuck in Deleting, many co degraded.
liuhuali@Lius-MacBook-Pro huali-test % oc debug node/huliu-az7c-svq9q-master-1 
Starting pod/huliu-az7c-svq9q-master-1-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.0.6
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-5.1# systemctl stop kubelet


Removing debug pod ...
liuhuali@Lius-MacBook-Pro huali-test % oc get node
NAME                                   STATUS   ROLES                  AGE   VERSION
huliu-az7c-svq9q-master-1              Ready    control-plane,master   95m   v1.26.5+7a891f0
huliu-az7c-svq9q-master-2              Ready    control-plane,master   95m   v1.26.5+7a891f0
huliu-az7c-svq9q-master-c96k8-0        Ready    control-plane,master   19m   v1.26.5+7a891f0
huliu-az7c-svq9q-worker-westus-5r8jf   Ready    worker                 34m   v1.26.5+7a891f0
huliu-az7c-svq9q-worker-westus-k747l   Ready    worker                 47m   v1.26.5+7a891f0
huliu-az7c-svq9q-worker-westus-r2vdn   Ready    worker                 83m   v1.26.5+7a891f0
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                   PHASE     TYPE              REGION   ZONE   AGE
huliu-az7c-svq9q-master-1              Running   Standard_D8s_v3   westus          97m
huliu-az7c-svq9q-master-2              Running   Standard_D8s_v3   westus          97m
huliu-az7c-svq9q-master-c96k8-0        Running   Standard_D8s_v3   westus          23m
huliu-az7c-svq9q-worker-westus-5r8jf   Running   Standard_D4s_v3   westus          39m
huliu-az7c-svq9q-worker-westus-k747l   Running   Standard_D4s_v3   westus          53m
huliu-az7c-svq9q-worker-westus-r2vdn   Running   Standard_D4s_v3   westus          91m
liuhuali@Lius-MacBook-Pro huali-test % oc get node
NAME                                   STATUS     ROLES                  AGE     VERSION
huliu-az7c-svq9q-master-1              NotReady   control-plane,master   107m    v1.26.5+7a891f0
huliu-az7c-svq9q-master-2              Ready      control-plane,master   107m    v1.26.5+7a891f0
huliu-az7c-svq9q-master-c96k8-0        Ready      control-plane,master   32m     v1.26.5+7a891f0
huliu-az7c-svq9q-master-jdhgg-1        Ready      control-plane,master   2m10s   v1.26.5+7a891f0
huliu-az7c-svq9q-worker-westus-5r8jf   Ready      worker                 46m     v1.26.5+7a891f0
huliu-az7c-svq9q-worker-westus-k747l   Ready      worker                 59m     v1.26.5+7a891f0
huliu-az7c-svq9q-worker-westus-r2vdn   Ready      worker                 95m     v1.26.5+7a891f0
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                   PHASE      TYPE              REGION   ZONE   AGE
huliu-az7c-svq9q-master-1              Deleting   Standard_D8s_v3   westus          110m
huliu-az7c-svq9q-master-2              Running    Standard_D8s_v3   westus          110m
huliu-az7c-svq9q-master-c96k8-0        Running    Standard_D8s_v3   westus          36m
huliu-az7c-svq9q-master-jdhgg-1        Running    Standard_D8s_v3   westus          5m55s
huliu-az7c-svq9q-worker-westus-5r8jf   Running    Standard_D4s_v3   westus          52m
huliu-az7c-svq9q-worker-westus-k747l   Running    Standard_D4s_v3   westus          65m
huliu-az7c-svq9q-worker-westus-r2vdn   Running    Standard_D4s_v3   westus          103m
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                   PHASE      TYPE              REGION   ZONE   AGE
huliu-az7c-svq9q-master-1              Deleting   Standard_D8s_v3   westus          3h
huliu-az7c-svq9q-master-2              Running    Standard_D8s_v3   westus          3h
huliu-az7c-svq9q-master-c96k8-0        Running    Standard_D8s_v3   westus          105m
huliu-az7c-svq9q-master-jdhgg-1        Running    Standard_D8s_v3   westus          75m
huliu-az7c-svq9q-worker-westus-5r8jf   Running    Standard_D4s_v3   westus          122m
huliu-az7c-svq9q-worker-westus-k747l   Running    Standard_D4s_v3   westus          135m
huliu-az7c-svq9q-worker-westus-r2vdn   Running    Standard_D4s_v3   westus          173m
liuhuali@Lius-MacBook-Pro huali-test % oc get node   
NAME                                   STATUS     ROLES                  AGE    VERSION
huliu-az7c-svq9q-master-1              NotReady   control-plane,master   178m   v1.26.5+7a891f0
huliu-az7c-svq9q-master-2              Ready      control-plane,master   178m   v1.26.5+7a891f0
huliu-az7c-svq9q-master-c96k8-0        Ready      control-plane,master   102m   v1.26.5+7a891f0
huliu-az7c-svq9q-master-jdhgg-1        Ready      control-plane,master   72m    v1.26.5+7a891f0
huliu-az7c-svq9q-worker-westus-5r8jf   Ready      worker                 116m   v1.26.5+7a891f0
huliu-az7c-svq9q-worker-westus-k747l   Ready      worker                 129m   v1.26.5+7a891f0
huliu-az7c-svq9q-worker-westus-r2vdn   Ready      worker                 165m   v1.26.5+7a891f0
liuhuali@Lius-MacBook-Pro huali-test % oc get co
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.13.0-0.nightly-2023-06-06-194351   True        True          True       107m    APIServerDeploymentDegraded: 1 of 4 requested instances are unavailable for apiserver.openshift-oauth-apiserver ()...
baremetal                                  4.13.0-0.nightly-2023-06-06-194351   True        False         False      174m    
cloud-controller-manager                   4.13.0-0.nightly-2023-06-06-194351   True        False         False      176m    
cloud-credential                           4.13.0-0.nightly-2023-06-06-194351   True        False         False      3h      
cluster-autoscaler                         4.13.0-0.nightly-2023-06-06-194351   True        False         False      173m    
config-operator                            4.13.0-0.nightly-2023-06-06-194351   True        False         False      175m    
console                                    4.13.0-0.nightly-2023-06-06-194351   True        False         False      136m    
control-plane-machine-set                  4.13.0-0.nightly-2023-06-06-194351   True        False         False      71m     
csi-snapshot-controller                    4.13.0-0.nightly-2023-06-06-194351   True        False         False      174m    
dns                                        4.13.0-0.nightly-2023-06-06-194351   True        True          False      173m    DNS "default" reports Progressing=True: "Have 6 available node-resolver pods, want 7."
etcd                                       4.13.0-0.nightly-2023-06-06-194351   True        True          True       173m    NodeControllerDegraded: The master nodes not ready: node "huliu-az7c-svq9q-master-1" not ready since 2023-06-07 08:47:34 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
image-registry                             4.13.0-0.nightly-2023-06-06-194351   True        True          False      165m    Progressing: The registry is ready...
ingress                                    4.13.0-0.nightly-2023-06-06-194351   True        False         False      165m    
insights                                   4.13.0-0.nightly-2023-06-06-194351   True        False         False      168m    
kube-apiserver                             4.13.0-0.nightly-2023-06-06-194351   True        True          True       171m    NodeControllerDegraded: The master nodes not ready: node "huliu-az7c-svq9q-master-1" not ready since 2023-06-07 08:47:34 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
kube-controller-manager                    4.13.0-0.nightly-2023-06-06-194351   True        False         True       171m    NodeControllerDegraded: The master nodes not ready: node "huliu-az7c-svq9q-master-1" not ready since 2023-06-07 08:47:34 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
kube-scheduler                             4.13.0-0.nightly-2023-06-06-194351   True        False         True       171m    NodeControllerDegraded: The master nodes not ready: node "huliu-az7c-svq9q-master-1" not ready since 2023-06-07 08:47:34 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
kube-storage-version-migrator              4.13.0-0.nightly-2023-06-06-194351   True        False         False      106m    
machine-api                                4.13.0-0.nightly-2023-06-06-194351   True        False         False      167m    
machine-approver                           4.13.0-0.nightly-2023-06-06-194351   True        False         False      174m    
machine-config                             4.13.0-0.nightly-2023-06-06-194351   False       False         True       60m     Cluster not available for [{operator 4.13.0-0.nightly-2023-06-06-194351}]: failed to apply machine config daemon manifests: error during waitForDaemonsetRollout: [timed out waiting for the condition, daemonset machine-config-daemon is not ready. status: (desired: 7, updated: 7, ready: 6, unavailable: 1)]
marketplace                                4.13.0-0.nightly-2023-06-06-194351   True        False         False      174m    
monitoring                                 4.13.0-0.nightly-2023-06-06-194351   True        False         False      106m    
network                                    4.13.0-0.nightly-2023-06-06-194351   True        True          False      177m    DaemonSet "/openshift-network-diagnostics/network-check-target" is not available (awaiting 1 nodes)...
node-tuning                                4.13.0-0.nightly-2023-06-06-194351   True        False         False      173m    
openshift-apiserver                        4.13.0-0.nightly-2023-06-06-194351   True        True          True       107m    APIServerDeploymentDegraded: 1 of 4 requested instances are unavailable for apiserver.openshift-apiserver ()
openshift-controller-manager               4.13.0-0.nightly-2023-06-06-194351   True        False         False      170m    
openshift-samples                          4.13.0-0.nightly-2023-06-06-194351   True        False         False      167m    
operator-lifecycle-manager                 4.13.0-0.nightly-2023-06-06-194351   True        False         False      174m    
operator-lifecycle-manager-catalog         4.13.0-0.nightly-2023-06-06-194351   True        False         False      174m    
operator-lifecycle-manager-packageserver   4.13.0-0.nightly-2023-06-06-194351   True        False         False      168m    
service-ca                                 4.13.0-0.nightly-2023-06-06-194351   True        False         False      175m    
storage                                    4.13.0-0.nightly-2023-06-06-194351   True        True          False      174m    AzureDiskCSIDriverOperatorCRProgressing: AzureDiskDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods...
liuhuali@Lius-MacBook-Pro huali-test % 

-----------------------

There might be an easier way by just rolling a revision in etcd, stopping kubelet and then observing the same issue.

Actual results:

CEO's member removal controller is getting stuck on the IsBootstrapComplete check that was introduced to fix another bug: 

 https://github.com/openshift/cluster-etcd-operator/commit/c96150992a8aba3654835787be92188e947f557c#diff-d91047e39d2c1ab6b35e69359a24e83c19ad9b3e9ad4e44f9b1ac90e50f7b650R97 

 turns out IsBootstrapComplete checks whether a revision is currently rolling out (makes sense) and that one NotReady node with kubelet gone still has a revision going (rev 7, target 9).

more info: https://issues.redhat.com/browse/OCPBUGS-14673?focusedId=22726712&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-22726712

This causes the etcd member to not be removed. 

Which in turn blocks the vertical scale-down procedure to remove the pre-drain hook as the member is still present. Effectively you end up with a cluster of 4 CP machines, where one is stuck in Deleting state.

Expected results:

The etcd member should be removed and the machine/node should be deleted

Additional info:

Removing the revision check does fix this issue reliably, but might not be desirable:
https://github.com/openshift/cluster-etcd-operator/pull/1087

blocks

OCPBUGS-21802 [4.14] CEO prevents member deletion during revision rollout

Closed

OCPBUGS-43003 CEO prevents member deletion during revision rollout

Closed

causes

OCPBUGS-43387 ABI vSphere cluster installation fails due to etcd operator stuck in In-Progress status

Verified

clones

OCPBUGS-14673 MHC for control plane cannot work right for some cases

Closed

OCPBUGS-23044 [4.13] CEO prevents member deletion during revision rollout

Closed

duplicates

OCPBUGS-32403 CEO prevents member deletion during revision rollout

Closed

is cloned by

OCPBUGS-21802 [4.14] CEO prevents member deletion during revision rollout

Closed

OCPBUGS-27333 [ocp4.13]CEO prevents member deletion during revision rollout

Closed

OCPBUGS-43003 CEO prevents member deletion during revision rollout

Closed

links to

openshift/cluster-etcd-operator#1087: OCPBUGS-17199: remove revision stability check from bootstrap complet…

openshift/cluster-etcd-operator#1167: Revert "OCPBUGS-17199: remove revision stability check from bootstrap complet…"

openshift/cluster-etcd-operator#1350: OCPBUGS-17199: remove stability check from bootstrap completeness

openshift/cluster-etcd-operator#1357: OCPBUGS-17199: Revert pull request https://github.com/openshift/cluster-etcd-operator/issues/1350

openshift/cluster-etcd-operator#1360: OCPBUGS-17199: Separated the revision stability check from the bootstrap completeness check

RHEA-2023:7198 rpm

RHEA-2024:6122 OpenShift Container Platform 4.18.z bug fix update

(1 duplicates, 3 is cloned by, 7 links to)

Assignee:: Jubitta John

Reporter:: Huali Liu

Need Info From:: None

Contributors:: None

QA Contact:: Sandeep Kundu

Doc Contact:: Laura Hinson

Votes:: 0 Vote for this issue

Watchers:: 17 Start watching this issue

Created:: 2023/08/02 1:06 PM

Updated:: 2025/09/13 4:05 PM

Resolved:: 2025/01/28 7:19 PM

Details

Description

Attachments

Issue Links

Easy Agile Planning Poker

Activity

People

Dates