Type: Bug
Resolution: Done-Errata
Priority: Major
Fix Version/s: 4.19.0
Affects Version/s: 4.18.0, 4.19.0
Component/s: kube-apiserver
Labels:
None

Activity Type:
Quality / Stability / Reliability
Blocked:
False
Blocked Reason:

None
Story Points:
None
Severity:
None
Regression:
None

Target Backport Versions:
None
Target Version:

4.19.0
Release Blocker:
Rejected
Sprint:
None

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Release Note Status:
Done
Release Note Type:
Bug Fix
Release Note Text:

* Previously, cluster bootstrap removal could break kube-apiserver readiness if etcd access was lost, which could lead to downtime. With this release, each kube-apiserver has 2 stable etcd endpoints before the bootstrap node is removed, which maintains availability during rollout. (link:https://issues.redhat.com/browse/OCPBUGS-48673[OCPBUGS-48673])

Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

During cluster bootstrap, disruption can occur when a kube-apiserver instance has no access to any live etcd endpoint. This happens in one very specific scenario:

1. A kube-apiserver instance is running on a node at revision 1. Its etcd-servers list contains only the bootstrap node IP and localhost.
2. The bootstrap node is deleted, so the etcd instance that was running on it becomes unavailable.
3. The etcd instance running on the same node as that kube-apiserver instance is rolled out to a new revision, so it also becomes unavailable for a time.

When both of these events happen while a kube-apiserver instance is still on revision 1, that instance has no live etcd endpoint left and its readyz probe fails.
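The failure condition can be modelled in a few lines of Go. This is only an illustrative sketch: `hasLiveEndpoint` and the endpoint addresses used with it are hypothetical and are not taken from the kube-apiserver or operator code.

```go
package main

// hasLiveEndpoint reports whether at least one endpoint in etcdServers
// is not marked as down. It is a hypothetical helper that models the
// condition under which a kube-apiserver instance can still reach etcd.
func hasLiveEndpoint(etcdServers []string, down map[string]bool) bool {
	for _, endpoint := range etcdServers {
		if !down[endpoint] {
			return true
		}
	}
	return false
}
```

With a revision 1 list made of the bootstrap endpoint and localhost, marking only the bootstrap endpoint as down still leaves localhost reachable. Marking both as down leaves nothing, which corresponds to the failing readyz probe.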

The suggested fix is to add a check in cluster-bootstrap that ensures every kube-apiserver pod has at least 2 etcd-servers entries that are neither the bootstrap node nor localhost before the bootstrap node is removed.
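A minimal sketch of such a check is shown here in Go. It is not the code from the linked pull request. The names `countStableEndpoints`, `safeToRemoveBootstrap`, and `minStableEndpoints` are hypothetical, as is the assumption that each pod's etcd-servers list is available as a slice of URLs.

```go
package main

import (
	"net"
	"net/url"
)

// minStableEndpoints is the threshold named in the issue description.
const minStableEndpoints = 2

// isLocal reports whether host is "localhost" or a loopback IP address.
func isLocal(host string) bool {
	if host == "localhost" {
		return true
	}
	ip := net.ParseIP(host)
	return ip != nil && ip.IsLoopback()
}

// countStableEndpoints counts the entries of etcdServers whose host is
// neither the bootstrap host nor a local address. Entries that fail to
// parse as URLs are skipped rather than counted as stable.
func countStableEndpoints(etcdServers []string, bootstrapHost string) int {
	count := 0
	for _, raw := range etcdServers {
		u, err := url.Parse(raw)
		if err != nil {
			continue
		}
		host := u.Hostname()
		if host == "" || host == bootstrapHost || isLocal(host) {
			continue
		}
		count++
	}
	return count
}

// safeToRemoveBootstrap returns true only if every kube-apiserver pod
// (keyed by a hypothetical pod name) has at least minStableEndpoints
// stable etcd endpoints. With no pods known, it conservatively returns false.
func safeToRemoveBootstrap(podEtcdServers map[string][]string, bootstrapHost string) bool {
	if len(podEtcdServers) == 0 {
		return false
	}
	for _, servers := range podEtcdServers {
		if countStableEndpoints(servers, bootstrapHost) < minStableEndpoints {
			return false
		}
	}
	return true
}
```

In this sketch a pod whose list holds only the bootstrap IP and localhost blocks bootstrap removal. Once its list also holds two other etcd members, the check passes.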

Job where this is happening: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-etcd-operator/1387/pull-ci-openshift-cluster-etcd-operator-master-e2e-aws-ovn-serial/1880358740390055936

links to

openshift/cluster-kube-apiserver-operator#1792: OCPBUGS-48673: targetconfigcontroller: check live etcd endpoints

RHEA-2024:11038 OpenShift Container Platform 4.19.z bug fix update

Assignee: Damien Grisonnet

Reporter: Damien Grisonnet

Need Info From: None

Contributors: None

QA Contact: Ke Wang

Doc Contact: None

Votes: 0

Watchers: 6

Created: 2025/01/21 1:11 PM

Updated: 2025/07/16 1:44 PM

Resolved: 2025/06/17 4:53 PM
