Type: Bug
Resolution: Unresolved
Priority: Undefined
Fix Version/s: None
Affects Version/s: 4.21.0
Component/s: Etcd
Labels: None

Activity Type: Quality / Stability / Reliability
Blocked: False
Blocked Reason: None
Story Points: None
Severity: None
Regression: None
Target Backport Versions: None
Target Version: None
Release Blocker: None
Sprint: None
SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:
PX Impact Score:
Release Note Status: None
Release Note Type: None
Release Note Text: None
Escape Reason: None
Escape Impact: None
Corrective Measures: None
SDLC stage when should've been found: None

Description of problem:

During bootstrap we found that etcd stops entirely once the disk is full. It logs:

{"level":"fatal","ts":"2025-10-20T16:16:16.023600Z","caller":"backend/batch_tx.go:282","msg":"failed to commit tx","error":"write /var/lib/etcd/member/snap/db: no space left on device","stacktrace":"go.etcd.io/etcd/server/v3/mvcc/backend.(*batchTx).commit\n\tgo.etcd.io/etcd/server/v3/mvcc/backend/batch_tx.go:282\ngo.etcd.io/etcd/server/v3/mvcc/backend.(*batchTxBuffered).unsafeCommit\n\tgo.etcd.io/etcd/server/v3/mvcc/backend/batch_tx.go:379\ngo.etcd.io/etcd/server/v3/mvcc/backend.(*batchTxBuffered).commit\n\tgo.etcd.io/etcd/server/v3/mvcc/backend/batch_tx.go:357\ngo.etcd.io/etcd/server/v3/mvcc/backend.(*batchTxBuffered).Commit\n\tgo.etcd.io/etcd/server/v3/mvcc/backend/batch_tx.go:344\ngo.etcd.io/etcd/server/v3/mvcc/backend.(*backend).run\n\tgo.etcd.io/etcd/server/v3/mvcc/backend/backend.go:440"}

    
This is the last log line; nothing after it indicates that etcd is still responding. The process itself does not exit, however.
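To spot this condition when scanning bootstrap logs, the following is a minimal Go sketch that checks whether a JSON log line is a fatal entry caused by a full disk. It only assumes the `level` and `error` fields visible in the line above; `isDiskFullFatal` and `logEntry` are hypothetical names and not part of etcd or the operator.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// logEntry models only the fields of the JSON log line that this sketch needs.
type logEntry struct {
	Level string `json:"level"`
	Msg   string `json:"msg"`
	Error string `json:"error"`
}

// isDiskFullFatal reports whether a JSON log line is a fatal entry whose
// error text contains "no space left on device".
// Hypothetical helper for illustration only.
func isDiskFullFatal(line string) bool {
	var e logEntry
	if err := json.Unmarshal([]byte(line), &e); err != nil {
		return false // not a JSON log line
	}
	return e.Level == "fatal" && strings.Contains(e.Error, "no space left on device")
}

func main() {
	// Shortened copy of the log line from this report (stacktrace omitted).
	line := `{"level":"fatal","ts":"2025-10-20T16:16:16.023600Z","caller":"backend/batch_tx.go:282","msg":"failed to commit tx","error":"write /var/lib/etcd/member/snap/db: no space left on device"}`
	fmt.Println(isDiskFullFatal(line))
}
```

Matching on the error text is brittle, but it is the only signal available here, since the process keeps running after logging the fatal entry.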

Version-Release number of selected component (if applicable):

4.21, but the problem has likely existed since the beginning of the operator

How reproducible:

always

Steps to Reproduce:

    1. Fill up the disk during bootstrap.
    2. Wait for bootkube to fail.
    3. Observe that etcd is stuck.

cchun@redhat.com can help reproduce this via the initial ticket, MGMT-21812.

Actual results:

The bootkube etcd appears stuck and is not restarted by the kubelet.

Expected results:

The bootkube etcd is restarted and attempts to resume service.

Additional info:

The bootkube etcd has restartPolicy=Always set:
https://github.com/openshift/cluster-etcd-operator/blob/main/bindata/bootkube/bootstrap-manifests/etcd-member-pod.yaml#L90

Why is it not restarting here? Is it because we are missing the liveness probes?     

We expect some space to be freed in the background, specifically by the kubelet's container image GC, so we can assume that etcd would be operational after a restart.

Is caused by: OCPBUGS-62790 ABI vSphere Installation Failing Due to “No Space Left on Device” (Closed)

Assignee: Dean West
Reporter: Thomas Jungblut
Need Info From: None
Contributors: None
QA Contact: Ge Liu
Doc Contact: None
Votes: 0
Watchers: 3
Created: 2025/10/31 3:14 PM
Updated: 2025/12/08 7:08 PM
