Bug
Resolution: Unresolved
4.21.0
Quality / Stability / Reliability
Description of problem:
During bootstrap, we found that etcd stops entirely once the disk fills up. It logs the following fatal error:
{"level":"fatal","ts":"2025-10-20T16:16:16.023600Z","caller":"backend/batch_tx.go:282","msg":"failed to commit tx","error":"write /var/lib/etcd/member/snap/db: no space left on device","stacktrace":"go.etcd.io/etcd/server/v3/mvcc/backend.(*batchTx).commit\n\tgo.etcd.io/etcd/server/v3/mvcc/backend/batch_tx.go:282\ngo.etcd.io/etcd/server/v3/mvcc/backend.(*batchTxBuffered).unsafeCommit\n\tgo.etcd.io/etcd/server/v3/mvcc/backend/batch_tx.go:379\ngo.etcd.io/etcd/server/v3/mvcc/backend.(*batchTxBuffered).commit\n\tgo.etcd.io/etcd/server/v3/mvcc/backend/batch_tx.go:357\ngo.etcd.io/etcd/server/v3/mvcc/backend.(*batchTxBuffered).Commit\n\tgo.etcd.io/etcd/server/v3/mvcc/backend/batch_tx.go:344\ngo.etcd.io/etcd/server/v3/mvcc/backend.(*backend).run\n\tgo.etcd.io/etcd/server/v3/mvcc/backend/backend.go:440"}
This is the last log line; nothing after it indicates that etcd is still responding. The process itself does not exit, however.
Version-Release number of selected component (if applicable):
4.21, but the issue has likely existed since the beginning of the operator
How reproducible:
Always
Steps to Reproduce:
1. Fill up the disk during bootstrap.
2. Wait for bootkube to fail.
3. Observe that etcd is stuck.
cchun@redhat.com can help reproduce this through the initial ticket, MGMT-21812.
Actual results:
The bootkube etcd appears to be stuck and is not restarted by the kubelet.
Expected results:
The bootkube etcd is restarted and tries to resume service.
Additional info:
The bootkube etcd pod has restartPolicy=Always set:
https://github.com/openshift/cluster-etcd-operator/blob/main/bindata/bootkube/bootstrap-manifests/etcd-member-pod.yaml#L90
Why is it not restarting here? Is it because the pod has no liveness probes? We expect some disk space to be freed in the background, specifically by the kubelet's container image GC, so we can assume that etcd would be operational again after a restart.
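
On the liveness probe question: a restart policy only takes effect once a container terminates or a liveness probe fails. A process that hangs without exiting therefore still looks like a running container to the kubelet. The following is a minimal sketch of the kind of check a liveness probe could perform, not the operator's actual probe. It assumes etcd's HTTP health endpoint is reachable without client certificates; the URL in the usage note and the timeout value are hypothetical.

```python
import http.client
import urllib.error
import urllib.request


def is_alive(health_url: str, timeout_seconds: float = 2.0) -> bool:
    """Return True only if health_url answers HTTP 200 within the timeout.

    Assumption: the endpoint reports health through its status code.
    """
    try:
        with urllib.request.urlopen(health_url, timeout=timeout_seconds) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Connection refused, timeout, error status (e.g. 503), broken
        # response, or malformed URL: all count as "not alive".
        return False
```

For example, `is_alive("http://127.0.0.1:2379/health")` (hypothetical address) would return False for a stuck etcd that no longer answers requests, which is the signal a liveness probe needs in order to trigger a restart.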