Type: Bug
Resolution: Done
Priority: Normal
Versions: 4.12.0, 4.11, 4.10.z, 4.9.z, 4.8.z
Description of problem:
etcd and kube-apiserver pods get restarted due to failed liveness probes while deleting/re-creating pods on SNO
Version-Release number of selected component (if applicable):
4.10.32
How reproducible:
Not always, after ~10 attempts
Steps to Reproduce:
1. Deploy SNO with the Telco DU profile applied.
2. Create multiple pods with local storage volumes attached (YAML manifest attached; see the sketch after this list).
3. Force delete and re-create the pods 10 times.
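A minimal sketch of the kind of pod used in step 2, assuming a hostPath-backed local volume; the pod name, image, and path are hypothetical stand-ins for the attached manifest. Force deletion in step 3 can be done with "oc delete pod <name> --grace-period=0 --force".

# Hypothetical pod with a local storage volume (illustrative stand-in for the attached manifest)
apiVersion: v1
kind: Pod
metadata:
  name: local-storage-test-0
spec:
  containers:
  - name: app
    image: registry.access.redhat.com/ubi8/ubi-minimal
    command: ["sleep", "infinity"]
    volumeMounts:
    - name: local-vol
      mountPath: /data                  # mount the local volume into the container
  volumes:
  - name: local-vol
    hostPath:
      path: /var/local-storage/vol0     # hypothetical host path
      type: DirectoryOrCreate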
Actual results:
etcd and kube-apiserver pods get restarted, making the cluster unavailable for a period of time
Expected results:
etcd and kube-apiserver do not get restarted
Additional info:
Attaching must-gather. Please let me know if any additional info is required. Thank you!
blocks:
OCPBUGS-2113 [4.11] etcd and kube-apiserver pods get restarted due to failed liveness probes while deleting/re-creating pods on SNO (Closed)
is cloned by:
OCPBUGS-2113 [4.11] etcd and kube-apiserver pods get restarted due to failed liveness probes while deleting/re-creating pods on SNO (Closed)
links to:
Hi tjungblu@redhat.com, geliu, how did you validate this change?
In the PR you mention a test with sha1sum; was it running within the etcd container, so that the niceness had any effect?
Shouldn't CPU scheduling be influenced more by the CPUWeight derived from the CPU request, which today is hardcoded at 300m:
https://github.com/openshift/cluster-etcd-operator/blob/5223e3752616e8f3906254bfeccf3b75ce459872/bindata/etcd/pod.yaml#L196
My understanding is that setting the niceness should not have any effect. In the attached YAML file there are multiple containers, all running with larger requests; the scheduler will use those and give less weight to e.g. etcd.
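For context, a rough sketch of what that weighting looks like, assuming the standard kubelet conversion from CPU requests to cgroup shares; the 300m figure comes from the linked pod.yaml, everything else here is illustrative rather than the operator's actual manifest.

# etcd container (abridged): the CPU request is what the kubelet turns into the cgroup CPU weight
resources:
  requests:
    cpu: 300m   # cgroup v1: cpu.shares = 300 * 1024 / 1000 ≈ 307
                # cgroup v2: cpu.weight is derived from those shares
# A neighbouring workload container with a larger request, e.g.
#   requests:
#     cpu: "2"  # ≈ 2048 shares, so it outweighs etcd in the CPU scheduler,
#               # independent of any nice value set on the etcd process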