OpenShift Bugs / OCPBUGS-1329

etcd and kube-apiserver pods get restarted due to failed liveness probes while deleting/re-creating pods on SNO

    • Bug
    • Resolution: Done
    • Normal
    • 4.12.0
    • 4.12.0, 4.11, 4.10.z, 4.9.z, 4.8.z
    • Etcd
    • Critical
    • None
    • ETCD Sprint 225, ETCD Sprint 226
    • 2
    • Proposed
    • False

      Description of problem:

      etcd and kube-apiserver pods get restarted due to failed liveness probes while deleting/re-creating pods on SNO

      Version-Release number of selected component (if applicable):

      4.10.32

      How reproducible:

      Not always, after ~10 attempts

      Steps to Reproduce:

      1. Deploy SNO with the Telco DU profile applied
      2. Create multiple pods with local storage volumes attached (attaching the YAML manifest)
      3. Force delete and re-create the pods 10 times (a minimal loop is sketched below)
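
      A minimal loop for step 3 might look like the following. This is a hedged sketch: the namespace name "test" and the manifest file name "pods.yaml" are assumptions standing in for whatever the attached manifest actually uses.

      # Sketch of the delete/re-create loop (namespace and file name are assumptions)
      for i in $(seq 1 10); do
        oc delete pods --all -n test --force --grace-period=0
        oc apply -f pods.yaml
        oc wait pods --all -n test --for=condition=Ready --timeout=300s
      done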
      

      Actual results:

      etcd and kube-apiserver pods get restarted, making the cluster unavailable for a period of time

      Expected results:

      etcd and kube-apiserver do not get restarted

      Additional info:

      Attaching must-gather.
      
      Please let me know if any additional info is required. Thank you!
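
      For anyone triaging from a live cluster rather than the must-gather, a hedged sketch of how the restarts and the probe failures can be confirmed with standard oc commands (using the usual openshift-etcd and openshift-kube-apiserver namespaces):

      # Restart counts of the control plane static pods
      oc get pods -n openshift-etcd -o wide
      oc get pods -n openshift-kube-apiserver -o wide
      # Liveness probe failures show up as events on the pods
      oc get events -n openshift-etcd --sort-by=.lastTimestamp
      oc get events -n openshift-kube-apiserver --sort-by=.lastTimestamp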

            [OCPBUGS-1329] etcd and kube-apiserver pods get restarted due to failed liveness probes while deleting/re-creating pods on SNO

            Francois Rigault added a comment - edited

            Hi tjungblu@redhat.com, geliu, how did you validate this change?

            In the PR you mention a test with sha1sum; was it running within the etcd container, so that the niceness had any effect?

            Shouldn't the CPU scheduling be influenced more by the CPUWeight derived from the CPU request, which today is hardcoded at 300m:

            https://github.com/openshift/cluster-etcd-operator/blob/5223e3752616e8f3906254bfeccf3b75ce459872/bindata/etcd/pod.yaml#L196

            My understanding is that setting the niceness should not have any effect. In the attached YAML file there are multiple containers all running with larger requests; the scheduler will use those and give less weight to, e.g., etcd.
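
            (For context on that point: the kubelet derives the CFS weight from the CPU request as roughly shares = millicores * 1024 / 1000, so the hardcoded 300m gives the etcd container about 307 cpu.shares, while a neighbouring container requesting, say, 3000m gets about 3072. A hedged sketch of how the effective values could be inspected on the node; the container lookup and the grepped field names are assumptions and differ between cgroup v1 and v2:)

                # Sketch: inspect the CPU weight/shares actually applied to the etcd container
                oc debug node/<sno-node> -- chroot /host sh -c '
                  ID=$(crictl ps --name etcd -q | head -1)   # picks the first matching container
                  crictl inspect "$ID" | grep -iE "shares|weight|cgroupsPath"
                '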


            Errata Tool added a comment -

            Since the problem described in this issue should be resolved in a recent advisory, it has been closed.

            For information on the advisory, and where to find the updated files, follow the link below.

            If the solution does not work for you, open a new bug report.
            https://access.redhat.com/errata/RHSA-2022:7399


            Jennifer Chen added a comment -

            I observed the issue with 4.10.40 on a Dell server.

             

            fatal: [localhost]: FAILED! => {"changed": true, "cmd": ["oc", "--kubeconfig", "/tmp/cluster-auth/cloudransno-site1/kubeconfig", "get", "pods", "-n", "test", "-o", "wide"], "delta": "0:00:00.053406", "end": "2022-11-10 21:14:06.403229", "msg": "non-zero return code", "rc": 1, "start": "2022-11-10 21:14:06.349823", "stderr": "The connection to the server api.cloudransno-site1.ecr1.bos2.lab:6443 was refused - did you specify the right host or port?", "stderr_lines": ["The connection to the server api.cloudransno-site1.ecr1.bos2.lab:6443 was refused - did you specify the right host or port?"], "stdout": "",


            Thomas Jungblut added a comment -

            Sure.

            Marius Cornea added a comment -

            I wasn't able to reproduce the issue with a recent nightly 4.12 build (4.12.0-0.nightly-2022-10-18-192348). I am also seeing the etcd process niceness set according to the code change, so moving to verified.

                PID  NI CMD
               6809 -19 etcd --logger=zap --log-level=info --experimental-initial-corrupt-check=true --initial-advertise-peer-urls=https://[2620:52:0:198::10]:2380 --cert-file=/etc/kubernetes/static-pod-certs/secrets/etcd-all-certs/etcd-serving-sno.kni-qe-1.lab.eng.rdu2.redhat.com.crt --key-file=/etc/kubernetes/static-pod-certs/secrets/etcd-all-certs/et
               6873 18 etcd grpc-proxy start --endpoints https://[2620:52:0:198::10]:9978 --metrics-addr https://0.0.0.0:9979 --listen-addr 127.0.0.1:9977 --advertise-client-url  -key /etc/kubernetes/static-pod-certs/secrets/etcd-all-certs/etcd-peer-sno.kni-qe-1.lab.eng.rdu2.redhat.com.key --key-file /etc/kubernetes/static-pod-certs/secrets/etcd-all-c
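
            (For reference, a hedged sketch of how such a niceness reading can be taken on the SNO node; <node-name> is a placeholder for the actual node:)

                # Read the niceness (NI column) of the etcd processes on the node
                oc debug node/<node-name> -- chroot /host ps -e -o pid,ni,cmd | grep etcd | grep -v grep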

             

             


            Ge Liu added a comment - edited

            Hello mcornea@redhat.com, browsell@redhat.com, we can't deploy a cluster with the Telco DU profile applied. Could you please help to verify it, or provide some instructions on how to deploy it? Thanks.


            Marius Cornea added a comment -

            Adding another data point: the issue was observed on version 4.9.48 as well.


            Peter Hunt added a comment -

            Harshal, can you PTAL?

            (Getting the goroutine stacks often helps with these cases, see https://github.com/cri-o/cri-o/blob/main/tutorials/debugging.md#printing-go-routines; can folks grab those?)
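
            (A hedged sketch of grabbing those on the SNO node, assuming the SIGUSR1 mechanism described in the linked tutorial; the node name and the /tmp file pattern are assumptions, so defer to the tutorial and the CRI-O journal for the actual dump location:)

                # Ask CRI-O to dump its goroutine stacks, then look for the dump
                oc debug node/<node-name> -- chroot /host sh -c '
                  systemctl kill -s USR1 crio.service
                  sleep 2
                  journalctl -u crio --since "2 minutes ago" | grep -i goroutine
                  ls -l /tmp/crio-goroutine-stacks-*.log 2>/dev/null
                '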


              tjungblu@redhat.com Thomas Jungblut
              mcornea@redhat.com Marius Cornea
              Ge Liu Ge Liu
              Votes: 0
              Watchers: 16
