OpenShift Bugs / OCPBUGS-1329

etcd and kube-apiserver pods get restarted due to failed liveness probes while deleting/re-creating pods on SNO

    • Bug
    • Resolution: Done
    • Normal
    • 4.12.0
    • 4.12.0, 4.11, 4.10.z, 4.9.z, 4.8.z
    • Etcd
    • Critical
    • None
    • ETCD Sprint 225, ETCD Sprint 226
    • 2
    • Proposed
    • False

      Description of problem:

      etcd and kube-apiserver pods get restarted due to failed liveness probes while deleting/re-creating pods on SNO

      Version-Release number of selected component (if applicable):

      4.10.32

      How reproducible:

      Not always, after ~10 attempts

      Steps to Reproduce:

      1. Deploy SNO with the Telco DU profile applied
      2. Create multiple pods with local storage volumes attached (attaching the YAML manifest)
      3. Force delete and re-create the pods 10 times (a minimal loop is sketched below)
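
      A minimal loop for step 3 might look like the following. This is a hedged sketch: the namespace name "test" and the manifest file name "pods.yaml" are assumptions standing in for whatever the attached manifest actually uses.

      # Sketch of the delete/re-create loop (namespace and file name are assumptions)
      for i in $(seq 1 10); do
        oc delete pods --all -n test --force --grace-period=0
        oc apply -f pods.yaml
        oc wait pods --all -n test --for=condition=Ready --timeout=300s
      done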
      

      Actual results:

      etcd and kube-apiserver pods get restarted, making the cluster unavailable for a period of time

      Expected results:

      etcd and kube-apiserver do not get restarted

      Additional info:

      Attaching must-gather.
      
      Please let me know if any additional info is required. Thank you!
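
      For anyone triaging from a live cluster rather than the must-gather, a hedged sketch of how the restarts and the probe failures can be confirmed with standard oc commands (using the usual openshift-etcd and openshift-kube-apiserver namespaces):

      # Restart counts of the control plane static pods
      oc get pods -n openshift-etcd -o wide
      oc get pods -n openshift-kube-apiserver -o wide
      # Liveness probe failures show up as events on the pods
      oc get events -n openshift-etcd --sort-by=.lastTimestamp
      oc get events -n openshift-kube-apiserver --sort-by=.lastTimestamp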

            [OCPBUGS-1329] etcd and kube-apiserver pods get restarted due to failed liveness probes while deleting/re-creating pods on SNO

            Francois Rigault added a comment - edited

            Hi tjungblu@redhat.com, geliu, how did you validate this change?

            In the PR you mention a test with sha1sum; was it running within the etcd container, so that the niceness had any effect?

            Shouldn't the CPU scheduling be influenced more by the CPUWeight derived from the CPU request, which today is hardcoded at 300m:

            https://github.com/openshift/cluster-etcd-operator/blob/5223e3752616e8f3906254bfeccf3b75ce459872/bindata/etcd/pod.yaml#L196

            My understanding is that setting the niceness should not have any effect. In the attached YAML file there are multiple containers all running with larger requests; the scheduler will use those and give less weight to, e.g., etcd.
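
            (For context on that point: the kubelet derives the CFS weight from the CPU request as roughly shares = millicores * 1024 / 1000, so the hardcoded 300m gives the etcd container about 307 cpu.shares, while a neighbouring container requesting, say, 3000m gets about 3072. A hedged sketch of how the effective values could be inspected on the node; the container lookup and the grepped field names are assumptions and differ between cgroup v1 and v2:)

                # Sketch: inspect the CPU weight/shares actually applied to the etcd container
                oc debug node/<sno-node> -- chroot /host sh -c '
                  ID=$(crictl ps --name etcd -q | head -1)   # picks the first matching container
                  crictl inspect "$ID" | grep -iE "shares|weight|cgroupsPath"
                '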


            Errata Tool added a comment -

            Since the problem described in this issue should be resolved in a recent advisory, it has been closed.

            For information on the advisory, and where to find the updated files, follow the link below.

            If the solution does not work for you, open a new bug report.
            https://access.redhat.com/errata/RHSA-2022:7399


            Jennifer Chen added a comment -

            I observed the issue with 4.10.40 on a Dell server.

             

            fatal: [localhost]: FAILED! => {"changed": true, "cmd": ["oc", "--kubeconfig", "/tmp/cluster-auth/cloudransno-site1/kubeconfig", "get", "pods", "-n", "test", "-o", "wide"], "delta": "0:00:00.053406", "end": "2022-11-10 21:14:06.403229", "msg": "non-zero return code", "rc": 1, "start": "2022-11-10 21:14:06.349823", "stderr": "The connection to the server api.cloudransno-site1.ecr1.bos2.lab:6443 was refused - did you specify the right host or port?", "stderr_lines": ["The connection to the server api.cloudransno-site1.ecr1.bos2.lab:6443 was refused - did you specify the right host or port?"], "stdout": "",


            Thomas Jungblut added a comment -

            Sure.

            Marius Cornea added a comment -

            I wasn't able to reproduce the issue with a recent nightly 4.12 build (4.12.0-0.nightly-2022-10-18-192348). I am also seeing the etcd process niceness set according to the code change, so moving to verified.

                PID  NI CMD
               6809 -19 etcd --logger=zap --log-level=info --experimental-initial-corrupt-check=true --initial-advertise-peer-urls=https://[2620:52:0:198::10]:2380 --cert-file=/etc/kubernetes/static-pod-certs/secrets/etcd-all-certs/etcd-serving-sno.kni-qe-1.lab.eng.rdu2.redhat.com.crt --key-file=/etc/kubernetes/static-pod-certs/secrets/etcd-all-certs/et
               6873 18 etcd grpc-proxy start --endpoints https://[2620:52:0:198::10]:9978 --metrics-addr https://0.0.0.0:9979 --listen-addr 127.0.0.1:9977 --advertise-client-url  -key /etc/kubernetes/static-pod-certs/secrets/etcd-all-certs/etcd-peer-sno.kni-qe-1.lab.eng.rdu2.redhat.com.key --key-file /etc/kubernetes/static-pod-certs/secrets/etcd-all-c
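
            (For reference, a hedged sketch of how such a niceness reading can be taken on the SNO node; <node-name> is a placeholder for the actual node:)

                # Read the niceness (NI column) of the etcd processes on the node
                oc debug node/<node-name> -- chroot /host ps -e -o pid,ni,cmd | grep etcd | grep -v grep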

             

             


            Ge Liu added a comment - edited

            Hello mcornea@redhat.com, browsell@redhat.com, we can't deploy a cluster with the Telco DU profile applied. Could you please help to verify it, or provide some instructions on how to deploy it? Thanks.


            Marius Cornea added a comment -

            Adding another data point: the issue was observed on version 4.9.48 as well.


            Peter Hunt added a comment -

            Harshal, can you PTAL?

            (Getting the goroutine stacks often helps with these cases, see https://github.com/cri-o/cri-o/blob/main/tutorials/debugging.md#printing-go-routines; can folks grab those?)
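
            (A hedged sketch of grabbing those on the SNO node, assuming the SIGUSR1 mechanism described in the linked tutorial; the node name and the /tmp file pattern are assumptions, so defer to the tutorial and the CRI-O journal for the actual dump location:)

                # Ask CRI-O to dump its goroutine stacks, then look for the dump
                oc debug node/<node-name> -- chroot /host sh -c '
                  systemctl kill -s USR1 crio.service
                  sleep 2
                  journalctl -u crio --since "2 minutes ago" | grep -i goroutine
                  ls -l /tmp/crio-goroutine-stacks-*.log 2>/dev/null
                '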


              tjungblu@redhat.com Thomas Jungblut
              mcornea@redhat.com Marius Cornea
              Ge Liu Ge Liu
              Votes: 0
              Watchers: 16
