OCPBUGS-2800: [4.10] etcd and kube-apiserver pods get restarted due to failed liveness probes while deleting/re-creating pods on SNO

    • Type: Bug
    • Resolution: Done
    • Priority: Minor
    • Affects Versions: 4.12.0, 4.11, 4.10.z, 4.9.z, 4.8.z
    • Component: Etcd
    • Severity: Low
    • Sprint: ETCD Sprint 227
    • Release Blocker: Rejected

      This is a clone of issue OCPBUGS-2113. The following is the description of the original issue:

      This is a clone of issue OCPBUGS-1329. The following is the description of the original issue:
      Description of problem:

      etcd and kube-apiserver pods get restarted due to failed liveness probes while deleting/re-creating pods on SNO

      Version-Release number of selected component (if applicable):

      4.10.32

      How reproducible:

      Not always, after ~10 attempts

      Steps to Reproduce:

      1. Deploy SNO with Telco DU profile applied
      2. Create multiple pods with local storage volumes attached (YAML manifest attached)
      3. Force delete and re-create the pods 10 times (a rough sketch of this loop follows below)
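      As an illustration of step 3, the force delete/re-create loop might look like the following sketch; the namespace, label selector, and manifest name are placeholders rather than values from the attached reproducer.

      # hypothetical delete/re-create loop for step 3 (namespace, labels, and manifest are placeholders)
      for i in $(seq 1 10); do
        oc delete pods -n du-test -l app=du-workload --grace-period=0 --force
        oc apply -n du-test -f pods-with-local-storage.yaml
        oc wait pods -n du-test -l app=du-workload --for=condition=Ready --timeout=300s
      done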
      

      Actual results:

      etcd and kube-apiserver pods get restarted, making the cluster unavailable for a period of time

      Expected results:

      etcd and kube-apiserver do not get restarted

      Additional info:

      Attaching must-gather.
      
      Please let me know if any additional info is required. Thank you!
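      For reference, the attached must-gather is typically collected with something along these lines; the destination directory is an arbitrary example.

      # generic must-gather collection (destination path is an arbitrary example)
      oc adm must-gather --dest-dir=./must-gather-ocpbugs-2800
      tar czf must-gather-ocpbugs-2800.tar.gz ./must-gather-ocpbugs-2800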

            Comments:

            Rodrigo Lopez Manrique (Inactive) added a comment -

            Thanks for the answer. I finally opened: https://issues.redhat.com/browse/OCPBUGS-3870

            I cannot tell about etcd; I can see it has been restarted, though. How can I check etcd unavailability?
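            One generic way to check whether etcd itself restarted or was unavailable (standard oc calls, not commands taken from this thread; the pod name is a placeholder):

            # inspect etcd pod restart counts and the previous container's log
            oc get pods -n openshift-etcd -o wide
            oc describe pod -n openshift-etcd etcd-<node-name> | grep -A5 "Last State"
            oc logs -n openshift-etcd etcd-<node-name> -c etcd --previous | grep -iE "timed out|leader|slow"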

            Thomas Jungblut added a comment -

            rlopezma@redhat.com keep in mind that we've not fixed the API unavailability, only the etcd one. Do you see etcd being unavailable?

            Rodrigo Lopez Manrique (Inactive) added a comment (edited) -

            Hi,

            I hit it as well on version 4.10.40.

            The steps were:

            • Deploy eDU pods
            • Reboot the SNO server
            • Redeploy the eDU pods
            • API unavailable for a few minutes

            It is random; this is the first time I have hit the error after a large number of repetitions, ~50.

            I will attach oc adm must-gather output for the record, but I will open a new bug.
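            A simple way to watch for the API unavailability window during such a reboot/redeploy cycle (a generic probe sketch, not taken from this thread):

            # poll the API server readiness endpoint while the eDU pods are redeployed
            while true; do
              if oc get --raw /readyz >/dev/null 2>&1; then
                echo "$(date -u +%T) API OK"
              else
                echo "$(date -u +%T) API unavailable"
              fi
              sleep 5
            done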

            Jennifer Chen added a comment -

            geliu We tested against 4.10.40 (hub cluster 4.10.40 and SNO 4.10.40) and observed the issue near the end of the pipelines. It happened occasionally (not always).

            1. The test deploys and un-deploys the eDUs.
            2. Deploy the eDUs, gracefully reboot the SNO, and un-deploy the eDUs.
            3. Deploy the eDUs, power cycle the SNO, and un-deploy the eDUs.
            4. Keep deploying and un-deploying the eDUs for cycles.
            5. Gracefully reboot the SNO.
            6. Power cycle the SNO.
            7. Keep deploying and un-deploying the eDUs for cycles, then use Prometheus to query CPU and memory usage.

            We observed the issue 3 times: the 1st and 2nd times at step 7, and the 3rd time at step 4 when I increased the cycle count to a large number.

            Currently I do not have OCP logs, only Ansible logs. Please let me know if you want the Ansible logs.
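            For step 7, CPU and memory usage can be pulled from the in-cluster Prometheus with something like the sketch below; the route, token handling, and metric name are standard OpenShift monitoring conventions assumed here, not commands from this thread.

            # query control-plane CPU usage through the thanos-querier route
            TOKEN=$(oc whoami -t)   # assumes the current oc session has a bearer token
            HOST=$(oc get route thanos-querier -n openshift-monitoring -o jsonpath='{.spec.host}')
            curl -skG "https://$HOST/api/v1/query" \
              -H "Authorization: Bearer $TOKEN" \
              --data-urlencode 'query=sum(rate(container_cpu_usage_seconds_total{namespace="openshift-etcd"}[5m]))'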

            Ge Liu added a comment -

            jenchen@redhat.com could you please provide more info about your reproduction steps, and a must-gather log if possible? Thanks!

            Jennifer Chen added a comment -

            I observed the issue with 4.10.40 on a Dell server.

            fatal: [localhost]: FAILED! => {"changed": true, "cmd": ["oc", "--kubeconfig", "/tmp/cluster-auth/cloudransno-site1/kubeconfig", "get", "pods", "-n", "test", "-o", "wide"], "delta": "0:00:00.053406", "end": "2022-11-10 21:14:06.403229", "msg": "non-zero return code", "rc": 1, "start": "2022-11-10 21:14:06.349823", "stderr": "The connection to the server api.cloudransno-site1.ecr1.bos2.lab:6443 was refused - did you specify the right host or port?", "stderr_lines": ["The connection to the server api.cloudransno-site1.ecr1.bos2.lab:6443 was refused - did you specify the right host or port?"], "stdout": "",

            Errata Tool added a comment -

            Since the problem described in this issue should be resolved in a recent advisory, it has been closed.

            For information on the advisory, and where to find the updated files, follow the link below.

            If the solution does not work for you, open a new bug report.
            https://access.redhat.com/errata/RHBA-2022:7298

            Ge Liu added a comment -

            Verified with 4.10.0-0.nightly-2022-11-03-040152: deployed an SNO cluster, installed the Local Storage Operator, created a PV, created pods with the PV attached, deleted and re-created the pods several times, and checked the pods.
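            After the delete/re-create cycles, control-plane restart counts can be confirmed with a generic check like this (not the exact commands used in the verification above):

            # confirm the etcd and kube-apiserver containers were not restarted
            oc get pods -n openshift-etcd \
              -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[*].restartCount
            oc get pods -n openshift-kube-apiserver \
              -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[*].restartCount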

            Thomas Jungblut added a comment -

            Reducing the prio here to avoid the bot setting the blocker flags.

              Assignee: Thomas Jungblut (tjungblu@redhat.com)
              Reporter: OpenShift Prow Bot (openshift-crt-jira-prow)
              QA Contact: Ge Liu
              Votes: 0
              Watchers: 10