OCPBUGS-10353

kube-apiserver not receiving or processing shutdown signal after CoreOS 9.2 bump


      Description of problem:

      After the CoreOS 9.2 bump, we are seeing disruptions during node upgrade that affect all backends going through the kube-apiserver. Our investigation shows that the shutdown signal is missing on the kube-apiserver during node reboot. This can be observed with most, if not all, micro upgrade jobs.
      
      My analysis is based on a comparison of the following two jobs:
      
      pre-9.2 4.13 job: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.13-e2e-aws-sdn-upgrade/1634235926655799296
      
      post-9.2 4.14 job: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-sdn-upgrade/1635210801117663232
      
      From the Spyglass chart we clearly see disruptions in the 4.14 job during node upgrade. The first disruption happens at 10:57:26, one second after node 10.0.232.149 started rebooting at 10:57:25. The shutdown signal is missing from both the events (see attached) and the kube-apiserver log: https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%7B%22datasource%22:%22Grafana%20Cloud%22,%22queries%22:%5B%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fperiodic-ci-openshift-release-master-nightly-4.14-e2e-aws-sdn-upgrade%2F1635210801117663232%5C%22%7D%20%7C%20unpack%20%20%7C%20pod_name%3D%5C%22kube-apiserver-ip-10-0-232-149.us-west-1.compute.internal%5C%22%22,%22refId%22:%22A%22%7D%5D,%22range%22:%7B%22from%22:%221678099663000%22,%22to%22:%221678705170000%22%7D%7D
      
      
      For comparison, in the pre-9.2 4.13 job the reboot of node 10.0.220.42 started at 18:03:45, and the shutdown signal is observed in both the events (see attached) and the kube-apiserver log:
      
      2023-03-10 18:03:46	
      I0310 18:03:45.807438      15 genericapiserver.go:978] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-kube-apiserver", Name:"kube-apiserver-ip-10-0-220-42.us-west-1.compute.internal", UID:"", APIVersion:"v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'ShutdownInitiated' Received signal to terminate, becoming unready, but keeping serving 
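      
      To check for this event on a live cluster shortly after a node reboot, a minimal client-go sketch like the one below can be used (this is not part of the CI tooling; the kubeconfig handling and output format are illustrative assumptions, and events expire, so it only helps soon after the reboot):
      
      // check_shutdown_events.go: minimal sketch, assuming cluster access via $KUBECONFIG.
      // It lists events with reason ShutdownInitiated in openshift-kube-apiserver.
      package main
      
      import (
          "context"
          "fmt"
          "os"
      
          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/client-go/kubernetes"
          "k8s.io/client-go/tools/clientcmd"
      )
      
      func main() {
          // Build a client from the kubeconfig pointed to by $KUBECONFIG.
          cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
          if err != nil {
              panic(err)
          }
          client := kubernetes.NewForConfigOrDie(cfg)
      
          // The event reason is usable as a field selector.
          events, err := client.CoreV1().Events("openshift-kube-apiserver").List(context.TODO(),
              metav1.ListOptions{FieldSelector: "reason=ShutdownInitiated"})
          if err != nil {
              panic(err)
          }
          if len(events.Items) == 0 {
              fmt.Println("no ShutdownInitiated events found")
              return
          }
          for _, e := range events.Items {
              fmt.Printf("%s %s: %s\n", e.LastTimestamp.Format("15:04:05"), e.InvolvedObject.Name, e.Message)
          }
      }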
      
      
      It is my understanding that the shutdown signal is needed for the readyz check to start returning false. Without it, the load balancer will not remove the endpoint, resulting in the disruption.
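      
      For reference, the graceful-shutdown pattern the apiserver relies on looks roughly like the sketch below: on SIGTERM the process flips /readyz to failing while it keeps serving, giving the load balancer a window to drain the endpoint before the listener is shut down. This is only an illustrative sketch (the port, drain duration, and handlers are made up), not the actual kube-apiserver code:
      
      // graceful.go: illustrative sketch of readiness-gated graceful shutdown,
      // not the real kube-apiserver implementation.
      package main
      
      import (
          "context"
          "log"
          "net/http"
          "os"
          "os/signal"
          "sync/atomic"
          "syscall"
          "time"
      )
      
      func main() {
          var shuttingDown atomic.Bool
      
          mux := http.NewServeMux()
          // /readyz starts failing as soon as the termination signal arrives,
          // so the load balancer can remove this endpoint while it is still serving.
          mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
              if shuttingDown.Load() {
                  http.Error(w, "shutting down", http.StatusInternalServerError)
                  return
              }
              w.Write([]byte("ok"))
          })
          // Regular requests keep being served during the drain window.
          mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
              w.Write([]byte("ok"))
          })
      
          srv := &http.Server{Addr: ":8080", Handler: mux}
      
          sigCh := make(chan os.Signal, 1)
          signal.Notify(sigCh, syscall.SIGTERM)
      
          go func() {
              <-sigCh
              log.Println("Received signal to terminate, becoming unready, but keeping serving")
              shuttingDown.Store(true)
              // Drain window: keep serving until the load balancer has seen /readyz fail.
              time.Sleep(30 * time.Second)
              ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
              defer cancel()
              srv.Shutdown(ctx)
          }()
      
          if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
              log.Fatal(err)
          }
      }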
      
      The same can be observed in post-9.2 4.13 jobs and in 4.14 Azure micro upgrade jobs.

      Version-Release number of selected component (if applicable):

       

      How reproducible:

       

      Steps to Reproduce:

      1.
      2.
      3.
      

      Actual results:

       

      Expected results:

       

      Additional info:

       
