OpenShift Bugs / OCPBUGS-75200

kubelet terminates kube-apiserver gracefully failing consistently in several jobs


      Component Readiness has found multiple regressions for the following test across a number of jobs and presubmits. 4.21 appears unaffected, so this looks like a net new regression, but it appears only in a few targeted jobs and we don't yet know why.

      [sig-api-machinery][Feature:APIServer][Late] kubelet terminates kube-apiserver gracefully extended [Suite:openshift/conformance/parallel]

      Here is a targeted list of job runs failing only on this test

      Going far enough back in that list of jobs, this clearly started on Jan 15th, but seemingly only in a ROSA job (that job has been broken for weeks for other reasons, so it no longer shows up), plus a smattering of hits in multi-a-a (multiarch) jobs.

      The failures become more diverse around the 19th, when they start showing up in a number of more typical AWS/GCP jobs.

      Here are a number of jobs for investigation, but use the list above if you want more:

      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.22-e2e-aws-ovn-upgrade-fips/2013287064870588416

      • Logs
      • This one looks like it was during upgrade.

      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.22-e2e-aws-ovn-upgrade-fips-no-nat-instance/2013341758783492096

      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-aws-ovn-ocl/2013718712518971392

      • Logs
      • Logged around 22:20 - 22:34, once per apiserver.
      • Given Sergio's comment, is this operation triggering the rollout that fails to terminate the pod properly? Did their test do something wrong?

      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.22-e2e-gcp-custom-dns/2019029442176749568

      Note the mix of upgrade and non-upgrade jobs. Why would we be getting apiserver termination in non-upgrade jobs?

      A number of presubmits are also failing very frequently on this test, as are a number of other specific presubmit jobs.

      The test unfortunately doesn't log WHEN it observed the problem, but this data is visible in the kube api logs it scraped. Go to Debug Tools > Loki and then append this to the search string:

       |~ "terminate gracefully"
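      If the scraped logs are downloaded instead of being viewed in Loki, the same filter can be approximated offline to recover when each event was logged. The following is a minimal sketch, not the test's own logic: it assumes each log line starts with an RFC 3339 timestamp, and the helper name `find_termination_events` and the sample lines are made up for illustration.

```python
import re
from typing import Iterable, List, Tuple

# Assumption: each line begins with an RFC 3339 timestamp, then the message.
TS_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?)\s+(?P<msg>.*)$"
)
# Same expression as the Loki line filter: |~ "terminate gracefully"
PATTERN = re.compile(r"terminate gracefully")


def find_termination_events(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (timestamp, message) pairs for lines matching the filter.

    Lines without a recognizable timestamp are kept with "unknown" so that
    no match is silently dropped.
    """
    events = []
    for raw in lines:
        line = raw.strip()
        if not PATTERN.search(line):
            continue
        m = TS_LINE.match(line)
        if m:
            events.append((m.group("ts"), m.group("msg")))
        else:
            events.append(("unknown", line))
    return events


# Made-up sample lines, not taken from any real job run.
sample = [
    "2000-01-01T22:20:05Z made-up message: did not terminate gracefully",
    "2000-01-01T22:21:00Z unrelated line",
    "made-up line without timestamp: did not terminate gracefully",
]

for ts, msg in find_termination_events(sample):
    print(ts, msg)
```

      Sorting or grouping the resulting timestamps per apiserver would show whether the events line up with an upgrade or rollout window in a given job.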

              sgrunert@redhat.com Sascha Grunert
              openshift-trt OpenShift Technical Release Team