  1. OpenShift Bugs
  2. OCPBUGS-59301

microshift healthcheck erroneously detects deployment progress timeout in certain conditions


    • Quality / Stability / Reliability
    • False
    • 1
    • Moderate
    • None
    • None
    • None
    • uShift Sprint 274
    • 1
    • In Progress
    • Bug Fix
      *Cause*: The MicroShift host is shut down for more than 10 minutes.
      *Consequence*: On startup, the healthcheck could erroneously fail because of faulty Deployment progression logic.
      *Fix*: The faulty logic was removed.
      *Result*: The bug no longer occurs.
    • None
    • None
    • None
    • None

      This is a clone of issue OCPBUGS-59175. The following is the description of the original issue:

      Description of problem:

      The problem was spotted in the footprint-and-performance nightly jobs.
      Sometimes, the healthcheck would fail with "exceeded its progress deadline" for some Deployments immediately after starting the check.
      
      ---------
      01:40:08.768240    2931 workloads.go:95] Waiting 10m0s for deployment/service-ca in openshift-service-ca
      01:40:08.781351    2931 workloads.go:130] Failed waiting for deployment/service-ca in openshift-service-ca: deployment "service-ca" exceeded its progress deadline
      01:40:08.771696    2931 workloads.go:95] Waiting 10m0s for deployment/csi-snapshot-controller in kube-system
      01:40:08.781421    2931 workloads.go:130] Failed waiting for deployment/csi-snapshot-controller in kube-system: deployment "csi-snapshot-controller" exceeded its progress deadline
      01:40:08.768227    2931 workloads.go:95] Waiting 10m0s for deployment/router-default in openshift-ingress
      01:40:08.781574    2931 workloads.go:130] Failed waiting for deployment/router-default in openshift-ingress: deployment "router-default" exceeded its progress deadline
      01:50:09.914921   11490 workloads.go:95] Waiting 10m0s for deployment/kserve-controller-manager in redhat-ods-applications
      01:50:09.922380   11490 workloads.go:130] Failed waiting for deployment/kserve-controller-manager in redhat-ods-applications: deployment "kserve-controller-manager" exceeded its progress deadline
      --------
      
      
      That error occurs (and short-circuits the healthcheck) when the Deployment's "Progressing" condition is false with reason "ProgressDeadlineExceeded".
      These Deployments use the default "progressDeadlineSeconds" value, which is 600.
      A reboot of the bare metal node in AWS takes around 15 minutes.
      
      The Deployment gets that condition (Progressing: false, reason ProgressDeadlineExceeded) because, due to the reboot, the time between now and the last update of the Progressing condition is greater than the deadline (600s).
      
      The solution is to remove the short-circuit exit on that error: ignore the condition and let the healthcheck wait for the Deployment for the full time it was given.
      
      The condition is transient: shortly afterwards the Deployment is progressing again, and by the time the SOS report is collected (a couple of minutes later), MicroShift is healthy.
      
      

      Version-Release number of selected component (if applicable):

      MicroShift 4.19 and 4.20

      How reproducible:

      Low

      Steps to Reproduce:

      1. Start fresh MicroShift
      2. Wait long enough for the Deployments to be created, but probably not long enough for MicroShift to become ready.
      3. Shut down the machine for more than 10 minutes.
      4. Start the machine
      5. Watch the greenboot-healthcheck

      Actual results:

      greenboot-healthcheck fails because one of MicroShift's healthchecks fails with the error `deployment "service-ca" exceeded its progress deadline` almost immediately after the healthcheck starts.

      Expected results:

      The healthcheck runs for as long as needed, within the specified timeout, to assert that the platform is ready.

      Additional info:

       

              pmatusza@redhat.com Patryk Matuszak
              pmatusza@redhat.com Patryk Matuszak
              None
              None
              Rama Kasturi Narra
              None