Bug
Resolution: Unresolved
Normal
4.19.z, 4.20
None
Description of problem:
The problem was spotted in the footprint-and-performance nightly jobs. Sometimes, the healthcheck fails with "exceeded its progress deadline" for some Deployments almost immediately after starting the check.
---------
01:40:08.768240 2931 workloads.go:95] Waiting 10m0s for deployment/service-ca in openshift-service-ca
01:40:08.781351 2931 workloads.go:130] Failed waiting for deployment/service-ca in openshift-service-ca: deployment "service-ca" exceeded its progress deadline
01:40:08.771696 2931 workloads.go:95] Waiting 10m0s for deployment/csi-snapshot-controller in kube-system
01:40:08.781421 2931 workloads.go:130] Failed waiting for deployment/csi-snapshot-controller in kube-system: deployment "csi-snapshot-controller" exceeded its progress deadline
01:40:08.768227 2931 workloads.go:95] Waiting 10m0s for deployment/router-default in openshift-ingress
01:40:08.781574 2931 workloads.go:130] Failed waiting for deployment/router-default in openshift-ingress: deployment "router-default" exceeded its progress deadline
01:50:09.914921 11490 workloads.go:95] Waiting 10m0s for deployment/kserve-controller-manager in redhat-ods-applications
01:50:09.922380 11490 workloads.go:130] Failed waiting for deployment/kserve-controller-manager in redhat-ods-applications: deployment "kserve-controller-manager" exceeded its progress deadline
--------
That error occurs (and short-circuits the healthcheck) when the Deployment's "Progressing" condition is False with reason "ProgressDeadlineExceeded". These Deployments use the default value of "progressDeadlineSeconds", which is 600. A reboot of the bare metal node in AWS takes around 15 minutes. The Deployment ends up with that condition (Progressing: False + ProgressDeadlineExceeded) because the time between now and the last update of the Progressing condition exceeded the 600-second deadline during the reboot.
The solution is to remove the short-circuit exit on that error: ignore the condition and let the healthcheck use the whole time it was given to wait for the Deployment. The condition is transient or accidental; shortly afterwards the Deployment is progressing again, and by the time the SOS report is collected (a couple of minutes later) MicroShift is healthy.
Version-Release number of selected component (if applicable):
MicroShift 4.19 and 4.20
How reproducible:
Low
Steps to Reproduce:
1. Start fresh MicroShift.
2. Give it some time to create the Deployments, but probably not long enough for MicroShift to become ready.
3. Shut down the machine for more than 10 minutes.
4. Start the machine.
5. Watch the greenboot-healthcheck.
Actual results:
greenboot-healthcheck fails because one of MicroShift's healthchecks fails with the error `deployment "service-ca" exceeded its progress deadline` almost immediately after the healthcheck starts.
Expected results:
The healthcheck runs as long as needed, within the specified timeout, to assert that the platform is ready.
Additional info:
- blocks: OCPBUGS-59301 microshift healthcheck erroneously detects deployment progress timeout in certain conditions (Closed)
- is cloned by: OCPBUGS-59301 microshift healthcheck erroneously detects deployment progress timeout in certain conditions (Closed)
- links to: RHEA-2025:10667 Red Hat build of MicroShift 4.20.0 bug fix and enhancement update