Bug
Resolution: Unresolved
Normal
4.19.z, 4.20
None
Description of problem:
The problem was spotted in the footprint-and-performance nightly jobs. Sometimes, the healthcheck fails with "exceeded its progress deadline" for some Deployments almost immediately after starting the check.
---------
01:40:08.768240 2931 workloads.go:95] Waiting 10m0s for deployment/service-ca in openshift-service-ca
01:40:08.781351 2931 workloads.go:130] Failed waiting for deployment/service-ca in openshift-service-ca: deployment "service-ca" exceeded its progress deadline
01:40:08.771696 2931 workloads.go:95] Waiting 10m0s for deployment/csi-snapshot-controller in kube-system
01:40:08.781421 2931 workloads.go:130] Failed waiting for deployment/csi-snapshot-controller in kube-system: deployment "csi-snapshot-controller" exceeded its progress deadline
01:40:08.768227 2931 workloads.go:95] Waiting 10m0s for deployment/router-default in openshift-ingress
01:40:08.781574 2931 workloads.go:130] Failed waiting for deployment/router-default in openshift-ingress: deployment "router-default" exceeded its progress deadline
01:50:09.914921 11490 workloads.go:95] Waiting 10m0s for deployment/kserve-controller-manager in redhat-ods-applications
01:50:09.922380 11490 workloads.go:130] Failed waiting for deployment/kserve-controller-manager in redhat-ods-applications: deployment "kserve-controller-manager" exceeded its progress deadline
--------
That error occurs (and short-circuits the healthcheck) when the Deployment's "Progressing" condition is False with reason "ProgressDeadlineExceeded". These Deployments use the default value of "progressDeadlineSeconds", which is 600. A reboot of the bare metal node in AWS takes around 15 minutes. The Deployment ends up with that condition (Progressing: False + ProgressDeadlineExceeded) because the time between now and the last update of the Progressing condition exceeded the 600-second deadline during the reboot.
The solution is to remove the short-circuit exit on that error: ignore the condition and let the healthcheck use the whole time it was given to wait for the Deployment. The condition is transient or accidental; shortly afterwards the Deployment is progressing again, and by the time the SOS report is collected (a couple of minutes later) MicroShift is healthy.
Version-Release number of selected component (if applicable):
MicroShift 4.19 and 4.20
How reproducible:
Low
Steps to Reproduce:
1. Start fresh MicroShift.
2. Give it some time to create the Deployments, but probably not long enough for MicroShift to become ready.
3. Shut down the machine for more than 10 minutes.
4. Start the machine.
5. Watch the greenboot-healthcheck.
Actual results:
greenboot-healthcheck fails because one of MicroShift's healthchecks fails with the error `deployment "service-ca" exceeded its progress deadline` almost immediately after the healthcheck starts.
Expected results:
The healthcheck runs as long as needed, within the specified timeout, to assert that the platform is ready.
Additional info:
- blocks: OCPBUGS-59301 microshift healthcheck erroneously detects deployment progress timeout in certain conditions (Closed)
- is cloned by: OCPBUGS-59301 microshift healthcheck erroneously detects deployment progress timeout in certain conditions (Closed)
- links to: RHEA-2025:10667 Red Hat build of MicroShift 4.20.0 bug fix and enhancement update