Type: Bug
Resolution: Obsolete
Priority: Undefined
Fix Version/s: None
Affects Version/s: None
Labels:
None

Activity Type:
Quality / Stability / Reliability
Blocked:
False
Blocked Reason:
None
Story Points:
None
Severity:
None

Target Version:
None
Release Blocker:
None
Sprint:
None

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Release Note Status:
None
Release Note Type:
None
Release Note Text:
None

Our e2e-pool periodic failed this week. In this log we see the hibernation controller polling ClusterOperators on resume and repeatedly reporting:

time="2023-12-19T05:50:36.105Z" level=info msg="ClusterOperator is in undesired state" clusterDeployment=hiveci-65811c00-0-h8gtx/hiveci-65811c00-0-h8gtx clusterOperator=console condition=Degraded controller=hibernation reconcileID=sx9s45gv status=True
time="2023-12-19T05:50:36.105Z" level=info msg="ClusterOperator is in undesired state" clusterDeployment=hiveci-65811c00-0-h8gtx/hiveci-65811c00-0-h8gtx clusterOperator=console condition=Progressing controller=hibernation reconcileID=sx9s45gv status=True
time="2023-12-19T05:50:36.105Z" level=info msg="ClusterOperator is in undesired state" clusterDeployment=hiveci-65811c00-0-h8gtx/hiveci-65811c00-0-h8gtx clusterOperator=console condition=Available controller=hibernation reconcileID=sx9s45gv status=False
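For anyone triaging a similar log, here is a minimal sketch of how these lines could be summarized per operator and condition. It assumes the log is plain logfmt-style key=value text as shown above; the helper names `parse_logfmt` and `undesired_conditions` are hypothetical and not part of Hive.

```python
import shlex


def parse_logfmt(line):
    """Split a logfmt-style line into a dict of key/value pairs.

    shlex handles the double-quoted values (time="...", msg="...").
    """
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # skip any token that is not key=value
            fields[key] = value
    return fields


def undesired_conditions(lines):
    """Return (clusterOperator, condition, status) for each 'undesired state' message."""
    found = []
    for line in lines:
        fields = parse_logfmt(line)
        if fields.get("msg") == "ClusterOperator is in undesired state":
            found.append(
                (
                    fields.get("clusterOperator"),
                    fields.get("condition"),
                    fields.get("status"),
                )
            )
    return found
```

Feeding the three lines above through `undesired_conditions` would yield one tuple per condition for the console operator, which makes it easier to see whether the same conditions persist across reconciles.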

This card is a request for QE to attempt to recreate the problem. If reproducible, assign this bug to the console team in OCP engineering for a closer look.

Notes, in no particular order:

Here is the job configuration.
The spoke cluster was running 4.15.0-0.nightly-2023-12-18-220750 on AWS.
This job succeeded last week (and the previous 13 weeks), so if there's an actual bug, it was likely introduced fairly recently. However, it is demonstrably intermittent: Hive successfully hibernated and resumed once in this job, and the problem happened on the second resume. So it's possible that the problem has existed for some time and we were just lucky in earlier runs. You may need to repeat the hibernate/resume cycle multiple times to get a repro.
The issue is unlikely to be related to hive itself, as all hive is doing is shutting down the cloud instances and then starting them up again. To isolate the problem, consider installing without hive and using the cloud console to stop and start instances. (As such, it may be more appropriate for a different QE team – if so, please reassign accordingly.)
Hive gives up after 1200s (120 polls at 10s intervals). It's possible that console would have come back up if we waited longer, but 20m seems like long enough.
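To make the timeout arithmetic concrete, here is a rough sketch of a poll-with-budget loop: 120 checks with a 10s wait after each failed check, so 1200s in total before giving up. This is not Hive's actual implementation. The desired states (Available=True, Progressing=False, Degraded=False) are an assumption inferred from the log lines above, and `get_conditions` is a hypothetical callable returning a dict of condition name to status string.

```python
import time

# Assumed healthy state, inferred from the "undesired state" log lines.
DESIRED = {"Available": "True", "Progressing": "False", "Degraded": "False"}


def undesired(conditions):
    """Return the subset of conditions whose status differs from DESIRED."""
    return {
        name: status
        for name, status in conditions.items()
        if name in DESIRED and status != DESIRED[name]
    }


def wait_for_operator(get_conditions, attempts=120, interval=10.0, sleep=time.sleep):
    """Poll until no condition is undesired, or give up after attempts * interval seconds.

    Returns True if the operator reached the desired state, False on timeout.
    """
    for _ in range(attempts):
        if not undesired(get_conditions()):
            return True
        sleep(interval)  # wait before the next check
    return False
```

A QE repro script could call `wait_for_operator` with a larger `attempts` value to find out whether console eventually recovers beyond the 20m budget.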

relates to

OCPBUGS-25976 console operator takes too long to clean up failed status

Verified

Assignee: Jianping Shu
Reporter: Eric Fried
Need Info From: None
Contributors: None
Architect: None
QA Contact: Jianping Shu
Doc Contact: None
Votes: 0
Watchers: 3
Created: 2023/12/20 5:23 PM
Updated: 2025/07/29 11:52 PM
Resolved: 2024/01/19 12:02 AM
