Bug
Resolution: Unresolved
Normal
4.19
Description of problem:
ROSA HCP backups (based on Hypershift OADP) finish with a PartiallyFailed status, for example:
Running CMD: oc get backup 2n4sosf8o20lcbmfd5t4f57tle3jcila-bkp-test-20251212135523 -n openshift-adp -o jsonpath='{.status.phase}' --kubeconfig /Users/emathias/GitLab/ocm-backend-tests/output/.datainfo/kube-configs/2lvei7ilo7094t3embmrfbsh33q1vgpp.kubeconfig
Got STDOUT: PartiallyFailed
Current Phase: PartiallyFailed (Time elapsed: 27m30s)
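For triage it helps to pull more than the phase alone. A hedged way to do that (assuming the standard Velero Backup CR in openshift-adp; the errors/warnings counters are part of the upstream Backup status) is:
oc get backup <backup-name> -n openshift-adp -o jsonpath='{.status.errors} {.status.warnings}{"\n"}'
oc get backup <backup-name> -n openshift-adp -o yaml
If the velero CLI is available, velero backup logs <backup-name> -n openshift-adp lists the individual item errors behind the PartiallyFailed phase.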
Version-Release number of selected component (if applicable):
ROSA HCP in integration:main
How reproducible:
The problem is observed both in manual backups and in backups triggered from ROSA pipelines.
Steps to Reproduce:
1. Trigger a backup for a ROSA HCP cluster (see the sketch below).
2. Wait until the backup finishes.
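For reference, a minimal manual trigger could look like the sketch below. The backup name, included namespaces and storage location are placeholders (not what the pipeline uses), and a real ROSA HCP backup typically selects more resources:
cat <<EOF | oc apply -f -
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: hcp-backup-test
  namespace: openshift-adp
spec:
  includedNamespaces:
  - clusters
  - clusters-<hosted-cluster-name>
  storageLocation: default
EOF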
Actual results:
The backup ends in PartiallyFailed.
Expected results:
The backup completes successfully.
Additional info:
# From Lawton's investigation
From what I can see in the logs, it looks like HostedCluster.spec.pausedUntil gets set to false during backup cleanup.
The backup then waits for the HyperShift Operator (HO) to reconcile and propagate this change, with a 2-minute timeout for the propagation.
However, the operator is rate-limited and slow to reconcile, so the HostedCluster (HC) does not get updated in time.
The backup itself technically succeeded but failed during cleanup, when the HO tried to unpause the cluster.
The pausing/unpausing of the HC happens promptly, but the operator takes a while to reconcile the pause/unpause down to the HostedControlPlane (HCP). It still looks like a client-side load issue to me.
I can see the queue; things get backlogged.
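A quick way to watch the propagation that hits the timeout (cluster name and namespaces below are placeholders for a default HyperShift layout, not the actual ROSA HCP naming) is to compare the pause field on the HC with the one on the HCP:
oc get hostedcluster <hc-name> -n clusters -o jsonpath='{.spec.pausedUntil}{"\n"}'
oc get hostedcontrolplane <hc-name> -n clusters-<hc-name> -o jsonpath='{.spec.pausedUntil}{"\n"}'
When the operator is backlogged, the HC value flips immediately while the HCP value lags until the reconcile finally runs.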
lmizell@compu-p1:~/Development/ocm/hack$ KUBECONFIG=/home/lmizell/Development/ocm/hack/kube_config_mc oc exec -n hypershift deployment/operator -- curl -s localhost:9000/metrics | grep workqueue_depth
Defaulted container "operator" out of: operator, init-environment (init)
# HELP workqueue_depth Current depth of workqueue
# TYPE workqueue_depth gauge
workqueue_depth{controller="DedicatedServingComponentSchedulerAndSizer",name="DedicatedServingComponentSchedulerAndSizer"} 0
workqueue_depth{controller="MachineSetDescaler",name="MachineSetDescaler"} 0
workqueue_depth{controller="NonRequestServingNodeAutoscaler",name="NonRequestServingNodeAutoscaler"} 0
workqueue_depth{controller="PlaceholderScheduler.Creator",name="PlaceholderScheduler.Creator"} 0
workqueue_depth{controller="PlaceholderScheduler.Updater",name="PlaceholderScheduler.Updater"} 0
workqueue_depth{controller="RequestServingNodeAutoscaler",name="RequestServingNodeAutoscaler"} 0
workqueue_depth{controller="ResourceBasedControlPlaneAutoscaler",name="ResourceBasedControlPlaneAutoscaler"} 24
workqueue_depth{controller="awsendpointservice",name="awsendpointservice"} 0
workqueue_depth{controller="configmap",name="configmap"} 0
workqueue_depth{controller="hostedcluster",name="hostedcluster"} 13
workqueue_depth{controller="hostedclustersizing",name="hostedclustersizing"} 15
workqueue_depth{controller="hostedclustersizingvalidator",name="hostedclustersizingvalidator"} 0
workqueue_depth{controller="nodepool",name="nodepool"} 44
workqueue_depth{controller="proxy",name="proxy"} 0
workqueue_depth{controller="secret",name="secret"} 1922
# From Eric's tests
I think that during the 2-minute window I was watching, it made over 600 API requests and then timed out.
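To put a number on the request volume from the operator side, the client-go request counters on the same metrics endpoint can be sampled (same pattern as the workqueue check above; rest_client_requests_total is the standard client-go metric name, assuming it is exposed here):
oc exec -n hypershift deployment/operator -- curl -s localhost:9000/metrics | grep rest_client_requests_total
Sampling it at the start and end of the 2-minute window gives a rough requests-per-second figure to compare against the operator's client-side QPS/burst limits.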
# From Juanma's tests
I have already observed this behavior in my tests:
https://redhat-internal.slack.com/archives/C089VJ638AY/p1765568823545609?thread_ts=1765543655.257039&cid=C089VJ638AY