Bug
Resolution: Unresolved
Normal
4.19
Description of problem:
ROSA HCP backups (based on Hypershift OADP) finish with a PartiallyFailed status, for example:
Running CMD: oc get backup 2n4sosf8o20lcbmfd5t4f57tle3jcila-bkp-test-20251212135523 -n openshift-adp -o jsonpath='{.status.phase}' --kubeconfig /Users/emathias/GitLab/ocm-backend-tests/output/.datainfo/kube-configs/2lvei7ilo7094t3embmrfbsh33q1vgpp.kubeconfig
Got STDOUT: PartiallyFailed
Current Phase: PartiallyFailed (Time elapsed: 27m30s)
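For triage it helps to pull more than the phase alone. A hedged way to do that (assuming the standard Velero Backup CR in openshift-adp; the errors/warnings counters are part of the upstream Backup status) is:
oc get backup <backup-name> -n openshift-adp -o jsonpath='{.status.errors} {.status.warnings}{"\n"}'
oc get backup <backup-name> -n openshift-adp -o yaml
If the velero CLI is available, velero backup logs <backup-name> -n openshift-adp lists the individual item errors behind the PartiallyFailed phase.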
Version-Release number of selected component (if applicable):
ROSA HCP in integration:main
How reproducible:
The problem is observed both in manual backups and in backups triggered from ROSA pipelines.
Steps to Reproduce:
1. Trigger a backup for a ROSA HCP cluster (see the sketch below).
2. Wait until the backup finishes.
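For reference, a minimal manual trigger could look like the sketch below. The backup name, included namespaces and storage location are placeholders (not what the pipeline uses), and a real ROSA HCP backup typically selects more resources:
cat <<EOF | oc apply -f -
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: hcp-backup-test
  namespace: openshift-adp
spec:
  includedNamespaces:
  - clusters
  - clusters-<hosted-cluster-name>
  storageLocation: default
EOF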
Actual results:
The backup ends in PartiallyFailed.
Expected results:
The backup completes successfully.
Additional info:
# From Lawton's investigation
From what I can see in the logs, it looks like HostedCluster.spec.pausedUntil gets set to false during backup cleanup.
The backup then waits for the HyperShift Operator (HO) to reconcile and propagate this change, with a 2-minute timeout for the propagation.
However, the operator is rate-limited and slow to reconcile, so the HostedCluster (HC) does not get updated in time.
The backup itself technically succeeded but failed during cleanup, when the HO tried to unpause the cluster.
The pausing/unpausing of the HC happens promptly, but the operator takes a while to reconcile the pause/unpause down to the HostedControlPlane (HCP). It still looks like a client-side load issue to me.
I can see the queue; things get backlogged.
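A quick way to watch the propagation that hits the timeout (cluster name and namespaces below are placeholders for a default HyperShift layout, not the actual ROSA HCP naming) is to compare the pause field on the HC with the one on the HCP:
oc get hostedcluster <hc-name> -n clusters -o jsonpath='{.spec.pausedUntil}{"\n"}'
oc get hostedcontrolplane <hc-name> -n clusters-<hc-name> -o jsonpath='{.spec.pausedUntil}{"\n"}'
When the operator is backlogged, the HC value flips immediately while the HCP value lags until the reconcile finally runs.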
lmizell@compu-p1:~/Development/ocm/hack$ KUBECONFIG=/home/lmizell/Development/ocm/hack/kube_config_mc oc exec -n hypershift deployment/operator -- curl -s localhost:9000/metrics | grep workqueue_depth
Defaulted container "operator" out of: operator, init-environment (init)
# HELP workqueue_depth Current depth of workqueue
# TYPE workqueue_depth gauge
workqueue_depth{controller="DedicatedServingComponentSchedulerAndSizer",name="DedicatedServingComponentSchedulerAndSizer"} 0
workqueue_depth{controller="MachineSetDescaler",name="MachineSetDescaler"} 0
workqueue_depth{controller="NonRequestServingNodeAutoscaler",name="NonRequestServingNodeAutoscaler"} 0
workqueue_depth{controller="PlaceholderScheduler.Creator",name="PlaceholderScheduler.Creator"} 0
workqueue_depth{controller="PlaceholderScheduler.Updater",name="PlaceholderScheduler.Updater"} 0
workqueue_depth{controller="RequestServingNodeAutoscaler",name="RequestServingNodeAutoscaler"} 0
workqueue_depth{controller="ResourceBasedControlPlaneAutoscaler",name="ResourceBasedControlPlaneAutoscaler"} 24
workqueue_depth{controller="awsendpointservice",name="awsendpointservice"} 0
workqueue_depth{controller="configmap",name="configmap"} 0
workqueue_depth{controller="hostedcluster",name="hostedcluster"} 13
workqueue_depth{controller="hostedclustersizing",name="hostedclustersizing"} 15
workqueue_depth{controller="hostedclustersizingvalidator",name="hostedclustersizingvalidator"} 0
workqueue_depth{controller="nodepool",name="nodepool"} 44
workqueue_depth{controller="proxy",name="proxy"} 0
workqueue_depth{controller="secret",name="secret"} 1922
# From Eric's tests
I think that during the 2-minute window I was watching, it made over 600 API requests and then timed out.
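To put a number on the request volume from the operator side, the client-go request counters on the same metrics endpoint can be sampled (same pattern as the workqueue check above; rest_client_requests_total is the standard client-go metric name, assuming it is exposed here):
oc exec -n hypershift deployment/operator -- curl -s localhost:9000/metrics | grep rest_client_requests_total
Sampling it at the start and end of the 2-minute window gives a rough requests-per-second figure to compare against the operator's client-side QPS/burst limits.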
# From Juanma's tests
I have already observed this behavior in my tests:
https://redhat-internal.slack.com/archives/C089VJ638AY/p1765568823545609?thread_ts=1765543655.257039&cid=C089VJ638AY