OpenShift Bugs / OCPBUGS-69394

HyperShift backups fail due to API timeouts


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Normal
    • Affects Version/s: 4.19
    • Component/s: HyperShift

      Description of problem:

      ROSA HCP backups (based on HyperShift OADP) finish in the PartiallyFailed phase. For example:
      Running CMD: oc get backup 2n4sosf8o20lcbmfd5t4f57tle3jcila-bkp-test-20251212135523 -n openshift-adp -o jsonpath='{.status.phase}' --kubeconfig /Users/emathias/GitLab/ocm-backend-tests/output/.datainfo/kube-configs/2lvei7ilo7094t3embmrfbsh33q1vgpp.kubeconfig 
        Got STDOUT: PartiallyFailed
           Current Phase: PartiallyFailed (Time elapsed: 27m30s)

      Version-Release number of selected component (if applicable):

      ROSA HCP in integration:main

      How reproducible:

      The problem is observed both in manual backups and in ROSA pipeline runs.

      Steps to Reproduce:

      1. Trigger a backup for ROSA HCP
      2. Wait until the backup is finished.    
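      Step 2 can be scripted; a minimal sketch, assuming a Velero/OADP Backup CR in the openshift-adp namespace (the `wait_for_backup` helper name and its timeout defaults are illustrative, not part of OADP or the ROSA tooling):

      ```shell
      # Poll a Velero Backup CR until it reaches a terminal phase or times out.
      # wait_for_backup is an illustrative helper, not an OADP command.
      wait_for_backup() {
        local name=$1 timeout=${2:-1800} interval=${3:-30} elapsed=0 phase=""
        while [ "$elapsed" -lt "$timeout" ]; do
          phase=$(oc get backup "$name" -n openshift-adp -o jsonpath='{.status.phase}')
          case "$phase" in
            Completed|PartiallyFailed|Failed)
              echo "$phase"
              return 0
              ;;
          esac
          sleep "$interval"
          elapsed=$((elapsed + interval))
        done
        echo "Timeout after ${timeout}s (last phase: ${phase})"
        return 1
      }
      ```

      Completed, PartiallyFailed, and Failed are the terminal Backup phases reported by Velero; anything else means the backup is still in progress.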

      Actual results:

      The backup ends in the PartiallyFailed phase.
      

      Expected results:

      Backup successful.
      

      Additional info:

      # From Lawton's investigation
      From what I can see in the logs, HostedCluster.spec.pausedUntil gets set to false.

      The backup flow then waits for the HyperShift Operator (HO) to reconcile and propagate this change, with a 2-minute timeout for the propagation.

      However, the operator is rate-limited and slow to reconcile, so the HostedCluster does not get updated in time.

      The backup technically succeeded but failed during cleanup, when the HO tried to unpause the cluster.

      Pausing/unpausing the HC itself happens promptly, but the operator takes a while to reconcile the pause/unpause down to the HCP. It still looks like a client-side load issue to me.
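      The propagation described above can be checked directly; a sketch, assuming the usual HyperShift layout where the HostedControlPlane lives in the `<hc-namespace>-<hc-name>` namespace (the `pause_propagated` helper and its variables are illustrative):

      ```shell
      # Compare spec.pausedUntil on the HostedCluster vs. the HostedControlPlane.
      # Succeeds (exit 0) once the HyperShift operator has propagated the value.
      pause_propagated() {
        local hc_name=$1 hc_ns=$2
        local hc hcp
        hc=$(oc get hostedcluster "$hc_name" -n "$hc_ns" \
          -o jsonpath='{.spec.pausedUntil}')
        hcp=$(oc get hostedcontrolplane "$hc_name" -n "${hc_ns}-${hc_name}" \
          -o jsonpath='{.spec.pausedUntil}')
        [ "$hc" = "$hcp" ]
      }
      ```

      Running this in a loop during the 2-minute window would show how long the operator actually takes to reconcile the field down to the HCP.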
      
      I can see the queue. Things get backlogged.
      lmizell@compu-p1:~/Development/ocm/hack$ KUBECONFIG=/home/lmizell/Development/ocm/hack/kube_config_mc oc exec -n hypershift deployment/operator -- curl -s localhost:9000/metrics | grep workqueue_depth
      Defaulted container "operator" out of: operator, init-environment (init)
      # HELP workqueue_depth Current depth of workqueue
      # TYPE workqueue_depth gauge
      workqueue_depth{controller="DedicatedServingComponentSchedulerAndSizer",name="DedicatedServingComponentSchedulerAndSizer"} 0
      workqueue_depth{controller="MachineSetDescaler",name="MachineSetDescaler"} 0
      workqueue_depth{controller="NonRequestServingNodeAutoscaler",name="NonRequestServingNodeAutoscaler"} 0
      workqueue_depth{controller="PlaceholderScheduler.Creator",name="PlaceholderScheduler.Creator"} 0
      workqueue_depth{controller="PlaceholderScheduler.Updater",name="PlaceholderScheduler.Updater"} 0
      workqueue_depth{controller="RequestServingNodeAutoscaler",name="RequestServingNodeAutoscaler"} 0
      workqueue_depth{controller="ResourceBasedControlPlaneAutoscaler",name="ResourceBasedControlPlaneAutoscaler"} 24
      workqueue_depth{controller="awsendpointservice",name="awsendpointservice"} 0
      workqueue_depth{controller="configmap",name="configmap"} 0
      workqueue_depth{controller="hostedcluster",name="hostedcluster"} 13
      workqueue_depth{controller="hostedclustersizing",name="hostedclustersizing"} 15
      workqueue_depth{controller="hostedclustersizingvalidator",name="hostedclustersizingvalidator"} 0
      workqueue_depth{controller="nodepool",name="nodepool"} 44
      workqueue_depth{controller="proxy",name="proxy"} 0
      workqueue_depth{controller="secret",name="secret"} 1922
      
      # From Eric's tests
      I think during the 2-minute window I was following, it made over 600 API requests and then timed out.
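      For context, that number can be sanity-checked: 600 requests over the 2-minute window is about 5 requests per second, which happens to match client-go's default client-side QPS limit of 5 (burst 10). The correlation is my own back-of-envelope, not confirmed from the logs:

      ```shell
      # Back-of-envelope request rate during the observed window.
      requests=600
      window_seconds=120
      echo "$((requests / window_seconds)) req/s"   # prints: 5 req/s
      ```

      A sustained rate pinned at exactly the default QPS would be consistent with the client-side rate limiting suspected above.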
      
      # From Juanma's tests
      I've already observed that behavior in my tests.
      
      https://redhat-internal.slack.com/archives/C089VJ638AY/p1765568823545609?thread_ts=1765543655.257039&cid=C089VJ638AY
      

              jparrill@redhat.com Juan Manuel Parrilla Madrid
              lponce@redhat.com Lucas Ponce
              Ge Liu
              Votes: 0
              Watchers: 5