OpenShift Bugs / OCPBUGS-60557

Hypershift CPO making too many route53 calls and throttling AWS APIs


    • Quality / Stability / Reliability

      Description of problem:

      During a recent PerfScale run in Stage, we ran a capacity test that created more than 100 HCPs in a single AWS account. This triggered AWS API throttling ("Rate exceeded") errors, which prevented further cluster creation. After investigating with AWS, it appears that the control-plane-operator (CPO) is generating an excessive number of API calls, which causes the throttling.

      This card is to validate the issue and to explore improvements to the internal mechanisms so that they do not overload the cloud API.
      
      AWS responded:
      "can see an unusually high volume of ListHostedZones API calls being made by the IAM role: arn:aws:iam::415909267177:role/p4-alex2-0185-l4h0-kube-system-control-plane-operator"

      Cluster Service logs on new cluster creation:
      2025-08-14T21:07:54.66386601Z ERROR handler_helpers.go:249 [opid='e60cb7fd-6bbd-4e77-9032-95606c15781e'] Can't process 'POST /api/clusters_mgmt/v1/clusters': operation error Route 53: ListHostedZonesByName, exceeded maximum number of attempts, 8, https response error StatusCode: 400, RequestID: efa26dd0-b835-4460-bd03-ed6192ec4829, api error Throttling: Rate exceeded
      gitlab.cee.redhat.com/service/uhc-clusters-service/pkg/logging.(*Logger).Error

      Version-Release number of selected component (if applicable):

      4.19.7    

      How reproducible:

      Always

      Steps to Reproduce:

      1. Create 100 HCPs in the same AWS account
      2. Run `aws route53 list-hosted-zones | jq -r '.HostedZones[] | select(.Name|test("p4.."))'`
      3. The command returns no hosted zones and fails with the error `An error occurred (Throttling) when calling the ListHostedZones operation (reached max retries: 2): Rate exceeded`
           

      Actual results:

      The AWS API throttles requests and the AWS CLI does not respond. Cluster creation in that account fails, whether self-managed, ROSA Classic, or ROSA-HCP, and so do independent `aws route53 create-hosted-zone` requests.
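
      The account-wide impact is consistent with the Route 53 API request limit, which AWS documents as five requests per second per AWS account and which is shared by every caller in that account. As a rough sketch of the budget this leaves per HCP (assuming calls are spread evenly and ignoring other callers such as Cluster Service or the CLI; the function and constant names are illustrative only):

```go
package main

import "fmt"

// route53AccountLimit is the Route 53 API request limit documented by AWS,
// in requests per second, shared by all callers in one AWS account.
const route53AccountLimit = 5.0

// minSecondsBetweenCalls returns the average spacing each client must keep
// between its own calls so that `clients` callers sharing one account stay
// within limitPerSec. It assumes calls are spread evenly (no bursts).
func minSecondsBetweenCalls(clients int, limitPerSec float64) float64 {
	return float64(clients) / limitPerSec
}

func main() {
	for _, n := range []int{10, 50, 100} {
		fmt.Printf("%d HCPs: at most one Route 53 call per HCP every %.0f s\n",
			n, minSecondsBetweenCalls(n, route53AccountLimit))
	}
}
```

      Under these assumptions, 100 HCPs in one account leave each control-plane-operator a budget of about one Route 53 call every 20 seconds, before any other caller is counted.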

      Expected results:

      The HCP control-plane-operator should not make so many calls that it overloads the cloud API and causes throttling.
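
      One possible direction, shown here only as a minimal sketch and not as the actual control-plane-operator code, is to cache hosted zone lookups so that repeated reconciles reuse a recent result instead of calling Route 53 each time. The names `ZoneLookup`, `ZoneCache`, and `simulateReconciles` are hypothetical, and the lookup function is a stand-in for the real Route 53 call:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// ZoneLookup is a hypothetical stand-in for the Route 53 call that
// resolves a hosted zone name to its ID.
type ZoneLookup func(name string) (string, error)

type cachedZone struct {
	id      string
	expires time.Time
}

// ZoneCache remembers zone IDs for ttl so repeated lookups of the same
// name do not reach the upstream API until the entry expires.
type ZoneCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	lookup  ZoneLookup
	entries map[string]cachedZone
}

func NewZoneCache(ttl time.Duration, lookup ZoneLookup) *ZoneCache {
	return &ZoneCache{ttl: ttl, lookup: lookup, entries: map[string]cachedZone{}}
}

// ZoneID returns the cached ID when it is still fresh and otherwise calls
// the upstream lookup. The lock is held across the upstream call, so
// concurrent callers wait for one request instead of each issuing their own.
func (c *ZoneCache) ZoneID(name string) (string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if e, ok := c.entries[name]; ok && time.Now().Before(e.expires) {
		return e.id, nil
	}
	id, err := c.lookup(name)
	if err != nil {
		return "", err // errors are not cached; the next call retries upstream
	}
	c.entries[name] = cachedZone{id: id, expires: time.Now().Add(c.ttl)}
	return id, nil
}

// simulateReconciles runs n lookups of the same zone through a cache and
// returns how many of them reached the (fake) upstream API.
func simulateReconciles(n int) int {
	calls := 0
	cache := NewZoneCache(10*time.Minute, func(name string) (string, error) {
		calls++
		return "Z-EXAMPLE", nil // placeholder ID, not a real hosted zone
	})
	for i := 0; i < n; i++ {
		if _, err := cache.ZoneID("example.test."); err != nil {
			panic(err)
		}
	}
	return calls
}

func main() {
	fmt.Println("upstream calls:", simulateReconciles(100))
}
```

      The TTL trades freshness for API load: a longer TTL means fewer calls but a slower reaction to zone changes. Whether caching fits the operator's reconcile logic is part of what this card needs to investigate.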

      Additional info:

      It can also occur with fewer HCPs when created on a single AWS account, as the load appears to accumulate over time.

              Unassigned Unassigned
              mukrishn@redhat.com Murali Krishnasamy
              Jie Zhao Jie Zhao