  1. OpenShift Bugs
  2. OCPBUGS-54619

AWS installs failing about 6% of the time

    • Quality / Stability / Reliability
    • Critical
    • Approved

      (Feel free to update this bug's summary to be more specific.)
      Component Readiness has found a potential regression in the following test:

      install should succeed: overall

      Significant regression detected.
      Fisher's Exact probability of a regression: 100.00%.
      Test pass rate dropped from 100.00% to 94.08%.

      Sample (being evaluated) Release: 4.19
      Start Time: 2025-03-28T00:00:00Z
      End Time: 2025-04-04T08:00:00Z
      Success Rate: 94.08%
      Successes: 286
      Failures: 18
      Flakes: 0

      Base (historical) Release: 4.18
      Start Time: 2025-03-05T00:00:00Z
      End Time: 2025-04-04T08:00:00Z
      Success Rate: 100.00%
      Successes: 163
      Failures: 0
      Flakes: 0
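The figures above can be sanity-checked with a short script. This is a minimal sketch, assuming a one-sided Fisher's exact test on the 2x2 table of passes and failures; the function name `fisher_exact_one_sided` is hypothetical, and the result may not match exactly how Component Readiness computes its reported probability.

```python
from math import comb


def fisher_exact_one_sided(sample_pass, sample_fail, base_pass, base_fail):
    """Probability of seeing at least `sample_fail` failures in the sample,
    assuming sample and base share one failure rate (hypergeometric tail)."""
    n_sample = sample_pass + sample_fail
    n_base = base_pass + base_fail
    total_fail = sample_fail + base_fail
    denominator = comb(n_sample + n_base, total_fail)
    tail = 0
    # Sum over every table at least as extreme as the observed one.
    for k in range(sample_fail, min(total_fail, n_sample) + 1):
        tail += comb(n_sample, k) * comb(n_base, total_fail - k)
    return tail / denominator


# Counts taken from the report above.
sample_pass, sample_fail = 286, 18   # 4.19 sample
base_pass, base_fail = 163, 0        # 4.18 base

pass_rate = 100 * sample_pass / (sample_pass + sample_fail)
p_value = fisher_exact_one_sided(sample_pass, sample_fail, base_pass, base_fail)

print(f"sample pass rate: {pass_rate:.2f}%")
print(f"one-sided p-value: {p_value:.6f}")
```

A very small p-value here means the 18 failures are unlikely to be noise if 4.19 really had the same failure rate as 4.18.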

      View the test details report for additional context.

      On Slack, Patrick found that the actual error appears to be surfaced in files like this: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.19-upgrade-from-stable-4.18-e2e-aws-ovn-upgrade/1907760804069904384/artifacts/e2e-aws-ovn-upgrade/ipi-install-install-stableinitial/artifacts/clusterapi_output-1743681818/AWSCluster-openshift-cluster-api-guests-ci-op-b08blp8s-b8bab-49sx7.yaml

        - type: LoadBalancerReady
          status: "False"
          severity: Warning
          lasttransitiontime: "2025-04-03T12:02:25Z"
          reason: LoadBalancerFailed
          message: "[unexpected aws error: Throttling: Rate exceeded\n\tstatus code: 400,
            request id: de6c7cff-9714-45e2-8ffd-579ca1173ae3, unexpected aws error: Throttling:
            Rate exceeded\n\tstatus code: 400, request id: e7fe622a-ee9f-496e-aef9-663a58fa34e3]"
      
        - type: VpcEndpointsReadyCondition
          status: "False"
          severity: Warning
          lasttransitiontime: "2025-04-03T12:02:03Z"
          reason: VpcEndpointsReconciliationFailed
          message: "failed to create vpc endpoint for service \"com.amazonaws.us-west-2.s3\":
            VpcEndpointLimitExceeded: The maximum number of VPC endpoints has been reached.\n\tstatus
            code: 400, request id: 22727f22-ef7b-49e0-a5ec-abb7bc5f2766"
      

      This appears to indicate overloaded AWS accounts, assuming it is not caused by a code change new to 4.19. I think I see it happening in past releases, but it was not present in 4.18 at the time of GA, which is why Component Readiness is flagging it.
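To triage failed runs in bulk, the condition messages could be matched against the two AWS error codes quoted above. This is a rough sketch only: the mapping `ACCOUNT_PRESSURE_CODES` and the helper `classify_condition_message` are hypothetical names, and the list of codes is limited to the two seen in this bug rather than being a complete set.

```python
import re

# Hypothetical mapping: AWS error codes seen in the conditions above,
# each paired with a coarse cause that points at account pressure.
ACCOUNT_PRESSURE_CODES = {
    "Throttling": "API rate limit",
    "VpcEndpointLimitExceeded": "VPC endpoint quota",
}


def classify_condition_message(message):
    """Return the account-pressure causes whose error code appears in message."""
    causes = []
    for code, cause in ACCOUNT_PRESSURE_CODES.items():
        # Match the code as a whole word so partial names do not count.
        if re.search(r"\b" + re.escape(code) + r"\b", message):
            causes.append(cause)
    return causes


throttled = "unexpected aws error: Throttling: Rate exceeded\n\tstatus code: 400"
endpoint_limit = (
    "failed to create vpc endpoint for service: "
    "VpcEndpointLimitExceeded: The maximum number of VPC endpoints has been reached."
)

print(classify_condition_message(throttled))
print(classify_condition_message(endpoint_limit))
```

An empty result would suggest the failure is something other than account throttling or quota exhaustion and needs a closer look.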

      Because the regression is right around 5%, and a test is only marked red at -5%, this has the potential to appear and disappear on the board. However, it must be addressed, or at the end of the release we could end up having to justify an intentional regression and explain why it was known but not fixed. While it may not turn out to be a product issue, if we cannot install we cannot test, so it is still very important to solve.

      Test Platform may be able to help rebalance jobs across AWS accounts, or possibly add new accounts. This specific job seems to run a lot in 4.19; perhaps even moving this single job to another AWS account would help.

      It would be excellent to see the failure output in the installer output, both for customers and for TRT. Patrick reports that there is a card for this that may be prioritized this sprint: https://issues.redhat.com/browse/CORS-3682

              Unassigned
              Devan Goodwin (rhn-engineering-dgoodwin)
              Gaoyun Pei