Issue metadata (from the tracker): Bug, Resolution: Done, Normal, 4.19.0, Quality / Stability / Reliability, Critical, Approved
(Feel free to update this bug's summary to be more specific.)
Component Readiness has found a potential regression in the following test:
install should succeed: overall
Significant regression detected.
Fishers Exact probability of a regression: 100.00%.
Test pass rate dropped from 100.00% to 94.08%.
Sample (being evaluated) Release: 4.19
Start Time: 2025-03-28T00:00:00Z
End Time: 2025-04-04T08:00:00Z
Success Rate: 94.08%
Successes: 286
Failures: 18
Flakes: 0
Base (historical) Release: 4.18
Start Time: 2025-03-05T00:00:00Z
End Time: 2025-04-04T08:00:00Z
Success Rate: 100.00%
Successes: 163
Failures: 0
Flakes: 0
View the test details report for additional context.
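For context on the statistic above, the signal can be roughly reproduced from the raw pass/fail counts. A minimal sketch using scipy; the exact tail choice, flake handling, and thresholds Component Readiness applies may differ:

```python
# Sketch: reproduce the Fisher's exact regression signal from the counts above.
# Assumes scipy is available; Component Readiness' exact methodology may differ.
from scipy.stats import fisher_exact

# 2x2 contingency table: rows = (4.19 sample, 4.18 base), cols = (successes, failures)
table = [[286, 18],
         [163, 0]]

# One-sided test: is the sample's success rate lower than the base's?
_, p_value = fisher_exact(table, alternative="less")

sample_rate = 286 / (286 + 18)   # ~94.08%
base_rate = 163 / 163            # 100.00%
drop = base_rate - sample_rate   # ~5.92%, just past the 5% "red" threshold noted below

print(f"sample pass rate: {sample_rate:.2%}, drop: {drop:.2%}")
print(f"approximate probability of a regression: {1 - p_value:.2%}")
```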
On Slack, Patrick found that the actual error appears to be surfaced in files like this one: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.19-upgrade-from-stable-4.18-e2e-aws-ovn-upgrade/1907760804069904384/artifacts/e2e-aws-ovn-upgrade/ipi-install-install-stableinitial/artifacts/clusterapi_output-1743681818/AWSCluster-openshift-cluster-api-guests-ci-op-b08blp8s-b8bab-49sx7.yaml
- type: LoadBalancerReady
  status: "False"
  severity: Warning
  lasttransitiontime: "2025-04-03T12:02:25Z"
  reason: LoadBalancerFailed
  message: "[unexpected aws error: Throttling: Rate exceeded\n\tstatus code: 400, request id: de6c7cff-9714-45e2-8ffd-579ca1173ae3, unexpected aws error: Throttling: Rate exceeded\n\tstatus code: 400, request id: e7fe622a-ee9f-496e-aef9-663a58fa34e3]"
- type: VpcEndpointsReadyCondition
  status: "False"
  severity: Warning
  lasttransitiontime: "2025-04-03T12:02:03Z"
  reason: VpcEndpointsReconciliationFailed
  message: "failed to create vpc endpoint for service \"com.amazonaws.us-west-2.s3\": VpcEndpointLimitExceeded: The maximum number of VPC endpoints has been reached.\n\tstatus code: 400, request id: 22727f22-ef7b-49e0-a5ec-abb7bc5f2766"
This would appear to indicate overloaded AWS accounts, assuming it is not caused by a code change new to 4.19. I think I see it happening in past releases as well, but it was not happening in 4.18 at the time of GA, which is why Component Readiness is flagging it now.
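If account overload is the suspicion, one quick sanity check is to count VPC endpoints in the CI account against the quota behind the VpcEndpointLimitExceeded error above. A minimal sketch assuming boto3 and credentials for the affected account; the us-west-2 region is taken from the error message, and the actual quota values should be looked up in the account's Service Quotas rather than assumed:

```python
# Sketch: count VPC endpoints in the CI account to see how close it sits to the
# quota behind the VpcEndpointLimitExceeded error. Assumes boto3 credentials for
# the affected account; region taken from the error message above.
from collections import Counter
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

endpoints = []
for page in ec2.get_paginator("describe_vpc_endpoints").paginate():
    endpoints.extend(page["VpcEndpoints"])

# Break the total down by VPC and endpoint type, since limits differ by type/scope.
by_vpc_and_type = Counter((e["VpcId"], e["VpcEndpointType"]) for e in endpoints)

print(f"total VPC endpoints in region: {len(endpoints)}")
for (vpc_id, ep_type), count in by_vpc_and_type.most_common():
    print(f"  {vpc_id} ({ep_type}): {count}")
# Compare these counts against the account's actual Service Quotas values;
# the relevant limits are intentionally not hard-coded here.
```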
Because this is right around a 5% regression, and we only mark a component red at -5%, it has the potential to appear and disappear on the board. It still needs to be addressed, though, or we could end up having to justify an intentional regression at the end of the release and explain why it was known but not fixed. While it may not turn out to be a product issue, if we cannot install we cannot test, so it is still very important to get it solved.
Test Platform may be able to help rebalance jobs across AWS accounts, or possibly add new accounts. This specific job seems to run a lot in 4.19; even moving this single job to another AWS account might help.
Surfacing this failure output in the installer output would be excellent for both customers and TRT. Patrick reports there is a card for that which may be prioritized this sprint: https://issues.redhat.com/browse/CORS-3682
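As a rough illustration of what surfacing this could look like (not necessarily the approach CORS-3682 will take), a minimal sketch that pulls the failing conditions out of a gathered AWSCluster artifact like the one linked above, assuming the conditions live under .status.conditions with the lowercase keys seen in that file; the script name and artifact path are hypothetical:

```python
# Sketch: extract non-True conditions from a gathered AWSCluster artifact so the
# underlying AWS error (throttling, endpoint limits, ...) is visible at a glance.
# Hypothetical tooling; not the CORS-3682 implementation.
import sys
import yaml

def failing_conditions(path):
    with open(path) as f:
        obj = yaml.safe_load(f)
    for cond in obj.get("status", {}).get("conditions", []):
        if cond.get("status") != "True":
            yield cond

if __name__ == "__main__":
    # Usage: python failing_conditions.py AWSCluster-<cluster>.yaml
    for cond in failing_conditions(sys.argv[1]):
        print(f"{cond.get('type')}: reason={cond.get('reason')}")
        if cond.get("message"):
            print(f"  {cond['message']}")
```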