Bug
Resolution: Unresolved
Major
4.18.z
Quality / Stability / Reliability
False
Important
Description of problem:
In a private ARO cluster with UDR (and possibly other cloud deployments), CNCC fails to reassign EIPs whose assignment (for whatever reason) triggers an error response from the underlying cloud provider. The affected CloudPrivateIPConfig objects remain in the CloudResponseError state indefinitely instead of being redistributed to other available egress-assignable nodes.
Version-Release number of selected component (if applicable):
How reproducible:
Always in ARO
Steps to Reproduce:
1. Deploy 150 EIPs across 150 namespaces in a private ARO cluster with UDR that has 3 nodes for EgressIP, i.e. labeled k8s.ovn.org/egress-assignable=true
2. Initiate a rolling restart of the worker nodes
3. The node annotation states a capacity of 255, but Azure has a capacity limit of 300 security rules per NIC, so the actual limit is reached at approximately 75 EIPs (in this scenario)
4. CNCC ignores the error response from Azure and continues to assign CloudPrivateIPConfig objects to the saturated node
Actual results:
- CNCC keeps assigning new IPs to a saturated node.
- CloudPrivateIPConfig objects remain stuck in CloudResponseError.
- No automatic redistribution to other egress-assignable nodes.
Expected results:
- CNCC should detect that the cloud provider is returning an error.
- Scheduler logic should redistribute new or failing CloudPrivateIPConfig objects to other available `egress-assignable` nodes automatically.
Additional info:
We're hitting this undocumented limit because, on ARO, EgressIPs are being added to the backend pool; this is being addressed in OCPBUGS-57447. Regardless, CNCC should be able to detect the errors that the cloud provider is returning.