- Task
- Resolution: Done
- Major
This floods Fleet Manager with pending requests. Once the data plane cluster becomes available again, resource consumption rises dramatically and numerous alerts fire because the cluster cannot handle the load.
Problems:
- Cleanup runs after instance creation. If the data plane is unavailable, an instance can be neither provisioned nor deleted, so even when the cleanup task fails, the probe creates a new instance on the next iteration.
- If multiple instances need to be deleted, the cleanup task waits for the first deletion to finish and then times out before processing the rest. This leaves (N-1) instances pending, although they could at least be marked as deprovisioning so that fleetshard sync does not create resources for them on the cluster when it comes back up.
- The cleanup task removes instances in all regions. If the data plane cluster in one region is unavailable, cleanup fails for both regions (because of the second problem above) and leaves instances behind in the healthy region.
- The Prometheus rule is not filtered by region, so if only one cluster is down, the alert does not fire.
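
One possible shape of a probe iteration that avoids the first three problems is sketched below. It is only an illustration under stated assumptions, not the actual fix: `FakeFleetManager`, `Instance`, `cleanup`, `probe_iteration`, and the status strings are hypothetical names standing in for the real Fleet Manager API. The sketch runs cleanup before creation, marks every leftover instance as deprovisioning without waiting on any single deletion, scopes the work to one region, and skips creation while old instances are still pending.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    id: str
    region: str
    status: str = "ready"


class FakeFleetManager:
    """Hypothetical in-memory stand-in for the Fleet Manager API."""

    def __init__(self, instances):
        self.instances = {i.id: i for i in instances}

    def list_instances(self, region):
        return [i for i in self.instances.values() if i.region == region]

    def request_deletion(self, instance_id):
        # Non-blocking: only flips the status, never waits for the data plane.
        self.instances[instance_id].status = "deprovisioning"

    def create_instance(self, instance_id, region):
        self.instances[instance_id] = Instance(instance_id, region)

    def complete_deletions(self):
        # Simulates the data plane coming back and finishing deletions.
        self.instances = {
            k: v for k, v in self.instances.items()
            if v.status != "deprovisioning"
        }


def cleanup(fm, region):
    """Mark every leftover instance in one region as deprovisioning."""
    marked = []
    for inst in fm.list_instances(region):
        if inst.status != "deprovisioning":
            fm.request_deletion(inst.id)
        marked.append(inst.id)
    return marked


def probe_iteration(fm, region, new_id):
    """Clean up first; create a new instance only if nothing is pending."""
    cleanup(fm, region)
    if fm.list_instances(region):
        # Old instances still exist, so the data plane is likely unavailable.
        # Skip creation to avoid piling up pending requests.
        return False
    fm.create_instance(new_id, region)
    return True
```

Because each region is handled in its own call, an outage in one region does not block cleanup or probing in a healthy one.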