  1. Red Hat Advanced Cluster Security
  2. ROX-28194

Probe service keeps creating instances when a data plane cluster is unavailable

    • Type: Task
    • Resolution: Done
    • Priority: Major
    • Future Sustainability

      When a data plane cluster is unavailable, the probe service keeps creating instances. This floods Fleet Manager with pending requests. Once the data plane cluster becomes available again, it cannot handle the accumulated load, which causes a sharp increase in resource consumption and numerous alerts.

      Problems:

      1. Cleanup runs after instance creation. If the data plane is unavailable, an instance can be neither provisioned nor deleted, so even when the cleanup task fails, the probe creates a new instance on the next iteration.
      2. If multiple instances need to be deleted, the cleanup task waits for the first instance to be deleted and then times out before it processes the others. This leaves (N-1) instances pending, although they could at least be marked as deprovisioning so that fleetshard sync does not create resources for them on the cluster once it is back up.
      3. The cleanup task removes instances for all regions. If a data plane cluster is unavailable in one region, cleanup fails for both regions (because of #2) and leaves instances behind in the healthy region.
      4. The Prometheus rule is not filtered by region, so if only one cluster is down, the alert does not fire.
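
      One possible way to address problems 1 to 3 is sketched below. This is not the probe service's actual code: `FakeFleet`, `cleanup_region`, `run_probe`, `simulate`, and the region names `region-a` and `region-b` are all hypothetical names made up for illustration. The sketch assumes that cleanup runs before creation, that every leftover instance is marked as deprovisioning without waiting for any single deletion to finish, and that each region is handled independently.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    id: str
    region: str
    status: str = "pending"


@dataclass
class FakeFleet:
    """In-memory stand-in for Fleet Manager (hypothetical, illustration only)."""
    down_regions: set = field(default_factory=set)
    instances: dict = field(default_factory=dict)
    next_id: int = 0

    def list_instances(self, region):
        return [i for i in self.instances.values() if i.region == region]

    def mark_deprovisioning(self, instance_id):
        inst = self.instances[instance_id]
        if inst.region in self.down_regions:
            # Data plane is down: the request is recorded but never completes.
            inst.status = "deprovisioning"
        else:
            # Healthy data plane: model the deletion as completing immediately.
            del self.instances[instance_id]

    def create(self, region):
        self.next_id += 1
        inst = Instance(f"probe-{self.next_id}", region)
        self.instances[inst.id] = inst
        return inst


def cleanup_region(fleet, region):
    """Mark every leftover instance in one region; return how many remain."""
    for inst in fleet.list_instances(region):
        if inst.status != "deprovisioning":
            # Do not wait here, so one stuck instance cannot block the rest.
            fleet.mark_deprovisioning(inst.id)
    return len(fleet.list_instances(region))


def run_probe(fleet, region):
    """Clean up first; skip creation while leftovers exist in this region."""
    if cleanup_region(fleet, region) > 0:
        return None
    return fleet.create(region)


def simulate(iterations=3):
    """Run the probe loop with region-a unavailable and region-b healthy."""
    fleet = FakeFleet(down_regions={"region-a"})
    for _ in range(iterations):
        for region in ("region-a", "region-b"):
            run_probe(fleet, region)
    return fleet
```

      In this simplified model, the unavailable region accumulates at most one instance, which is marked as deprovisioning, while the healthy region keeps cycling through create and delete as usual.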

              ykovalev@redhat.com Yury Kovalev
              ACS Cloud Service