Type: Task
Resolution: Done
Priority: Major
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels:
- operations

Activity Type:
Future Sustainability
Epic Link:
ROX-25121
Blocked:
False
Blocked Reason:
None
Ready:
False

This floods Fleet Manager with pending requests, and once the data plane cluster becomes available again, it results in a dramatic increase in resource consumption and numerous alerts because the cluster is unable to handle this load.

Problems:

1. Cleanup is called after instance creation. If the data plane is unavailable, an instance can be neither provisioned nor deleted, so even when the cleanup task fails, the probe creates a new instance on the next iteration.
2. If multiple instances need to be deleted, the cleanup task waits until the first instance is deleted and then times out before it processes the others. This leaves (N-1) instances pending, although they could at least be marked as deprovisioning so that fleetshard sync does not create resources for them on the cluster once it is up again.
3. The cleanup task removes instances for all regions. If a data plane cluster is unavailable in one region, cleanup fails for both regions (because of #2) and leaves instances behind in the healthy region.
4. The Prometheus rule is not filtered by region, so if only one cluster is down, the alert does not fire.
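
One possible way to address the first three problems above is sketched below in Python. It is an illustration under stated assumptions, not the actual acs-fleet-manager probe code: `FakeFleetManager`, `run_probe_iteration`, the status strings, and the region names are all hypothetical. The sketch cleans up before creating, requests deletion of every leftover instance without waiting on each one in turn, and scopes the work to a single region so that an unavailable region cannot block a healthy one.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    id: str
    region: str
    status: str = "pending"


@dataclass
class FakeFleetManager:
    """In-memory stand-in for the Fleet Manager API (hypothetical)."""
    instances: dict = field(default_factory=dict)
    unavailable_regions: set = field(default_factory=set)

    def list_instances(self, region):
        return [i for i in self.instances.values() if i.region == region]

    def create(self, instance_id, region):
        # The request is accepted even if the data plane is down;
        # this is what lets pending requests pile up.
        self.instances[instance_id] = Instance(instance_id, region)

    def request_delete(self, instance_id):
        # Only marks the instance; removal needs the data plane.
        self.instances[instance_id].status = "deprovisioning"

    def reconcile(self):
        # Simulates the data plane processing requests in available regions.
        for inst in list(self.instances.values()):
            if inst.region in self.unavailable_regions:
                continue
            if inst.status == "deprovisioning":
                del self.instances[inst.id]
            elif inst.status == "pending":
                inst.status = "ready"


def run_probe_iteration(fm, region, new_id):
    """Clean up leftovers in one region, then create a probe instance.

    Returns False (and creates nothing) if leftovers could not be removed.
    """
    # Mark every leftover as deprovisioning up front, without waiting
    # for the first deletion to finish before requesting the next.
    for inst in fm.list_instances(region):
        fm.request_delete(inst.id)
    fm.reconcile()  # stands in for a bounded wait in a real probe
    if fm.list_instances(region):
        return False  # data plane still down: do not add more load
    fm.create(new_id, region)
    return True
```

With one region unavailable, an iteration for that region marks all of its leftovers as deprovisioning and skips creation, while an iteration for the healthy region still cleans up and creates its instance as usual.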

links to

stackrox/acs-fleet-manager#2228: ROX-28194: Probe cleanup before execute

mentioned on

Merge request - ROX-28194: ACS Probe prometheus rules improvements

Merge request - ROX-28194: Probe delete specs

Assignee: Yury Kovalev

Reporter: Yury Kovalev

Team: ACS Cloud Service

Votes: 0

Watchers: 2

Created: 2025/02/20 9:12 AM

Updated: 2025/06/27 5:12 AM

Resolved: 2025/03/20 9:53 AM
