Type: Story
Resolution: Unresolved
Priority: Normal
We currently have a job that periodically sends OpenStack usage to a mailing list. The problem is that nobody seems to read that email, probably because it contains too much information, none of it actionable.
There are several issues:
1. some non-hive clusters are forgotten
2. the destroy job for non-hive clusters sometimes fails, but this goes unnoticed, leaving resources behind in OpenStack
3. the automatic uninstall of hive clusters sometimes fails, which also goes unnoticed, leaving resources behind in OpenStack
For 1, I suggest changing the format of the periodic email. Rather than listing all OpenStack machines, we can use a script like this to summarize them into running clusters:
openstack server list -f value -c Name | grep -E 'master|worker' | sed 's/-\(master\|worker\).*$//' | sort | uniq -c
This gives a list of cluster names along with the number of machines each one is running, which is a much cleaner view of what is up and whose responsibility it is to clean up. We can also point to the manual delete script so that people can self-delete clusters that cannot be deleted using the Jenkins jobs. A sketch of the whole job is below.
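As a rough sketch of what the reformatted periodic job could look like (the mailing-list address is a placeholder, and the use of mail(1) is an assumption about what is available on the job host):

#!/usr/bin/env bash
# Summarize OpenStack VMs into per-cluster machine counts for the periodic email.
# Assumes cluster VMs are named <cluster>-master-* / <cluster>-worker-*.
set -euo pipefail

summary=$(openstack server list -f value -c Name \
  | { grep -E 'master|worker' || true; } \
  | sed 's/-\(master\|worker\).*$//' \
  | sort | uniq -c | sort -rn)

# dev-list@example.com is a placeholder for the real mailing list.
mail -s "OpenStack cluster usage summary" dev-list@example.com <<EOF
Running clusters (machine count, cluster name):

$summary

If one of these clusters is yours and no longer needed, please clean it up
with the Jenkins destroy job, or with the manual delete script if that fails.
EOF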
For 2, the solution will be more complicated: we'll need a script similar to the one in 1 to get the list of running clusters, then use that list to identify orphaned resources (such as OpenStack networks, routers, ports, etc.) and delete those. A dry-run sketch follows.
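A minimal dry-run sketch of that filtering, assuming orphaned resources carry the cluster name as a name prefix; this convention must be verified before any automatic deletion, which is exactly why automating this step is risky:

#!/usr/bin/env bash
# Dry run only: print OpenStack networks/routers/ports whose name does not
# start with the name of any currently running cluster. Assumes resources
# are named with the cluster name as a prefix; unnamed resources (common
# for ports) would need a different heuristic, e.g. matching by network.
set -euo pipefail

# Running clusters, derived exactly as in 1.
clusters=$(openstack server list -f value -c Name \
  | { grep -E 'master|worker' || true; } \
  | sed 's/-\(master\|worker\).*$//' \
  | sort -u)

for kind in network router port; do
  openstack "$kind" list -f value -c ID -c Name | while read -r id name; do
    match=0
    for c in $clusters; do
      case "$name" in "$c"*) match=1; break ;; esac
    done
    if [ "$match" -eq 0 ]; then
      echo "orphan candidate: $kind $id $name"
    fi
  done
done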
For 3, we'll need to reference which hive clusters are currently running and delete resources belonging to older deployments of the cluster, e.g.
kubectl get clusterdeploymentcustomizations -o jsonpath='{range .items[*]}{.status.clusterDeploymentRef.name}{"\n"}{end}' | xargs -I{} kubectl get clusterdeployment -n {} {} -o jsonpath='{.spec.clusterMetadata.infraID}{"\n"}'
shows us that there's a cluster user-rhos-d-5-h6skp, but using the script from 1 we see there's also user-rhos-d-5-95sp8 with two machines running. These are orphaned resources from a previous unsuccessful install of the user-rhos-d-5 cluster and should be deleted. We can pass the full cluster tag to the manual delete script to delete all the dangling resources, as sketched below.
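Putting the two lookups together, a sketch of the automated check could look like this (it assumes the VM naming convention from 1; note that non-hive clusters will also show up in the diff and would need to be filtered out, e.g. by a known name pattern):

#!/usr/bin/env bash
# Compare the infraIDs hive knows about against the cluster prefixes actually
# running in OpenStack. Anything running in OpenStack that hive doesn't know
# about is a leftover from an older deployment and a candidate for the manual
# delete script. Non-hive clusters also appear here and must be excluded.
set -euo pipefail

hive_ids=$(kubectl get clusterdeploymentcustomizations \
  -o jsonpath='{range .items[*]}{.status.clusterDeploymentRef.name}{"\n"}{end}' \
  | xargs -I{} kubectl get clusterdeployment -n {} {} \
      -o jsonpath='{.spec.clusterMetadata.infraID}{"\n"}')

openstack server list -f value -c Name \
  | { grep -E 'master|worker' || true; } \
  | sed 's/-\(master\|worker\).*$//' \
  | sort -u \
  | while read -r cluster; do
      if ! grep -qx "$cluster" <<<"$hive_ids"; then
        echo "orphan candidate (pass to the manual delete script): $cluster"
      fi
    done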
We can probably automate 3, but automating 2 is risky, because there's a chance we'll delete resources that are in use.