Type: Story
Resolution: Unresolved
Priority: Normal
We currently have a job that periodically sends OpenStack usage to a mailing list. The problem is that nobody seems to read that email, probably because it contains too much information, none of it actionable.
There are several issues:
1. some non-hive clusters are forgotten
2. the destroy job for non-hive clusters sometimes fails, but this goes unnoticed, leaving resources behind in OpenStack
3. the automatic uninstall of hive clusters sometimes fails, which also goes unnoticed, leaving resources behind in OpenStack
For 1, I suggest changing the format of the periodic email. Rather than listing all OpenStack machines, we can use a script like this to summarize them into running clusters:
openstack server list -f value -c Name | grep -E 'master|worker' | sed 's/-\(master\|worker\).*$//' | sort | uniq -c
This gives a list of cluster names along with the number of machines each one is running, which is a much cleaner view of what is up and whose responsibility it is to clean up. We can also point to the manual delete script so that people can self-delete clusters that cannot be deleted using the Jenkins jobs. A sketch of the whole job is below.
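As a rough sketch of what the reformatted periodic job could look like (the mailing-list address is a placeholder, and the use of mail(1) is an assumption about what is available on the job host):

#!/usr/bin/env bash
# Summarize OpenStack VMs into per-cluster machine counts for the periodic email.
# Assumes cluster VMs are named <cluster>-master-* / <cluster>-worker-*.
set -euo pipefail

summary=$(openstack server list -f value -c Name \
  | { grep -E 'master|worker' || true; } \
  | sed 's/-\(master\|worker\).*$//' \
  | sort | uniq -c | sort -rn)

# dev-list@example.com is a placeholder for the real mailing list.
mail -s "OpenStack cluster usage summary" dev-list@example.com <<EOF
Running clusters (machine count, cluster name):

$summary

If one of these clusters is yours and no longer needed, please clean it up
with the Jenkins destroy job, or with the manual delete script if that fails.
EOF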
For 2, the solution will be more complicated: we'll need a script similar to the one in 1 to get the list of running clusters, then use that list to identify orphaned resources (such as OpenStack networks, routers, ports, etc.) and delete those. A dry-run sketch follows.
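A minimal dry-run sketch of that filtering, assuming orphaned resources carry the cluster name as a name prefix; this convention must be verified before any automatic deletion, which is exactly why automating this step is risky:

#!/usr/bin/env bash
# Dry run only: print OpenStack networks/routers/ports whose name does not
# start with the name of any currently running cluster. Assumes resources
# are named with the cluster name as a prefix; unnamed resources (common
# for ports) would need a different heuristic, e.g. matching by network.
set -euo pipefail

# Running clusters, derived exactly as in 1.
clusters=$(openstack server list -f value -c Name \
  | { grep -E 'master|worker' || true; } \
  | sed 's/-\(master\|worker\).*$//' \
  | sort -u)

for kind in network router port; do
  openstack "$kind" list -f value -c ID -c Name | while read -r id name; do
    match=0
    for c in $clusters; do
      case "$name" in "$c"*) match=1; break ;; esac
    done
    if [ "$match" -eq 0 ]; then
      echo "orphan candidate: $kind $id $name"
    fi
  done
done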
For 3, we'll need to reference which hive clusters are currently running and delete resources belonging to older deployments of the cluster, e.g.
kubectl get clusterdeploymentcustomizations -o jsonpath='{range .items[*]}{.status.clusterDeploymentRef.name}{"\n"}{end}' | xargs -I{} kubectl get clusterdeployment -n {} {} -o jsonpath='{.spec.clusterMetadata.infraID}{"\n"}'
shows us that there's a cluster user-rhos-d-5-h6skp, but using the script from 1 we see there's also user-rhos-d-5-95sp8 with two machines running. These are orphaned resources from a previous unsuccessful install of the user-rhos-d-5 cluster and should be deleted. We can pass the full cluster tag to the manual delete script to delete all the dangling resources, as sketched below.
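Putting the two lookups together, a sketch of the automated check could look like this (it assumes the VM naming convention from 1; note that non-hive clusters will also show up in the diff and would need to be filtered out, e.g. by a known name pattern):

#!/usr/bin/env bash
# Compare the infraIDs hive knows about against the cluster prefixes actually
# running in OpenStack. Anything running in OpenStack that hive doesn't know
# about is a leftover from an older deployment and a candidate for the manual
# delete script. Non-hive clusters also appear here and must be excluded.
set -euo pipefail

hive_ids=$(kubectl get clusterdeploymentcustomizations \
  -o jsonpath='{range .items[*]}{.status.clusterDeploymentRef.name}{"\n"}{end}' \
  | xargs -I{} kubectl get clusterdeployment -n {} {} \
      -o jsonpath='{.spec.clusterMetadata.infraID}{"\n"}')

openstack server list -f value -c Name \
  | { grep -E 'master|worker' || true; } \
  | sed 's/-\(master\|worker\).*$//' \
  | sort -u \
  | while read -r cluster; do
      if ! grep -qx "$cluster" <<<"$hive_ids"; then
        echo "orphan candidate (pass to the manual delete script): $cluster"
      fi
    done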
We can probably automate 3, but automating 2 is risky, because there's a chance we'll delete resources that are in use.