Bug
Resolution: Unresolved
Normal
rhos-ops-day1day2-edpm
Informational
The OpenStackDataPlaneDeployment controller can get stuck if a job fails to create or is deleted while the deployment is in progress. This appears to happen because the NodeSet conditions and the job hashes stored on the Deployment are used to decide whether a job should be spawned.
lib-common's DoJob looks at the beforeHash, which is passed in from func AnsibleExecution in ansible_execution.go. Because the hash is already saved on the Deployment, DoJob sees no change and doesn't CreateOrPatch the job, even if the job is missing.
Furthermore, AnsibleExecution is only called by DeployService, which in turn is only called if the NodeSet condition for the job on the Deployment is Unknown (not if it is set to False).
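The two guards described above can be modeled with a small, self-contained sketch. This is not the operator's or lib-common's actual code: the `deployment` and `cluster` types and the `doJob` and `reconcile` functions are hypothetical simplifications, and a missing condition is assumed to behave like Unknown.

```go
package main

import "fmt"

// deployment is a hypothetical, simplified stand-in for the Deployment status.
type deployment struct {
	jobHashes  map[string]string // job name -> hash saved on the Deployment
	conditions map[string]string // job name -> "Unknown", "True", or "False"
}

// cluster is a hypothetical stand-in for the jobs that actually exist.
type cluster map[string]bool

// doJob mimics the hash guard: the job is only created when the hash changed.
func doJob(c cluster, name, beforeHash, newHash string) bool {
	if beforeHash == newHash {
		return false // no change seen, so a missing job is not recreated
	}
	c[name] = true
	return true
}

// reconcile mimics the condition guard: the job path is only entered while
// the condition is Unknown (a missing condition is treated as Unknown here).
func reconcile(d *deployment, c cluster, name, newHash string) bool {
	if cond, ok := d.conditions[name]; ok && cond != "Unknown" {
		return false
	}
	created := doJob(c, name, d.jobHashes[name], newHash)
	d.jobHashes[name] = newHash
	return created
}

func main() {
	d := &deployment{jobHashes: map[string]string{}, conditions: map[string]string{}}
	c := cluster{}

	fmt.Println(reconcile(d, c, "bootstrap", "abc")) // true: first pass creates the job
	delete(c, "bootstrap")                            // job is deleted mid-deployment
	fmt.Println(reconcile(d, c, "bootstrap", "abc")) // false: hash unchanged, job stays missing
}
```

In this model the second reconcile never recreates the job because the saved hash still matches, which is the stuck state described in this report.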
This is mainly a debug/dev concern, since deleting a job should never happen in practice. However, if the job fails to create for some reason, I could see this causing the Deployment to get stuck.
The workaround is to edit the Deployment's status subresource and delete both the hash and the NodeSet condition for the job. We could fix this by always calling CreateOrPatch on the job if it doesn't exist, instead of guarding job creation with the hashes and conditions.
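A rough sketch of the proposed direction, under stated assumptions: `jobStore` and `ensureJob` are hypothetical names, and the map stands in for looking the job up in the cluster. The point is only that the decision is based on whether the job actually exists, not on a hash saved on the Deployment.

```go
package main

import "fmt"

// jobStore is a hypothetical stand-in for the cluster: job name -> spec hash.
type jobStore map[string]string

// ensureJob creates the job when it is missing and updates it when its hash
// differs. It returns the action taken.
func ensureJob(store jobStore, name, hash string) string {
	current, exists := store[name]
	switch {
	case !exists:
		store[name] = hash
		return "created"
	case current != hash:
		store[name] = hash
		return "patched"
	default:
		return "unchanged"
	}
}

func main() {
	store := jobStore{}
	fmt.Println(ensureJob(store, "bootstrap", "abc")) // created
	delete(store, "bootstrap")                         // job is deleted mid-deployment
	fmt.Println(ensureJob(store, "bootstrap", "abc")) // created
	fmt.Println(ensureJob(store, "bootstrap", "abc")) // unchanged
}
```

With this shape, a deleted or never-created job is recreated on the next reconcile without anyone editing the status subresource by hand.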