-
Bug
-
Resolution: Done
-
Normal
-
None
-
4.12.z
-
No
-
Rejected
-
False
-
Description of problem:
SRE investigated a [NodeNotReady alert|https://redhat.pagerduty.com/incidents/Q03OFJXEUBFLRU] and found that a HostedCluster's node/machine marked NotReady/Unhealthy had been stuck in the "Deleting" state for almost an hour. CAPI logs from the management cluster seem to indicate a race between (unsuccessfully) updating the unhealthy machine's security groups and actually terminating it.
Version-Release number of selected component (if applicable):
Cluster Version: 4.12.22
CAPI image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a70ac366dcbd7d3a2d091ec2e505547b035a7a670e6659ac45c6e2708adcdacb
How reproducible:
Unclear
Steps to Reproduce:
1. Observe an HCP worker node whose security groups CAPI is trying to update, but for which it receives a "Client.InvalidPermission.Duplicate" error
2. Make that node go Unhealthy
3. Observe CAPI's reaction
Actual results:
CAPI tries and fails to update the security groups, then stops making progress on that node's reconciliation. The unhealthy machine is not deleted or replaced until CAPI eventually proceeds with the deletion (about an hour later in this case).
Expected results:
CAPI deletes and replaces the unhealthy machine within a few minutes
Additional info:
It's not clear what made CAPI stop trying to reconcile the security groups (after an hour of trying) and skip straight to deleting the unhealthy machine.
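To make the suspected ordering problem concrete, here is a minimal Python sketch of a reconcile step that would not be blocked in this way. It is purely illustrative and is not taken from the CAPI source: `Machine`, `CloudError`, `reconcile`, and the two callbacks are hypothetical names, and the only detail borrowed from this report is the "Client.InvalidPermission.Duplicate" error string. The sketch assumes two things: that a machine already marked for deletion should be terminated before any security group work is attempted, and that a duplicate-rule error means the rule already exists and can be treated as success.

```python
from dataclasses import dataclass

# Error string as reported in this bug; matched by suffix so that a code
# without the "Client." prefix is also accepted (an assumption of this sketch).
DUPLICATE_SUFFIX = "InvalidPermission.Duplicate"


class CloudError(Exception):
    """Hypothetical stand-in for a cloud API error carrying an error code."""

    def __init__(self, code):
        super().__init__(code)
        self.code = code


@dataclass
class Machine:
    """Hypothetical, heavily simplified machine record."""

    name: str
    deleting: bool = False
    terminated: bool = False


def reconcile(machine, update_security_groups, terminate):
    """One reconcile pass; returns a short label describing what happened.

    update_security_groups and terminate are caller-supplied callbacks
    (hypothetical) that take the machine as their only argument.
    """
    # Deletion is handled first, so a failing security group update
    # cannot delay replacement of an unhealthy machine.
    if machine.deleting:
        terminate(machine)
        return "terminated"

    try:
        update_security_groups(machine)
    except CloudError as err:
        # A duplicate rule means the desired rule is already present,
        # so it is treated as success; any other error is re-raised.
        if not err.code.endswith(DUPLICATE_SUFFIX):
            raise
    return "reconciled"
```

With this ordering, a machine in the "Deleting" state is terminated on the next pass even if every security group update for it fails with the duplicate error. Whether CAPI's real reconcile logic can be structured this way is not something this report establishes.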