Type: Bug
Resolution: Done
Priority: Normal
Fix Version/s: None
Affects Version/s: 4.12.z
Component/s: HyperShift
Labels:

Regression: No
Release Blocker: Rejected
Blocked: False
Blocked Reason: None

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Description of problem:

SRE investigated a [NodeNotReady alert|https://redhat.pagerduty.com/incidents/Q03OFJXEUBFLRU] and found that a HostedCluster's node/machine marked NotReady/Unhealthy had been stuck in the "Deleting" state for almost an hour. CAPI logs from the management cluster seem to indicate a race between (unsuccessfully) updating the unhealthy machine's security groups and actually terminating it.

Version-Release number of selected component (if applicable):

Cluster Version: 4.12.22
CAPI image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a70ac366dcbd7d3a2d091ec2e505547b035a7a670e6659ac45c6e2708adcdacb

How reproducible:

Unclear

Steps to Reproduce:

1. Observe an HCP worker node whose security groups CAPI is trying to update while receiving a "Client.InvalidPermission.Duplicate" error
2. Make that node go Unhealthy
3. Observe CAPI's reaction

Actual results:

CAPI tries and fails to update the security groups, then stops reconciling that node and does not delete or replace the unhealthy machine. It eventually does so, almost an hour later in this case.

Expected results:

CAPI deletes and replaces the unhealthy machine within a few minutes
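
The expected behaviour can be illustrated with a minimal sketch. This is not CAPI's code or the AWS provider's API: `CloudError`, `ensure_ingress_rule`, `reconcile`, and the `machine` dictionary are hypothetical names introduced only for illustration. It assumes that a duplicate-rule error means the rule already exists, so it should not block reconciliation, and that a machine already marked for deletion should be terminated without first retrying the security group update.

```python
# Hypothetical sketch: none of these names come from CAPI or the AWS provider.
DUPLICATE_CODE = "InvalidPermission.Duplicate"  # error code quoted in this report


class CloudError(Exception):
    """Stand-in for a cloud API error carrying a string error code."""

    def __init__(self, code):
        super().__init__(code)
        self.code = code


def ensure_ingress_rule(authorize):
    """Call authorize(); treat a duplicate-rule error as 'rule already present'."""
    try:
        authorize()
    except CloudError as err:
        # Match the code with or without the "Client." prefix seen in the report.
        if not err.code.endswith(DUPLICATE_CODE):
            raise  # any other error is still surfaced to the caller


def reconcile(machine, authorize, terminate):
    """Reconcile one machine; deletion takes priority over rule updates."""
    if machine["deleting"]:
        terminate()
        return "terminated"
    ensure_ingress_rule(authorize)
    return "reconciled"
```

In this sketch a machine in the "Deleting" state never reaches the security group call, so a failing rule update cannot delay its termination.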

Additional info:

It's not clear what made CAPI stop trying to reconcile the security group (after an hour of trying) and skip straight to deleting the unhealthy machine.

links to

Merged 🌱 Propagate timeout fields from MachineSet to Machine during Machine deletion #10589

Assignee: Alberto Garcia Lamela

Reporter: Anthony Byrne

QA Contact: Jie Zhao

Votes: 0

Watchers: 6

Created: 2023/08/08 10:35 PM

Updated: 2024/09/02 5:00 PM

Resolved: 2024/09/02 5:00 PM
