Type: Bug
Resolution: Not a Bug
Priority: Normal
Fix Version/s: None
Affects Version/s: 4.14.z, 4.15.z, 4.17.z, 4.16.z
Component/s: Node Maintenance Operator
Labels:
None

Activity Type:
Quality / Stability / Reliability
Blocked:
False
Blocked Reason:
None
Story Points:
None
Severity:
Moderate
Regression:
None

Target Backport Versions:
None
Target Version:
None
Release Blocker:
None
Sprint:
None

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Review Complete:
PX Impact Score:

Release Note Status:
None
Release Note Type:
None
Release Note Text:
None

Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

Description of problem:

    The documented procedure for replacing an etcd member needs to be tested, and changed where necessary, for bare metal IPI clusters where the openshift-workload-availability operators are installed and configured. These operators cause multiple issues when removing a master node, and at the moment there is no way to disable NHC and remediations without completely removing the operators.

Once the node is turned off, these operators send a signal to the management boards to force all nodes to be rebooted. This causes multiple issues in the cluster, because the reboots happen while a very sensitive task, the etcd member removal, is in progress.

Upon checking, the customer saw this event in the iDRAC logs of their servers:

Wed Nov 20 2024 08:42:38 : The watchdog timer reset the system.
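
To check whether the same event shows up on the other servers, something like the following could be used to scan exported iDRAC log text for watchdog resets. This is only a minimal sketch: the line format is assumed from the single event quoted above, and the function name `find_watchdog_resets` is hypothetical, not part of any iDRAC or OpenShift tooling.

```python
import re
from datetime import datetime

# Assumed line shape, based only on the single event quoted above:
# "<weekday> <month> <day> <year> <HH:MM:SS> : <message>"
LINE_RE = re.compile(r"^(\w{3} \w{3} \d{1,2} \d{4} \d{2}:\d{2}:\d{2}) : (.*)$")
WATCHDOG_TEXT = "watchdog timer reset the system"


def find_watchdog_resets(lines):
    """Return the timestamps of log lines that report a watchdog reset."""
    resets = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip lines that do not follow the assumed format
        stamp, message = match.groups()
        if WATCHDOG_TEXT in message.lower():
            # Month and weekday names are parsed using the current locale
            resets.append(datetime.strptime(stamp, "%a %b %d %Y %H:%M:%S"))
    return resets


sample = ["Wed Nov 20 2024 08:42:38 : The watchdog timer reset the system."]
print(find_watchdog_resets(sample))
```

Comparing the returned timestamps with the time the master node was powered off would show whether the resets line up with the etcd member replacement.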

Also, deleting the SelfNodeRemediationConfig CR doesn't seem to work, as the operator recreates it.

Version-Release number of selected component (if applicable):

   OCP 4.14 and above

How reproducible:

    Always, when the openshift-workload-availability operators are installed. Without them, the replacement works without any issues.

Steps to Reproduce:

    1. Install a bare metal IPI cluster
    2. Install and configure the openshift-workload-availability operators
    3. Replace an etcd member following our documented procedure [1]

[1] https://docs.openshift.com/container-platform/4.14/backup_and_restore/control_plane_backup_and_restore/replacing-unhealthy-etcd-member.html#replacing-the-unhealthy-etcd-member

Actual results:

Expected results:

Additional info:

    This upstream issue seems to be relevant - https://github.com/medik8s/self-node-remediation/issues/219

links to

workaround when ETCD replacement is needed

Assignee: Or Raz

Reporter: Andre Costa

QA Contact: Anna Savina Frances

Need Info From: None

Votes: 0

Watchers: 4

Created: 2024/12/19 2:19 PM

Updated: 2025/09/13 8:14 PM

Resolved: 2025/02/04 12:10 PM
