OpenShift Bugs / OCPBUGS-63412

etcd was unable to schedule one member due to its volume being bound to a specific node.


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Undefined
    • Affects Version: 4.17.z
    • Component: HyperShift
    • Category: Quality / Stability / Reliability

      Description of problem:

      
      During a recent alert (https://redhat.pagerduty.com/incidents/Q1NZBSK24IKWUF) we noticed that the etcd-1 pod could not
      be scheduled on any node. The root cause was that the etcd-1 pod had a PVC attached whose volume was bound to a
      specific node (volume node affinity), which blocked scheduling the pod onto a different node.
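
      For reference, a minimal sketch of how the volume node affinity conflict can be confirmed; the namespace, pod, and
      PV names below are placeholders, not values taken from the incident:
      {code:bash}
      # Placeholder for the hosted control plane namespace on the management cluster.
      HCP_NS="<hosted-control-plane-namespace>"

      # FailedScheduling events on the stuck member; "volume node affinity conflict"
      # points at the PV binding rather than at CPU/memory pressure.
      oc describe pod etcd-1 -n "$HCP_NS"

      # Inspect the PVC and its bound PV; spec.nodeAffinity on the PV shows which
      # node/zone the volume is pinned to.
      oc get pvc -n "$HCP_NS"
      oc get pv <pv-name> -o yaml
      {code}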
      
      The solution we applied was the following (example commands are sketched after this list):
      1. ensure that the remaining etcd members are healthy by exec'ing into the pods, checking etcdctl endpoint health, and forcing a defragmentation
      2. delete the PVC, then delete the etcd-1 pod so that it can be rescheduled on a healthy node
      3. once the etcd-1 pod is up and healthy, check the health of the etcd cluster again
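
      A rough sketch of the commands behind the steps above; pod, container, PVC, and namespace names are placeholders,
      and the etcdctl invocations assume the cert/endpoint environment is already set inside the etcd container:
      {code:bash}
      HCP_NS="<hosted-control-plane-namespace>"   # placeholder

      # 1. Check health of the surviving members and defragment them.
      oc exec -n "$HCP_NS" etcd-0 -c etcd -- etcdctl endpoint health --cluster
      oc exec -n "$HCP_NS" etcd-0 -c etcd -- etcdctl defrag
      oc exec -n "$HCP_NS" etcd-2 -c etcd -- etcdctl defrag

      # 2. Delete the PVC that pins etcd-1 to the old node, then delete the pod so
      #    the StatefulSet recreates both on a schedulable node.
      oc delete pvc "<etcd-1-pvc-name>" -n "$HCP_NS"
      oc delete pod etcd-1 -n "$HCP_NS"

      # 3. Verify the member rejoined and the cluster is healthy again.
      oc exec -n "$HCP_NS" etcd-0 -c etcd -- etcdctl endpoint status --cluster -w table
      oc exec -n "$HCP_NS" etcd-0 -c etcd -- etcdctl endpoint health --cluster
      {code}
      Note that the PVC deletion stays pending on the pvc-protection finalizer until the pod is deleted, which is why
      step 2 pairs the two deletions.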
      
      The control plane operator did not attempt to fix the issue, and the autoscaler did not add new nodes despite the scheduler reporting: "nodes are available: 1 Insufficient cpu, 1 node(s) were unschedulable, 14 node(s) had volume node affinity conflict, 2 Insufficient memory, 2 Too many pods, 3 node(s) had untolerated taint {node-role.kubernetes.io/infra: }, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }, 3 node(s) had untolerated taint {obo: true}, 7 node(s) didn't match pod anti-affinity rules, 88 node(s) had untolerated taint {hypershift.openshift.io/request-serving-component: true}. preemption: 0/122 nodes are available: 112 Preemption is not helpful for scheduling, 2 node(s) had volume node affinity conflict, 8 node(s) didn't match pod anti-affinity rules."
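
      It is unclear why no scale-up was triggered; a sketch of where one might look, assuming a standalone OCP management
      cluster running the default ClusterAutoscaler (the namespace and deployment names are assumptions):
      {code:bash}
      HCP_NS="<hosted-control-plane-namespace>"   # placeholder

      # Scheduler events for the pending member show the full failure message above.
      oc get events -n "$HCP_NS" --field-selector involvedObject.name=etcd-1

      # Autoscaler logs on the management cluster; look for why no node group was
      # considered a valid scale-up target for the pending pod.
      oc logs -n openshift-machine-api deployment/cluster-autoscaler-default
      {code}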
      
      At the same time, an upgrade of the management cluster (MC) was in progress, which was attempting to replace the node running etcd-0.
      
      
          Version-Release number of selected component (if applicable):

          4.17.z

      How reproducible:

      
          

      Steps to Reproduce:

          1.
          2.
          3.
          

      Actual results:

      
          

      Expected results:

      
          

      Additional info:

      
          

              Assignee: Unassigned
              Reporter: Petr Kotas (pkotas)
              QA Contact: Yu Li
              Votes: 0
              Watchers: 3

                Created:
                Updated: