OpenShift Bugs / OCPBUGS-57001

Prometheus volume affinity conflict resulting in pending pod

      Description of problem:

          A pod, specifically prometheus-k8s-0 within the openshift-monitoring namespace, is unable to start and remains in a Pending state. The primary error message is: "0/3 nodes are available: 3 node(s) had volume node affinity conflict."
      
      Investigation reveals that the PersistentVolumeClaim (PVC) prometheus-data-prometheus-k8s-0 is annotated with volume.kubernetes.io/selected-node: ip-10-51-2-129.us-west-2.compute.internal. However, this node (ip-10-51-2-129.us-west-2.compute.internal) no longer exists in the cluster.
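      The staleness of the selected-node annotation can be checked mechanically. A minimal sketch, using plain dicts standing in for the objects returned by `oc get pvc -o json` / `oc get nodes` (the surviving node name in the example is hypothetical; only the deleted node's name comes from this report):

      ```python
      # Detect a PVC whose volume.kubernetes.io/selected-node annotation
      # points at a node that no longer exists in the cluster.

      SELECTED_NODE = "volume.kubernetes.io/selected-node"

      def orphaned_selected_node(pvc, node_names):
          """Return the stale node name if the PVC's selected-node annotation
          references a node absent from node_names, else None."""
          annotations = pvc.get("metadata", {}).get("annotations", {})
          selected = annotations.get(SELECTED_NODE)
          if selected is not None and selected not in node_names:
              return selected
          return None

      # Shaped like this bug report:
      pvc = {
          "metadata": {
              "name": "prometheus-data-prometheus-k8s-0",
              "annotations": {
                  SELECTED_NODE: "ip-10-51-2-129.us-west-2.compute.internal",
              },
          }
      }
      current_nodes = {"ip-10-51-3-10.us-west-2.compute.internal"}  # hypothetical survivor
      print(orphaned_selected_node(pvc, current_nodes))
      # → ip-10-51-2-129.us-west-2.compute.internal (annotation is stale)
      ```

      A None result would mean the annotation still points at a live node and the scheduling failure has some other cause.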
      
      The associated PersistentVolume (PV pvc-37f5792d-eff9-4304-8dcc-90215ff84f0f), an AWS EBS volume, carries a node affinity requirement tied to the Availability Zone in which the volume was provisioned (the zone of the now-deleted node). Consequently, none of the three currently available worker nodes, all located in us-west-2c, can satisfy the volume's node affinity requirement, and scheduling fails. This effectively means the EBS volume is "leaked" or orphaned from the perspective of the active cluster nodes.
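      The scheduling failure itself follows from the scheduler's volume node affinity check: a node is eligible only if its labels satisfy at least one nodeSelectorTerm of the PV's spec.nodeAffinity.required. A minimal sketch of that check; the zone value pinned on the PV below (us-west-2a) is an assumption for illustration, since the report only states that all three surviving nodes are in us-west-2c:

      ```python
      # Sketch of the scheduler's PV node affinity check:
      # nodeSelectorTerms are ORed; matchExpressions within a term are ANDed.

      def node_satisfies_pv_affinity(node_labels, pv_node_affinity):
          """True if the node's labels satisfy the PV's required node affinity."""
          terms = pv_node_affinity.get("required", {}).get("nodeSelectorTerms", [])
          if not terms:
              return True  # a PV with no required affinity fits any node
          for term in terms:
              if all(
                  expr["operator"] == "In"
                  and node_labels.get(expr["key"]) in expr["values"]
                  for expr in term.get("matchExpressions", [])
              ):
                  return True
          return False

      # EBS PV pinned (hypothetically) to us-west-2a:
      pv_affinity = {
          "required": {
              "nodeSelectorTerms": [{
                  "matchExpressions": [{
                      "key": "topology.kubernetes.io/zone",
                      "operator": "In",
                      "values": ["us-west-2a"],
                  }]
              }]
          }
      }
      worker = {"topology.kubernetes.io/zone": "us-west-2c"}
      print(node_satisfies_pv_affinity(worker, pv_affinity))  # → False
      ```

      With all three workers carrying the us-west-2c zone label, every node fails this check, which matches the observed "3 node(s) had volume node affinity conflict" event.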

      Version-Release number of selected component (if applicable):

          4.14.30

      How reproducible:

          Unknown

      Steps to Reproduce:

          1. Unknown
          

      Actual results:

          The Prometheus pod is stuck in Pending because of what appears to be a previously leaked PVC bound to a volume no existing node can satisfy.

      Expected results:

          The Prometheus pod is able to schedule; the PVC's volume resides in an Availability Zone with available nodes.

      Additional info:

      Must-gather attached (see comments). This could be related to CAPI's handling of node deletions (skipping volumes at some point), but that remains unclear: the issue appears to be infrequent, and testing was unable to reproduce it.
      
      WORKAROUND:
      - Delete the PVC and the pod so that a new PVC is created and provisioned on an existing node.

              Assignee: Unassigned
              Reporter: Claudio Busse (cbusse.openshift)
              QA Contact: Yu Li