Bug
Resolution: Unresolved
Normal
4.14.z
Quality / Stability / Reliability
Low
Description of problem:
Running on EC2 with large hardware spec worker nodes, such as G6.48xlarge with a 100 Gbps NIC, the GPU operator will sometimes fail on the node, logging "Unable to connect to server: dial tcp <some redacted gateway ip>: i/o timeout" followed on the next line by "Unable to get value of the 'nvidia.com/gpu.deploy.gpu-feature-discovery' label". The visible failure is a CrashLoopBackOff of the NVIDIA device plugin validator pod, which complains that "allocate failed due to required number of devices unavailable for nvidia.com/gpu. Requested 1, Available 0 which is unexpected". There is also an associated nvidia-operator-validator pod on the same node that fails to initialize because of the failed plugin validation container, which I assume is related to the previous error. Rebooting the node seems to resolve the issue. Sometimes it happens repeatedly, and other times it won't happen for days, but it is a problem when using an autoscaler. It never seems to happen on nodes with lower memory and network bandwidth.
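For triage on an affected node, a minimal diagnostic sketch (not from the report; it assumes cluster access via a kubeconfig and uses a placeholder node name) that checks the two things the failing containers complain about: the gpu-feature-discovery label and the advertised nvidia.com/gpu allocatable count.

```python
# Diagnostic sketch only; the node name and kubeconfig source are assumptions.
from kubernetes import client, config

NODE_NAME = "ip-10-0-0-1.ec2.internal"  # hypothetical node name

config.load_kube_config()  # or config.load_incluster_config() when run in a pod
v1 = client.CoreV1Api()

node = v1.read_node(NODE_NAME)
labels = node.metadata.labels or {}
allocatable = node.status.allocatable or {}

# The operator validator logs "Unable to get value of the
# 'nvidia.com/gpu.deploy.gpu-feature-discovery' label" when this is missing.
gfd_label = labels.get("nvidia.com/gpu.deploy.gpu-feature-discovery")
print(f"gpu-feature-discovery label: {gfd_label!r}")

# The plugin validator fails with "Requested 1, Available 0" when the device
# plugin never advertises any nvidia.com/gpu resources on the node.
gpu_count = allocatable.get("nvidia.com/gpu", "0")
print(f"allocatable nvidia.com/gpu: {gpu_count}")
```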
Version-Release number of selected component (if applicable):
OCP 4.14
How reproducible:
Sometimes
Steps to Reproduce:
1. Set up a cluster on EC2.
2. Install the NVIDIA GPU operator and OpenShift AI.
3. Scale up the cluster using the GPU node type G6.48xlarge with a 100 Gbps NIC (see the sketch after this list).
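For step 3, a hypothetical scale-up sketch, assuming the GPU workers come from an OpenShift MachineSet in openshift-machine-api (the MachineSet name and replica count are placeholders; scaling through the cluster autoscaler or the console is equivalent).

```python
# Hypothetical scale-up of a GPU MachineSet; names and counts are assumptions.
from kubernetes import client, config

MACHINESET = "mycluster-gpu-g6-48xlarge-us-east-1a"  # placeholder MachineSet name
REPLICAS = 2

config.load_kube_config()
api = client.CustomObjectsApi()

# Patch the replica count; the machine-api controller then provisions the
# G6.48xlarge workers on which the GPU operator intermittently fails to validate.
api.patch_namespaced_custom_object(
    group="machine.openshift.io",
    version="v1beta1",
    namespace="openshift-machine-api",
    plural="machinesets",
    name=MACHINESET,
    body={"spec": {"replicas": REPLICAS}},
)
```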
Actual results:
The GPU operator fails about 1 in 5 times with the error "Unable to connect to server: dial tcp <some redacted gateway ip>: i/o timeout", followed on the next line by "Unable to get value of the 'nvidia.com/gpu.deploy.gpu-feature-discovery' label".
Expected results:
The node starts properly and AI workloads are able to access the GPU.
Additional info:
Rebooting the node always seems to resolve the issue.