OpenShift Bugs / OCPBUGS-62277

PerformanceProfile degraded after master nodes reboot in MNO


    • Type: Bug
    • Resolution: Obsolete
    • Priority: Major
    • Affects Version: 4.18
    • Component: Node Tuning Operator
    • Quality / Stability / Reliability
    • Labels: contract-priority

      During platform resiliency testing (restarting the master nodes one by one; further details below), the customer found the worker PerformanceProfile in a degraded state:

       

      $ oc get performanceprofile worker-profile -o yaml
      [..]
        - lastHeartbeatTime: "2025-09-22T18:52:31Z"
          lastTransitionTime: "2025-09-22T18:52:31Z"
          message: 'rpc error: code = Unavailable desc = keepalive ping failed to receive
            ACK within timeout'
          reason: ComponentCreationFailed
          status: "True"
          type: Degraded
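
      A quick way to surface just this condition is a JSONPath query (a sketch using standard oc JSONPath filtering; the profile name is taken from the output above):

      $ oc get performanceprofile worker-profile \
          -o jsonpath='{.status.conditions[?(@.type=="Degraded")].message}'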

       

      $ oc get pods -n openshift-cluster-node-tuning-operator -o wide
      NAME                                            READY   STATUS    RESTARTS   AGE   IP             NODE      NOMINATED NODE   READINESS GATES
      cluster-node-tuning-operator-759544dd89-qq5w2   1/1     Running   1          21h   172.21.6.84    master1   <none>           <none>
      tuned-2hhhj                                     1/1     Running   3          1d    10.17.96.151   master1   <none>           <none>
      tuned-5gkq8                                     1/1     Running   3          1d    10.17.96.153   worker0   <none>           <none>
      tuned-98mr7                                     1/1     Running   1          1d    10.17.96.154   worker1   <none>           <none>
      tuned-drdkl                                     1/1     Running   4          1d    10.17.96.150   master0   <none>           <none>
      tuned-nfdh8                                     1/1     Running   5          1d    10.17.96.152   master2   <none>           <none> 

 

      $ oc logs -n openshift-cluster-node-tuning-operator tuned-98mr7 
      
      2025-09-22T18:30:27.704821096Z W0922 18:30:27.704786   12661 reflector.go:484] github.com/openshift/cluster-node-tuning-operator/pkg/generated/informers/externalversions/factory.go:140: watch of *v1.Profile ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
      2025-09-22T18:52:50.139863540Z W0922 18:52:50.139824   12661 reflector.go:484] github.com/openshift/cluster-node-tuning-operator/pkg/generated/informers/externalversions/factory.go:140: watch of *v1.Profile ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
      2025-09-22T18:53:21.306887498Z W0922 18:53:21.306844   12661 reflector.go:561] github.com/openshift/cluster-node-tuning-operator/pkg/generated/informers/externalversions/factory.go:140: failed to list *v1.Profile: Get "https://172.22.0.1:443/apis/tuned.openshift.io/v1/namespaces/openshift-cluster-node-tuning-operator/profiles?resourceVersion=1995140": dial tcp 172.22.0.1:443: i/o timeout
      2025-09-22T18:53:21.306916507Z E0922 18:53:21.306904   12661 reflector.go:158] "Unhandled Error" err="github.com/openshift/cluster-node-tuning-operator/pkg/generated/informers/externalversions/factory.go:140: Failed to watch *v1.Profile: failed to list *v1.Profile: Get \"https://172.22.0.1:443/apis/tuned.openshift.io/v1/namespaces/openshift-cluster-node-tuning-operator/profiles?resourceVersion=1995140\": dial tcp 172.22.0.1:443: i/o timeout" logger="UnhandledError"
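
      The log shows the pod's informer repeatedly failing to re-list Profile objects from the apiserver service IP (172.22.0.1:443) without ever recovering. One hedged check is whether the Profile objects are still listable from outside the pod, which would point at a stuck client rather than an apiserver outage (the resource path is taken from the log URL above):

      $ oc get profiles.tuned.openshift.io -n openshift-cluster-node-tuning-operator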

       

       

      The only way to fix this was to delete the problematic "tuned-98mr7" pod so that it was recreated.
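
      A sketch of that workaround (pod name and namespace are taken from the output above; the tuned DaemonSet recreates the pod automatically):

      $ oc delete pod -n openshift-cluster-node-tuning-operator tuned-98mr7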

      Version-Release number of selected component (if applicable):

          OCP 4.18.22

      How reproducible:

          Not 100% reproducible.

      Steps to Reproduce:

          1. Restart the OCP master nodes one by one, waiting for the cluster to return to a healthy state between each restart (see the health-check sketch after this list).
          2. Check whether any PerformanceProfile has entered a degraded state.
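
          A minimal sketch of the wait-for-health check between restarts (assuming cluster-admin access; the exact health criteria used by the test harness are not specified in this report):

          # Wait until every node reports Ready again.
          $ oc wait --for=condition=Ready node --all --timeout=30m
          # Confirm no cluster operator is degraded or still progressing.
          $ oc get clusteroperators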

      Actual results:

          The PerformanceProfile may enter a degraded state, and the only way to recover is to delete the affected tuned pod so that it is recreated.

      Additional info (about testing method):

      > How did you reboot the hosts?
      
      Node restarts are performed through Redfish using a curl command.
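
      A hedged sketch of such a Redfish reset call (the BMC address, credentials, and system ID are placeholders; the exact command the customer used is not included in this report):

      $ curl -k -u <user>:<password> -X POST \
          https://<bmc-address>/redfish/v1/Systems/<system-id>/Actions/ComputerSystem.Reset \
          -H 'Content-Type: application/json' \
          -d '{"ResetType": "ForceRestart"}'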
      
      > Did you wait for each master node to be up and the cluster to be healthy again before moving to the next one?
      
      Yes. The test case first monitors the master node's status through ping and Redfish; once the node is up and running, it checks with the oc command "oc get nodes". Only after the previously restarted master is visible in the node list is the next one restarted.
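
      A minimal sketch of that monitoring loop (hostnames are placeholders; the customer's actual test harness is not part of this report):

      # Wait for the rebooted master to answer ping, then for it to reappear in the node list.
      $ until ping -c1 -W2 <master-host> >/dev/null; do sleep 10; done
      $ until oc get node <master-node> >/dev/null 2>&1; do sleep 10; done
      $ oc get nodes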

       

              Martin Sivak (msivak@redhat.com)
              Flavio Piccioni (rh-ee-fpiccion)
              Liquan Cui
              Votes: 1
              Watchers: 5