Bug | Resolution: Obsolete | Major | 4.18 | Quality / Stability / Reliability | contract-priority
Description of problem:

During the platform's resiliency testing (restarting the master nodes one by one; further details below), the customer found the worker PerformanceProfile in a degraded state:
$ oc get performanceprofile worker-profile -o yaml
[..]
  - lastHeartbeatTime: "2025-09-22T18:52:31Z"
    lastTransitionTime: "2025-09-22T18:52:31Z"
    message: 'rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout'
    reason: ComponentCreationFailed
    status: "True"
    type: Degraded
$ oc get pods -n openshift-cluster-node-tuning-operator -o wide
NAME                                            READY   STATUS    RESTARTS   AGE   IP             NODE      NOMINATED NODE   READINESS GATES
cluster-node-tuning-operator-759544dd89-qq5w2   1/1     Running   1          21h   172.21.6.84    master1   <none>           <none>
tuned-2hhhj                                     1/1     Running   3          1d    10.17.96.151   master1   <none>           <none>
tuned-5gkq8                                     1/1     Running   3          1d    10.17.96.153   worker0   <none>           <none>
tuned-98mr7                                     1/1     Running   1          1d    10.17.96.154   worker1   <none>           <none>
tuned-drdkl                                     1/1     Running   4          1d    10.17.96.150   master0   <none>           <none>
tuned-nfdh8                                     1/1     Running   5          1d    10.17.96.152   master2   <none>           <none>
$ oc logs -n openshift-cluster-node-tuning-operator tuned-98mr7
2025-09-22T18:30:27.704821096Z W0922 18:30:27.704786 12661 reflector.go:484] github.com/openshift/cluster-node-tuning-operator/pkg/generated/informers/externalversions/factory.go:140: watch of *v1.Profile ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
2025-09-22T18:52:50.139863540Z W0922 18:52:50.139824 12661 reflector.go:484] github.com/openshift/cluster-node-tuning-operator/pkg/generated/informers/externalversions/factory.go:140: watch of *v1.Profile ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
2025-09-22T18:53:21.306887498Z W0922 18:53:21.306844 12661 reflector.go:561] github.com/openshift/cluster-node-tuning-operator/pkg/generated/informers/externalversions/factory.go:140: failed to list *v1.Profile: Get "https://172.22.0.1:443/apis/tuned.openshift.io/v1/namespaces/openshift-cluster-node-tuning-operator/profiles?resourceVersion=1995140": dial tcp 172.22.0.1:443: i/o timeout
2025-09-22T18:53:21.306916507Z E0922 18:53:21.306904 12661 reflector.go:158] "Unhandled Error" err="github.com/openshift/cluster-node-tuning-operator/pkg/generated/informers/externalversions/factory.go:140: Failed to watch *v1.Profile: failed to list *v1.Profile: Get \"https://172.22.0.1:443/apis/tuned.openshift.io/v1/namespaces/openshift-cluster-node-tuning-operator/profiles?resourceVersion=1995140\": dial tcp 172.22.0.1:443: i/o timeout" logger="UnhandledError"
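For additional NTO-side context, a minimal diagnostic sketch (these are standard Node Tuning Operator checks; output columns may differ between releases and were not captured in this report):

$ # Per-node TuneD profile status as seen by the operator.
$ oc get profile -n openshift-cluster-node-tuning-operator
$ # Operator-side view of the same reconciliation errors.
$ oc logs -n openshift-cluster-node-tuning-operator deployment/cluster-node-tuning-operator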
The only way to fix this was to delete (restart) the problematic "tuned-98mr7" pod.
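A minimal sketch of that workaround (the pod name is the one from this occurrence; substitute the tuned pod running on the degraded node):

$ # Delete the stuck tuned pod; the tuned DaemonSet schedules a replacement on the same node.
$ oc delete pod tuned-98mr7 -n openshift-cluster-node-tuning-operator
$ # Confirm the replacement pod is Running and the profile leaves the Degraded state.
$ oc get pods -n openshift-cluster-node-tuning-operator -o wide
$ oc get performanceprofile worker-profile -o yaml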
Version-Release number of selected component (if applicable):
OCP 4.18.22
How reproducible:
Not 100% reproducible.
Steps to Reproduce:
1. Restart the OCP master nodes one by one, waiting for the cluster to return to a healthy state between each restart.
2. Verify whether any PerformanceProfile has gone into a degraded state (a check sketch follows this list).
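A possible check for step 2 (the field paths follow the Degraded condition shown in the description; this is a sketch, not the exact verification used in the test case):

$ # Print each PerformanceProfile and the status of its Degraded condition.
$ oc get performanceprofile -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Degraded")].status}{"\n"}{end}'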
Actual results:
The PerformanceProfile may go into a degraded state, and the only way to recover is to restart the affected tuned pod.
Additional info (about testing method):
> How did you reboot the hosts?
Node restarts are done through Redfish with a curl command.

> Did you wait for each master node to be up and the cluster to be healthy again before moving on to the next one?
Yes. The test case monitors the master node status first through ping and Redfish, and once the node is up and running, through the oc command "oc get nodes". Only after the previously restarted master is visible in the node list is the next one restarted.
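For reference, a minimal sketch of the reboot-and-wait loop described above (the Redfish system path, BMC address, credentials, and node variables are assumptions; the exact endpoint is vendor-specific and the customer's actual command was not captured here):

$ # Force-restart a master via its BMC using the generic Redfish ComputerSystem.Reset action.
$ curl -k -u "$BMC_USER:$BMC_PASS" -X POST \
    -H "Content-Type: application/json" \
    -d '{"ResetType": "ForceRestart"}' \
    "https://$BMC_HOST/redfish/v1/Systems/1/Actions/ComputerSystem.Reset"
$ # Wait for the node to answer ping, then for the API to report it Ready, before restarting the next master.
$ until ping -c1 -W2 "$NODE_IP" >/dev/null; do sleep 10; done
$ until oc get node "$NODE_NAME" | grep -qw Ready; do sleep 30; done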
Is related to: OCPBUGS-62991 Performance Profile got degraded on MNO deployments (New)