OpenShift Bugs / OCPBUGS-62277

PerformanceProfile degraded after master nodes reboot in MNO


    • Type: Bug
    • Resolution: Obsolete
    • Priority: Major
    • Affects Version: 4.18
    • Component: Node Tuning Operator
    • Quality / Stability / Reliability
    • Labels: contract-priority

      During platform resiliency testing (restarting the master nodes one by one; further details below), the customer found the worker PerformanceProfile in a degraded state:

       

      $ oc get performanceprofile worker-profile -o yaml
      [..]
        - lastHeartbeatTime: "2025-09-22T18:52:31Z"
          lastTransitionTime: "2025-09-22T18:52:31Z"
          message: 'rpc error: code = Unavailable desc = keepalive ping failed to receive
            ACK within timeout'
          reason: ComponentCreationFailed
          status: "True"
          type: Degraded
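
      A quick way to surface just this condition is a JSONPath query (a sketch using standard oc JSONPath filtering; the profile name is taken from the output above):

      $ oc get performanceprofile worker-profile \
          -o jsonpath='{.status.conditions[?(@.type=="Degraded")].message}'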

       

      $ oc get pods -n openshift-cluster-node-tuning-operator -o wide
      NAME                                            READY   STATUS    RESTARTS   AGE   IP             NODE      NOMINATED NODE   READINESS GATES
      cluster-node-tuning-operator-759544dd89-qq5w2   1/1     Running   1          21h   172.21.6.84    master1   <none>           <none>
      tuned-2hhhj                                     1/1     Running   3          1d    10.17.96.151   master1   <none>           <none>
      tuned-5gkq8                                     1/1     Running   3          1d    10.17.96.153   worker0   <none>           <none>
      tuned-98mr7                                     1/1     Running   1          1d    10.17.96.154   worker1   <none>           <none>
      tuned-drdkl                                     1/1     Running   4          1d    10.17.96.150   master0   <none>           <none>
      tuned-nfdh8                                     1/1     Running   5          1d    10.17.96.152   master2   <none>           <none> 

 

      $ oc logs -n openshift-cluster-node-tuning-operator tuned-98mr7 
      
      2025-09-22T18:30:27.704821096Z W0922 18:30:27.704786   12661 reflector.go:484] github.com/openshift/cluster-node-tuning-operator/pkg/generated/informers/externalversions/factory.go:140: watch of *v1.Profile ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
      2025-09-22T18:52:50.139863540Z W0922 18:52:50.139824   12661 reflector.go:484] github.com/openshift/cluster-node-tuning-operator/pkg/generated/informers/externalversions/factory.go:140: watch of *v1.Profile ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
      2025-09-22T18:53:21.306887498Z W0922 18:53:21.306844   12661 reflector.go:561] github.com/openshift/cluster-node-tuning-operator/pkg/generated/informers/externalversions/factory.go:140: failed to list *v1.Profile: Get "https://172.22.0.1:443/apis/tuned.openshift.io/v1/namespaces/openshift-cluster-node-tuning-operator/profiles?resourceVersion=1995140": dial tcp 172.22.0.1:443: i/o timeout
      2025-09-22T18:53:21.306916507Z E0922 18:53:21.306904   12661 reflector.go:158] "Unhandled Error" err="github.com/openshift/cluster-node-tuning-operator/pkg/generated/informers/externalversions/factory.go:140: Failed to watch *v1.Profile: failed to list *v1.Profile: Get \"https://172.22.0.1:443/apis/tuned.openshift.io/v1/namespaces/openshift-cluster-node-tuning-operator/profiles?resourceVersion=1995140\": dial tcp 172.22.0.1:443: i/o timeout" logger="UnhandledError"
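
      The log shows the pod's informer repeatedly failing to re-list Profile objects from the apiserver service IP (172.22.0.1:443) without ever recovering. One hedged check is whether the Profile objects are still listable from outside the pod, which would point at a stuck client rather than an apiserver outage (the resource path is taken from the log URL above):

      $ oc get profiles.tuned.openshift.io -n openshift-cluster-node-tuning-operator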

       

       

      The only way to fix this was to delete the problematic "tuned-98mr7" pod so that it was recreated.
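
      A sketch of that workaround (pod name and namespace are taken from the output above; the tuned DaemonSet recreates the pod automatically):

      $ oc delete pod -n openshift-cluster-node-tuning-operator tuned-98mr7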

      Version-Release number of selected component (if applicable):

          OCP 4.18.22

      How reproducible:

          Not 100% reproducible.

      Steps to Reproduce:

          1. Restart the OCP master nodes one by one, waiting for the cluster to return to a healthy state between each restart (see the health-check sketch after this list).
          2. Check whether any PerformanceProfile has entered a degraded state.
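
          A minimal sketch of the wait-for-health check between restarts (assuming cluster-admin access; the exact health criteria used by the test harness are not specified in this report):

          # Wait until every node reports Ready again.
          $ oc wait --for=condition=Ready node --all --timeout=30m
          # Confirm no cluster operator is degraded or still progressing.
          $ oc get clusteroperators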

      Actual results:

          The PerformanceProfile may enter a degraded state, and the only way to recover is to delete the affected tuned pod so that it is recreated.

      Additional info (about testing method):

      > How did you reboot the hosts?
      
      Node restarts are performed through Redfish using a curl command.
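
      A hedged sketch of such a Redfish reset call (the BMC address, credentials, and system ID are placeholders; the exact command the customer used is not included in this report):

      $ curl -k -u <user>:<password> -X POST \
          https://<bmc-address>/redfish/v1/Systems/<system-id>/Actions/ComputerSystem.Reset \
          -H 'Content-Type: application/json' \
          -d '{"ResetType": "ForceRestart"}'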
      
      > Did you wait for each master node to be up and the cluster to be healthy again before moving to the next one?
      
      Yes. The test case first monitors the master node's status through ping and Redfish; once the node is up and running, it checks with the oc command "oc get nodes". Only after the previously restarted master is visible in the node list is the next one restarted.
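
      A minimal sketch of that monitoring loop (hostnames are placeholders; the customer's actual test harness is not part of this report):

      # Wait for the rebooted master to answer ping, then for it to reappear in the node list.
      $ until ping -c1 -W2 <master-host> >/dev/null; do sleep 10; done
      $ until oc get node <master-node> >/dev/null 2>&1; do sleep 10; done
      $ oc get nodes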

       

              Martin Sivak (msivak@redhat.com)
              Flavio Piccioni (rh-ee-fpiccion)
              Liquan Cui
              Votes: 1
              Watchers: 5