OpenShift Bugs / OCPBUGS-37273

Tuned profiles are degraded in RHOCP4

    • Bug
    • Resolution: Duplicate
    • Major
    • None
    • 4.12.z
    • Node Tuning Operator
    • Important
    • None
    • False

      Description of problem:

      Three profiles are degraded in the cluster:
      ~~~
      $ oc get profile -n openshift-cluster-node-tuning-operator
      NAME                 TUNED                         APPLIED   DEGRADED   AGE
      svg1ocpims1-wrk-4    r750-28c-std1-tuned           True      True       99d
      svg1ocpims1-wrk-12   r750-28c-std1-tuned           True      True       45d
      svg1ocpims1-wrk-15   r750-28c-std1-tuned           True      True       98d
      ~~~
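To pick the degraded nodes out of a longer `oc get profile` listing, the table can be filtered on its DEGRADED column. The following is a minimal sketch, not part of the original report: the function name `degraded_profiles` is hypothetical, and it assumes the plain-text table format shown above with no empty columns.

```python
def degraded_profiles(table_text):
    """Return the NAME of every row whose DEGRADED column is 'True'.

    Assumes whitespace-separated columns with a header row, as printed
    by `oc get profile`, and no empty cells.
    """
    lines = [ln for ln in table_text.splitlines() if ln.strip()]
    header = lines[0].split()
    name_idx = header.index("NAME")
    degraded_idx = header.index("DEGRADED")
    names = []
    for line in lines[1:]:
        cols = line.split()
        if len(cols) > degraded_idx and cols[degraded_idx] == "True":
            names.append(cols[name_idx])
    return names


# Rows copied from the output in this report.
SAMPLE = """\
NAME                 TUNED                 APPLIED   DEGRADED   AGE
svg1ocpims1-wrk-4    r750-28c-std1-tuned   True      True       99d
svg1ocpims1-wrk-12   r750-28c-std1-tuned   True      True       45d
svg1ocpims1-wrk-15   r750-28c-std1-tuned   True      True       98d
"""

if __name__ == "__main__":
    print(degraded_profiles(SAMPLE))
```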
      
      whereas 19 nodes belong to the same MachineConfigPool (MCP):
      ~~~
      $ oc get mcp 
      NAME                    CONFIG                                                            UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
      r750-28c-std1           rendered-r750-28c-std1-f2f4b895b5e5ea5558424b25b1dd4e46           True      False      False      19             19                  19                    0                      160d
      ~~~
      
      Applied Tuned resource (excerpt):
      ~~~
          name: r750-28c-std1-tuned
        recommend:
        - machineConfigLabels:
            machineconfiguration.openshift.io/role: r750-28c-std1
          priority: 10
          profile: r750-28c-std1-tuned
      ~~~

      Actual results:

      Three Tuned profiles are degraded in the cluster.

      Expected results:

      All Tuned profiles in the cluster should be healthy (DEGRADED=False).

      Additional info:

      From the shared must-gather, TuneD pod logs from the nodes whose profiles are degraded:
      ~~~
      2024-07-11T16:07:59.987436556+02:00 2024-07-11 14:07:59,987 ERROR    tuned.plugins.plugin_scheduler: Failed to set affinity of PID 1473578 to '[0, 1, 2, 3, 56, 57, 58, 59]': [Errno 22] Invalid argument
      2024-07-11T16:08:00.372042362+02:00 E0711 14:08:00.371999 1471548 controller.go:880] unable to sync(daemon/) requeued (5)
      ~~~
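For correlating these errors with other logs, the PID and the requested CPU list can be pulled out of each TuneD log line. This is a rough sketch with a hypothetical helper name (`parse_affinity_error`); the pattern is written against the single log line quoted above and is an assumption about the message format, not something taken from the TuneD source.

```python
import re

# Pattern modelled on the error line quoted in this report.
AFFINITY_ERROR = re.compile(
    r"Failed to set affinity of PID (?P<pid>\d+) to '\[(?P<cpus>[\d,\s]*)\]'"
)


def parse_affinity_error(line):
    """Return (pid, [cpu, ...]) for a matching log line, else None."""
    match = AFFINITY_ERROR.search(line)
    if match is None:
        return None
    cpus = [int(c) for c in match.group("cpus").split(",") if c.strip()]
    return int(match.group("pid")), cpus


if __name__ == "__main__":
    log_line = (
        "2024-07-11 14:07:59,987 ERROR    tuned.plugins.plugin_scheduler: "
        "Failed to set affinity of PID 1473578 to "
        "'[0, 1, 2, 3, 56, 57, 58, 59]': [Errno 22] Invalid argument"
    )
    print(parse_affinity_error(log_line))
```

The pattern will not match the wrapped message in the Profile YAML as-is, because the YAML doubles the single quotes and breaks the sentence across lines.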
      
      The Profile YAML reports the following error:
      ~~~
        - lastTransitionTime: "2024-07-11T10:18:49Z"
          message: 'TuneD daemon issued one or more error message(s) during profile application.
            TuneD stderr:  ERROR    tuned.plugins.plugin_scheduler: Failed to set affinity
            of PID 1552564 to ''[0, 1, 2, 3, 56, 57, 58, 59]'': [Errno 22] Invalid argument'
          reason: TunedError
          status: "True"
          type: Degraded
      ~~~
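Per the sched_setaffinity(2) man page, one cause of EINVAL is a mask that contains no CPU the target is permitted to run on. This report does not establish that this is what happened here; the sketch below only shows how the requested list could be compared with a task's `Cpus_allowed_list` from `/proc/<pid>/status`. The helper names are hypothetical, and `Cpus_allowed_list` reflects the task's current affinity, which may be narrower than its cpuset.

```python
def parse_cpu_list(text):
    """Expand a kernel-style CPU list such as '0-3,56-59' into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def allowed_cpus(pid):
    """Read Cpus_allowed_list for a PID from /proc/<pid>/status (Linux only)."""
    with open(f"/proc/{pid}/status") as fh:
        for line in fh:
            if line.startswith("Cpus_allowed_list:"):
                return parse_cpu_list(line.split(":", 1)[1])
    return set()


def affinity_overlap(requested, allowed):
    """Return the sorted CPUs present in both sets; empty means no overlap."""
    return sorted(set(requested) & set(allowed))


if __name__ == "__main__":
    requested = [0, 1, 2, 3, 56, 57, 58, 59]  # list from the TuneD error
    # "4-55" is a made-up allowed list, used only to show an empty overlap.
    print(affinity_overlap(requested, parse_cpu_list("4-55")))
```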
      
      The same PID (1552564) also appears in the CRI-O logs from the shared sosreport (collected before the node reboot), where it belongs to a defunct process:
      ~~~
      $ cat 0010-sosreport-svg1ocpims1-wrk-12-2024-07-12-dfcecid.tar.xz/sosreport-svg1ocpims1-wrk-12-2024-07-12-dfcecid/sos_commands/crio/journalctl_--no-pager_--unit_crio | grep -i 1552564
      Jul 05 17:40:41 svg1ocpims1-wrk-12 crio[18668]: time="2024-07-05 17:40:41.401788209+02:00" level=warning msg="Found defunct process with PID 1552564 (monit-systemfd-)"
      Jul 05 17:40:44 svg1ocpims1-wrk-12 crio[18668]: time="2024-07-05 17:40:44.168513830+02:00" level=warning msg="Found defunct process with PID 1552564 (monit-systemfd-)"
      ~~~
      
      The customer rebooted the node svg1ocpims1-wrk-12, but defunct processes were still observed.
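To confirm on a live node whether a given PID is defunct, its state can be read from `/proc/<pid>/stat`, where a zombie is reported as `Z`. A minimal sketch follows; the function names are hypothetical, and the sample stat line in the usage block is made up for illustration, not taken from the sosreport.

```python
def state_from_stat(stat_line):
    """Return the one-letter process state from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split after the LAST closing parenthesis.
    rest = stat_line.rsplit(")", 1)[1]
    return rest.split()[0]


def is_defunct(pid):
    """True if the PID exists and is in state 'Z' (zombie). Linux only."""
    try:
        with open(f"/proc/{pid}/stat") as fh:
            return state_from_stat(fh.read()) == "Z"
    except (FileNotFoundError, ProcessLookupError):
        # The process no longer exists (or exited while being read).
        return False


if __name__ == "__main__":
    # Hypothetical stat line, truncated to the first few fields.
    sample = "1552564 (monit-systemfd-) Z 1 1552564 1552564 0 -1"
    print(state_from_stat(sample))
```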

              jmencak Jiri Mencak
              rhn-support-sdharma Suruchi Dharma
              Liquan Cui Liquan Cui