-
Bug
-
Resolution: Duplicate
-
Major
-
None
-
4.12.z
-
Important
-
None
-
False
-
-
Description of problem:
3 profiles are degraded in the cluster:
~~~
$ oc get profile -n openshift-cluster-node-tuning-operator
NAME                 TUNED                 APPLIED   DEGRADED   AGE
svg1ocpims1-wrk-4    r750-28c-std1-tuned   True      True       99d
svg1ocpims1-wrk-12   r750-28c-std1-tuned   True      True       45d
svg1ocpims1-wrk-15   r750-28c-std1-tuned   True      True       98d
~~~
whereas 19 nodes belong to the same MCP:
~~~
$ oc get mcp
NAME            CONFIG                                                    UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
r750-28c-std1   rendered-r750-28c-std1-f2f4b895b5e5ea5558424b25b1dd4e46   True      False      False      19             19                  19                    0                      160d
~~~
Applied Tuned:
~~~
name: r750-28c-std1-tuned
recommend:
- machineConfigLabels:
    machineconfiguration.openshift.io/role: r750-28c-std1
  priority: 10
  profile: r750-28c-std1-tuned
~~~
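To re-check which profiles are degraded without reading the table by eye, a small filter over the `oc get profile` output may help. This is a minimal sketch: the function name `degraded_profiles` is hypothetical, and it assumes the column order NAME, TUNED, APPLIED, DEGRADED, AGE shown in the output above.

```shell
# Hypothetical helper: print the names of profiles whose DEGRADED column is "True".
# Reads `oc get profile` table output on stdin and skips the header row.
# Assumes the column order NAME TUNED APPLIED DEGRADED AGE.
degraded_profiles() {
    awk 'NR > 1 && $4 == "True" { print $1 }'
}

# Example (requires cluster access):
#   oc get profile -n openshift-cluster-node-tuning-operator | degraded_profiles
```

Keeping the filter separate from the `oc` call means it can also be run against a saved must-gather listing.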
Actual results:
3 TuneD profiles are degraded in the cluster.
Expected results:
All TuneD profiles should be healthy (not degraded) in the cluster.
Additional info:
From the shared must-gather report, TuneD pod logs from the pods running on the nodes whose profiles are degraded:
~~~
2024-07-11T16:07:59.987436556+02:00 2024-07-11 14:07:59,987 ERROR tuned.plugins.plugin_scheduler: Failed to set affinity of PID 1473578 to '[0, 1, 2, 3, 56, 57, 58, 59]': [Errno 22] Invalid argument
2024-07-11T16:08:00.372042362+02:00 E0711 14:08:00.371999 1471548 controller.go:880] unable to sync(daemon/) requeued (5)
~~~
The profile YAML shows the following error:
~~~
- lastTransitionTime: "2024-07-11T10:18:49Z"
  message: 'TuneD daemon issued one or more error message(s) during profile application. TuneD stderr: ERROR tuned.plugins.plugin_scheduler: Failed to set affinity of PID 1552564 to ''[0, 1, 2, 3, 56, 57, 58, 59]'': [Errno 22] Invalid argument'
  reason: TunedError
  status: "True"
  type: Degraded
~~~
The same PID (1552564) appears in the CRI-O logs from the shared sosreport (taken before the node reboot), where it is reported as a defunct process:
~~~
$ cat 0010-sosreport-svg1ocpims1-wrk-12-2024-07-12-dfcecid.tar.xz/sosreport-svg1ocpims1-wrk-12-2024-07-12-dfcecid/sos_commands/crio/journalctl_--no-pager_--unit_crio | grep -i 1552564
Jul 05 17:40:41 svg1ocpims1-wrk-12 crio[18668]: time="2024-07-05 17:40:41.401788209+02:00" level=warning msg="Found defunct process with PID 1552564 (monit-systemfd-)"
Jul 05 17:40:44 svg1ocpims1-wrk-12 crio[18668]: time="2024-07-05 17:40:44.168513830+02:00" level=warning msg="Found defunct process with PID 1552564 (monit-systemfd-)"
~~~
The customer rebooted node svg1ocpims1-wrk-12, but defunct processes were still observed afterwards.
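To repeat this correlation on the other degraded nodes, the PIDs named in the TuneD errors can be extracted and compared against the defunct processes listed by `ps`. The sketch below is an illustration only: the helper names `affinity_failed_pids` and `defunct_only` are hypothetical, and it assumes the log message format shown above and `ps -eo pid,stat,comm` style input. A match would only show that the PID is a zombie; it would not by itself confirm the root cause.

```shell
# Hypothetical helper: extract the unique PIDs named in TuneD
# "Failed to set affinity" errors. Reads log text on stdin.
affinity_failed_pids() {
    sed -n 's/.*Failed to set affinity of PID \([0-9][0-9]*\) to.*/\1/p' | sort -un
}

# Hypothetical helper: keep only defunct (zombie) processes.
# Reads `ps -eo pid,stat,comm` output on stdin, skips the header row,
# and prints the PID and command name of entries whose state starts with Z.
defunct_only() {
    awk 'NR > 1 && $2 ~ /^Z/ { print $1, $3 }'
}

# Example (on the node, or against saved log and ps output):
#   affinity_failed_pids < tuned-pod.log
#   ps -eo pid,stat,comm | defunct_only
```

Here `tuned-pod.log` stands for a saved copy of the TuneD pod log; it is a placeholder file name, not a path from the must-gather.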