Type: Bug
Resolution: Duplicate
Priority: Major
Fix Version/s: None
Affects Version/s: 4.12.z
Component/s: Node Tuning Operator
Labels:
- node
- tuned

Severity:
Important
Regression:
None
Blocked:
False
Blocked Reason:

None
RH Private Keywords:

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Impact Score:
PX Priority Data:

Description of problem:

Three profiles are degraded in the cluster:
~~~
$ oc get profile -n openshift-cluster-node-tuning-operator
NAME                 TUNED                         APPLIED   DEGRADED   AGE
svg1ocpims1-wrk-4    r750-28c-std1-tuned           True      True       99d
svg1ocpims1-wrk-12   r750-28c-std1-tuned           True      True       45d
svg1ocpims1-wrk-15   r750-28c-std1-tuned           True      True       98d
~~~
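To pick out only the degraded profiles from a longer listing, the table can be filtered on its DEGRADED column. The following is a minimal sketch, not part of the original report: the helper name `degraded_profiles` is hypothetical, and it assumes the default five-column output shown above (NAME, TUNED, APPLIED, DEGRADED, AGE).

```shell
# Hypothetical helper: print the NAME of every row whose DEGRADED column
# (4th field) is "True". Reads `oc get profile` table output on stdin and
# skips the header row.
degraded_profiles() {
  awk 'NR > 1 && $4 == "True" { print $1 }'
}

# Usage against a live cluster (assumes the default column layout):
#   oc get profile -n openshift-cluster-node-tuning-operator | degraded_profiles
```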

The same MCP contains 19 nodes, none of which are degraded at the MCP level:
~~~
$ oc get mcp 
NAME                    CONFIG                                                            UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
r750-28c-std1           rendered-r750-28c-std1-f2f4b895b5e5ea5558424b25b1dd4e46           True      False      False      19             19                  19                    0                      160d
~~~

Applied Tuned:
~~~
    name: r750-28c-std1-tuned
  recommend:
  - machineConfigLabels:
      machineconfiguration.openshift.io/role: r750-28c-std1
    priority: 10
    profile: r750-28c-std1-tuned
~~~

Actual results:

Three Tuned profiles are degraded in the cluster.

Expected results:

All Tuned profiles should be healthy (not degraded) in the cluster.

Additional info:

From the shared must-gather report:

TuneD pod logs from the nodes whose profiles are degraded:
~~~
2024-07-11T16:07:59.987436556+02:00 2024-07-11 14:07:59,987 ERROR    tuned.plugins.plugin_scheduler: Failed to set affinity of PID 1473578 to '[0, 1, 2, 3, 56, 57, 58, 59]': [Errno 22] Invalid argument
2024-07-11T16:08:00.372042362+02:00 E0711 14:08:00.371999 1471548 controller.go:880] unable to sync(daemon/) requeued (5)
~~~

The Profile YAML reports the following error:
~~~
  - lastTransitionTime: "2024-07-11T10:18:49Z"
    message: 'TuneD daemon issued one or more error message(s) during profile application.
      TuneD stderr:  ERROR    tuned.plugins.plugin_scheduler: Failed to set affinity
      of PID 1552564 to ''[0, 1, 2, 3, 56, 57, 58, 59]'': [Errno 22] Invalid argument'
    reason: TunedError
    status: "True"
    type: Degraded
~~~

The same PID (1552564) appears in the CRI-O logs from the shared sosreport (collected before the node reboot), where it belongs to a defunct process:
~~~
$ cat 0010-sosreport-svg1ocpims1-wrk-12-2024-07-12-dfcecid.tar.xz/sosreport-svg1ocpims1-wrk-12-2024-07-12-dfcecid/sos_commands/crio/journalctl_--no-pager_--unit_crio | grep -i 1552564
Jul 05 17:40:41 svg1ocpims1-wrk-12 crio[18668]: time="2024-07-05 17:40:41.401788209+02:00" level=warning msg="Found defunct process with PID 1552564 (monit-systemfd-)"
Jul 05 17:40:44 svg1ocpims1-wrk-12 crio[18668]: time="2024-07-05 17:40:44.168513830+02:00" level=warning msg="Found defunct process with PID 1552564 (monit-systemfd-)"
~~~
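The manual cross-check above (take a PID from a TuneD affinity error, then look for it in the CRI-O journal) can be sketched as two small filters. This is an illustration only: the function names `affinity_error_pids` and `defunct_pids_in` are hypothetical, and the patterns assume the exact message wording shown in the log excerpts in this report, with each TuneD error on a single line as in the pod logs.

```shell
# Hypothetical helper: extract the unique PIDs named in TuneD
# "Failed to set affinity of PID N" errors. Reads TuneD pod logs on stdin.
affinity_error_pids() {
  sed -n 's/.*Failed to set affinity of PID \([0-9][0-9]*\) .*/\1/p' | sort -un
}

# Hypothetical helper: of the PIDs given on stdin (one per line), print those
# that the CRI-O journal file "$1" reports as defunct.
defunct_pids_in() {
  while read -r pid; do
    if grep -q "Found defunct process with PID $pid " "$1"; then
      echo "$pid"
    fi
  done
}

# Usage (file names are placeholders):
#   affinity_error_pids < tuned-pod.log | defunct_pids_in crio-journal.log
```

A PID printed by this pipeline was both rejected by the TuneD scheduler plugin and flagged as defunct by CRI-O. That matches the observation in this report but does not by itself establish the cause of the `[Errno 22]` failure.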

The customer rebooted node svg1ocpims1-wrk-12, but defunct processes were still observed afterwards.

Assignee: Jiri Mencak

Reporter: Suruchi Dharma

QA Contact: Liquan Cui

Votes: 0

Watchers: 4

Created: 2024/07/18 5:58 PM

Updated: 2024/07/22 10:24 AM

Resolved: 2024/07/22 10:24 AM
