OCPBUGS-47729
OCP 4.17+ | Node Tuning Operator becomes degraded when creating a PerformanceProfile, with "Profiles with bootcmdline conflict" error message


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Undefined
    • Affects Version/s: 4.17.z, 4.18.z, 4.19.z
    • Component: Node Tuning Operator
      Description of problem:

      OCP 4.17+ | The Node Tuning Operator becomes degraded when creating a PerformanceProfile, with the error message "Profiles with bootcmdline conflict".

      Version-Release number of selected component (if applicable):

      Observed in recent nightly builds of OCP 4.17, 4.18, and 4.19.

      How reproducible:

      Intermittent; the issue does not appear on every deployment.

      Steps to Reproduce:

          1. Deploy OCP using the IPI installer. In all observed cases, the cluster nodes were virtual machines managed by libvirt.
          2. Create the following PerformanceProfile:
      
      ---
      kind: PerformanceProfile
      apiVersion: "performance.openshift.io/v2"
      metadata:
        name: libvirt-profile
      spec:
        cpu:
          # CPUs 0-5 reserved for housekeeping, CPUs 6-23 isolated for workloads
          isolated: "6-23"
          reserved: "0-5"
        hugepages:
          pages:
            # 2 x 1GiB and 1000 x 2MiB hugepages, all on NUMA node 0
            - size: "1G"
              count: 2
              node: 0
            - size: "2M"
              count: 1000
              node: 0
        numa:
          topologyPolicy: "restricted"
        nodeSelector:
          node-role.kubernetes.io/worker: ""
      ...
      
          3. Check the node-tuning operator status (see the example commands below).
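
      For reference, a minimal sketch of how step 3 can be done with standard oc commands, assuming the default openshift-cluster-node-tuning-operator namespace:

      # Overall ClusterOperator state (Available/Progressing/Degraded)
      oc get clusteroperator node-tuning

      # Inspect the Degraded condition and its message in detail
      oc get clusteroperator node-tuning \
        -o jsonpath='{.status.conditions[?(@.type=="Degraded")]}{"\n"}'

      # Per-node TuneD Profile objects managed by the operator
      oc get profiles.tuned.openshift.io -n openshift-cluster-node-tuning-operator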

      Actual results:

      The node-tuning operator is degraded, reporting a message such as "x/6 Profiles with bootcmdline conflict" (where x has been 1 or 2 in the failures detected so far).
      
      From the must-gather and the cluster logs, we could see that:
      
      - All cluster nodes were in Ready status
      - All Tuned Profile resources reported a healthy, non-degraded status with the message "TuneD profile applied"
      - There were no issues with MachineConfigPool (MCP) or MachineConfig (MC) resources
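
      These observations can be cross-checked on a live cluster with commands along these lines (a sketch, assuming default resource names):

      # All cluster nodes should be Ready
      oc get nodes

      # Per-node TuneD Profile conditions (Applied/Degraded) and messages
      oc describe profiles.tuned.openshift.io -n openshift-cluster-node-tuning-operator

      # MachineConfigPools and MachineConfigs should show no degraded or failed state
      oc get mcp
      oc get mc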
      
      These are the log lines we extracted from the Node Tuning Operator in one of the failing cases:
      
      2024-11-06T04:18:02.007091488Z E1106 04:18:02.007064       1 controller.go:788] not all 3 Nodes in MCP worker agree on bootcmdline: skew_tick=1 tsc=reliable rcupdate.rcu_normal_after_boot=1 nohz=on rcu_nocbs=6-23 tuned.non_isolcpus=0000003f systemd.cpu_affinity=0,1,2,3,4,5 intel_iommu=on iommu=pt isolcpus=managed_irq,6-23 nohz_full=6-23 tsc=reliable nosoftlockup nmi_watchdog=0 mce=off skew_tick=1 rcutree.kthread_prio=11 intel_pstate=active
      2024-11-06T04:18:02.029044869Z E1106 04:18:02.028479       1 controller.go:788] not all 3 Nodes in MCP worker agree on bootcmdline: >4096active
      2024-11-06T04:18:02.029631427Z I1106 04:18:02.029607       1 status.go:313] 1/6 Profiles with bootcmdline conflict
      2024-11-06T04:18:02.046243885Z I1106 04:18:02.046205       1 status.go:313] 1/6 Profiles with bootcmdline conflict
      2024-11-06T04:18:02.050944222Z E1106 04:18:02.050899       1 status.go:70] unable to update ClusterOperator: Operation cannot be fulfilled on clusteroperators.config.openshift.io "node-tuning": the object has been modified; please apply your changes to the latest version and try again
      2024-11-06T04:18:02.050944222Z E1106 04:18:02.050925       1 controller.go:198] unable to sync(profile/openshift-cluster-node-tuning-operator/dciokd-master-1) requeued (1): failed to sync Profile dciokd-master-1: failed to sync OperatorStatus: Operation cannot be fulfilled on clusteroperators.config.openshift.io "node-tuning": the object has been modified; please apply your changes to the latest version and try again
      2024-11-06T04:18:02.053784567Z I1106 04:18:02.052457       1 status.go:313] 1/6 Profiles with bootcmdline conflict
      2024-11-06T04:18:02.063353278Z I1106 04:18:02.062462       1 status.go:313] 1/6 Profiles with bootcmdline conflict
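
      The "not all 3 Nodes in MCP worker agree on bootcmdline" errors (including the truncated ">4096active" value) suggest the operator compared inconsistent kernel command-line data for the worker nodes. A hedged sketch for cross-checking this from the cluster side (node selection and names are illustrative):

      # Operator-side log lines where the conflict is reported
      oc logs -n openshift-cluster-node-tuning-operator deployment/cluster-node-tuning-operator \
        | grep -i bootcmdline

      # Compare the actual kernel command line on every worker node
      for node in $(oc get nodes -l node-role.kubernetes.io/worker -o name); do
        echo "== ${node} =="
        oc debug "${node}" -- chroot /host cat /proc/cmdline
      done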

      Expected results:

      The node-tuning operator should not become degraded when this PerformanceProfile is applied; the deployment should succeed 100% of the time.

      Additional info:

      Deployments were made using Distributed-CI; below are all the cases where we detected this issue. In each linked job, the must-gather of the affected cluster can be found in the Files section (see the note after the list for locating the relevant operator logs inside a must-gather).
      
      - OpenShift 4.18 nightly 2024-11-01 05:41 - https://www.distributed-ci.io/jobs/19b50f0b-9d67-4151-80fe-efe766d7c8eb/files
      - OpenShift 4.18 nightly 2024-11-05 16:40 - https://www.distributed-ci.io/jobs/cc8a28af-8468-4511-96e7-9e5a6b2ad7a1/files
      - OpenShift 4.18 nightly 2024-11-21 13:21 - https://www.distributed-ci.io/jobs/4883723f-73e4-47dc-be7a-04cd61dcf619/files
      - OpenShift 4.17 nightly 2024-12-19 07:52 - https://www.distributed-ci.io/jobs/21187a71-543d-4782-83f9-876fc106f2e6/files
      - OpenShift 4.19.0 ec.0 - https://www.distributed-ci.io/jobs/b428a278-906e-41f2-93e5-a7e3705472e4/files
      - OpenShift 4.19 nightly 2024-12-23 18:24 - https://www.distributed-ci.io/jobs/c529fc65-a5b6-44f8-9cd4-567ddb189974/files
      - OpenShift 4.17 nightly 2024-12-29 13:27 - https://www.distributed-ci.io/jobs/bf5b12c3-641d-4d43-b817-2650ebf2ddfc/files
      - OpenShift 4.19 nightly 2024-12-31 03:14 - https://www.distributed-ci.io/jobs/e5d20de6-6f4e-48a6-bd9c-4c733d5133cc/files
      - OpenShift 4.17 nightly 2024-12-31 04:58 - https://www.distributed-ci.io/jobs/9c022da7-aa23-4694-81b3-f529d9d05977/files
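
      Once a must-gather from one of these jobs is extracted, the conflict messages can usually be located with a search like the following (the exact directory layout may vary with the must-gather image version; <must-gather-dir> is a placeholder):

      # Search the extracted must-gather for the operator's conflict message
      grep -r "Profiles with bootcmdline conflict" \
        <must-gather-dir>/*/namespaces/openshift-cluster-node-tuning-operator/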

              Team NTO
              Ramon Perez (raperez@redhat.com)
              Liquan Cui