OCPBUGS-47729

OCP 4.17+ | Node Tuning Operator got degraded when creating a PerformanceProfile with "Profiles with bootcmdline conflict" error message

    • Bug
    • Resolution: Unresolved
    • 4.17.z, 4.18.z, 4.19.z
    • Node Tuning Operator
    • 3
    • False
    • This release fixes a race in TuneD that occurred when two threads expanded variables at the same time. Before the fix, TuneD would write incorrect data to /etc/tuned/bootcmdline, causing the Node Tuning Operator (NTO) to block kernel parameter updates for all machines in the Machine Config Pool that the affected machine belonged to. This also caused the NTO to enter a degraded state.
    • Bug Fix
    • In Progress

      Description of problem:

      On OCP 4.17 and later, the Node Tuning Operator becomes degraded after a PerformanceProfile is created, reporting a "Profiles with bootcmdline conflict" error message.

      Version-Release number of selected component (if applicable):

      Observed in recent nightly builds of OCP 4.17, 4.18, and 4.19.

      How reproducible:

      Intermittent; the issue does not appear on every deployment.

      Steps to Reproduce:

          1. Deploy OCP using the IPI installer. In all observed cases, the cluster nodes were virtual machines managed by libvirt.
          2. Create the following PerformanceProfile:
      
      ---
      kind: PerformanceProfile
      apiVersion: "performance.openshift.io/v2"
      metadata:
        name: libvirt-profile
      spec:
        cpu:
          isolated: "6-23"
          reserved: "0-5"
        hugepages:
          pages:
            - size: "1G"
              count: 2
              node: 0
            - size: "2M"
              count: 1000
              node: 0
        numa:
          topologyPolicy: "restricted"
        nodeSelector:
          node-role.kubernetes.io/worker: ""
      ...
      
          3. Check the node-tuning operator status.
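
The status check in the last step can also be scripted. The following is a minimal sketch, assuming the ClusterOperator object has been exported as JSON (for example with `oc get clusteroperator node-tuning -o json`) and loaded into a dictionary. The helper name `degraded_condition` and the sample object are hypothetical and only illustrate the shape of the data.

```python
def degraded_condition(cluster_operator):
    """Return (is_degraded, message) for a ClusterOperator object given as a dict.

    Looks for the condition of type "Degraded" under status.conditions.
    If no such condition exists, the operator is treated as not degraded.
    """
    conditions = cluster_operator.get("status", {}).get("conditions", [])
    for cond in conditions:
        if cond.get("type") == "Degraded":
            return cond.get("status") == "True", cond.get("message", "")
    return False, ""


# Hypothetical sample that mirrors the failure described in this report.
sample = {
    "kind": "ClusterOperator",
    "metadata": {"name": "node-tuning"},
    "status": {
        "conditions": [
            {"type": "Available", "status": "True", "message": ""},
            {
                "type": "Degraded",
                "status": "True",
                "message": "1/6 Profiles with bootcmdline conflict",
            },
        ]
    },
}

is_degraded, message = degraded_condition(sample)
print(is_degraded, message)
```

Because the issue is intermittent, a check like this could be run after each deployment to flag the affected runs.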

      Actual results:

      The node-tuning operator is degraded and reports a message such as "x/6 Profiles with bootcmdline conflict" (x was 1 or 2 in the failures detected so far).
      
      From the must-gather and the cluster logs, we could see that:
      
      - All cluster nodes were in Ready status
      - All Tuned resources had a correct status and were not degraded; their message was "TuneD profile applied"
      - There were no issues with MCP or MC resources
      
      These are the logs we could extract from one of the Tuned resources, in one of the failed cases:
      
      2024-11-06T04:18:02.007091488Z E1106 04:18:02.007064       1 controller.go:788] not all 3 Nodes in MCP worker agree on bootcmdline: skew_tick=1 tsc=reliable rcupdate.rcu_normal_after_boot=1 nohz=on rcu_nocbs=6-23 tuned.non_isolcpus=0000003f systemd.cpu_affinity=0,1,2,3,4,5 intel_iommu=on iommu=pt isolcpus=managed_irq,6-23 nohz_full=6-23 tsc=reliable nosoftlockup nmi_watchdog=0 mce=off skew_tick=1 rcutree.kthread_prio=11 intel_pstate=active
      2024-11-06T04:18:02.029044869Z E1106 04:18:02.028479       1 controller.go:788] not all 3 Nodes in MCP worker agree on bootcmdline: >4096active
      2024-11-06T04:18:02.029631427Z I1106 04:18:02.029607       1 status.go:313] 1/6 Profiles with bootcmdline conflict
      2024-11-06T04:18:02.046243885Z I1106 04:18:02.046205       1 status.go:313] 1/6 Profiles with bootcmdline conflict
      2024-11-06T04:18:02.050944222Z E1106 04:18:02.050899       1 status.go:70] unable to update ClusterOperator: Operation cannot be fulfilled on clusteroperators.config.openshift.io "node-tuning": the object has been modified; please apply your changes to the latest version and try again
      2024-11-06T04:18:02.050944222Z E1106 04:18:02.050925       1 controller.go:198] unable to sync(profile/openshift-cluster-node-tuning-operator/dciokd-master-1) requeued (1): failed to sync Profile dciokd-master-1: failed to sync OperatorStatus: Operation cannot be fulfilled on clusteroperators.config.openshift.io "node-tuning": the object has been modified; please apply your changes to the latest version and try again
      2024-11-06T04:18:02.053784567Z I1106 04:18:02.052457       1 status.go:313] 1/6 Profiles with bootcmdline conflict
      2024-11-06T04:18:02.063353278Z I1106 04:18:02.062462       1 status.go:313] 1/6 Profiles with bootcmdline conflict
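
To compare the conflicting values quickly, the "not all N Nodes in MCP ... agree on bootcmdline" lines can be extracted from the log. The following is a minimal sketch: the regular expression is derived only from the log lines quoted above and is an assumption, not an interface provided by the operator. The function names are hypothetical, and the sample log is shortened from the lines above.

```python
import re

# Pattern inferred from the controller.go:788 lines in this report (assumption).
CONFLICT_RE = re.compile(
    r"not all (\d+) Nodes in MCP (\S+) agree on bootcmdline: (.*)$"
)

# Heuristic for one kernel argument: "key" or "key=value" with typical characters.
TOKEN_RE = re.compile(r"^[A-Za-z0-9_.\-]+(=[A-Za-z0-9_.,:\-/]*)?$")


def conflicting_cmdlines(log_text):
    """Return {mcp_name: [distinct bootcmdline values in order of appearance]}."""
    result = {}
    for line in log_text.splitlines():
        match = CONFLICT_RE.search(line)
        if not match:
            continue
        _node_count, mcp, cmdline = match.groups()
        values = result.setdefault(mcp, [])
        cmdline = cmdline.strip()
        if cmdline not in values:
            values.append(cmdline)
    return result


def looks_like_cmdline(value):
    """Return True if every whitespace-separated token resembles a kernel argument."""
    tokens = value.split()
    return bool(tokens) and all(TOKEN_RE.match(token) for token in tokens)


# Shortened from the log lines quoted above; the first command line is abbreviated.
sample_log = """\
E1106 04:18:02.007064       1 controller.go:788] not all 3 Nodes in MCP worker agree on bootcmdline: skew_tick=1 tsc=reliable nohz_full=6-23
E1106 04:18:02.028479       1 controller.go:788] not all 3 Nodes in MCP worker agree on bootcmdline: >4096active
I1106 04:18:02.029607       1 status.go:313] 1/6 Profiles with bootcmdline conflict
"""

for mcp, values in conflicting_cmdlines(sample_log).items():
    for value in values:
        print(mcp, looks_like_cmdline(value), repr(value))
```

In the log above, the second reported value (`>4096active`) does not look like a kernel command line, which would be consistent with incorrect data having been written to /etc/tuned/bootcmdline on one node.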

      Expected results:

      The node-tuning operator should never become degraded after the PerformanceProfile is created.

      Additional info:

      Deployments were made using Distributed-CI. All the cases where we detected this issue are listed below. For each one, the must-gather of the affected cluster is available in the Files section of the linked job.
      
      - OpenShift 4.18 nightly 2024-11-01 05:41 - https://www.distributed-ci.io/jobs/19b50f0b-9d67-4151-80fe-efe766d7c8eb/files
      - OpenShift 4.18 nightly 2024-11-05 16:40 - https://www.distributed-ci.io/jobs/cc8a28af-8468-4511-96e7-9e5a6b2ad7a1/files
      - OpenShift 4.18 nightly 2024-11-21 13:21 - https://www.distributed-ci.io/jobs/4883723f-73e4-47dc-be7a-04cd61dcf619/files
      - OpenShift 4.17 nightly 2024-12-19 07:52 - https://www.distributed-ci.io/jobs/21187a71-543d-4782-83f9-876fc106f2e6/files
      - OpenShift 4.19.0 ec.0 - https://www.distributed-ci.io/jobs/b428a278-906e-41f2-93e5-a7e3705472e4/files
      - OpenShift 4.19 nightly 2024-12-23 18:24 - https://www.distributed-ci.io/jobs/c529fc65-a5b6-44f8-9cd4-567ddb189974/files
      - OpenShift 4.17 nightly 2024-12-29 13:27 - https://www.distributed-ci.io/jobs/bf5b12c3-641d-4d43-b817-2650ebf2ddfc/files
      - OpenShift 4.19 nightly 2024-12-31 03:14 - https://www.distributed-ci.io/jobs/e5d20de6-6f4e-48a6-bd9c-4c733d5133cc/files
      - OpenShift 4.17 nightly 2024-12-31 04:58 - https://www.distributed-ci.io/jobs/9c022da7-aa23-4694-81b3-f529d9d05977/files

              jmencak Jiri Mencak
              raperez@redhat.com Ramon Perez
              Liquan Cui Liquan Cui