OCPBUGS-47729
OCP 4.17+ | Node Tuning Operator becomes degraded when creating a PerformanceProfile, with "Profiles with bootcmdline conflict" error message


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Undefined
    • Affects Version/s: 4.17.z, 4.18.z, 4.19.z
    • Component: Node Tuning Operator
      Description of problem:

      OCP 4.17+ | The Node Tuning Operator becomes degraded when creating a PerformanceProfile, with the error message "Profiles with bootcmdline conflict".

      Version-Release number of selected component (if applicable):

      Observed in recent nightly builds of OCP 4.17, 4.18, and 4.19.

      How reproducible:

      Intermittent; the issue does not appear on every deployment.

      Steps to Reproduce:

          1. Deploy OCP using the IPI installer. In all observed cases, the cluster nodes were virtual machines managed by libvirt.
          2. Create the following PerformanceProfile:
      
      ---
      kind: PerformanceProfile
      apiVersion: "performance.openshift.io/v2"
      metadata:
        name: libvirt-profile
      spec:
        cpu:
          # CPUs 0-5 reserved for housekeeping, CPUs 6-23 isolated for workloads
          isolated: "6-23"
          reserved: "0-5"
        hugepages:
          pages:
            # 2 x 1GiB and 1000 x 2MiB hugepages, all on NUMA node 0
            - size: "1G"
              count: 2
              node: 0
            - size: "2M"
              count: 1000
              node: 0
        numa:
          topologyPolicy: "restricted"
        nodeSelector:
          node-role.kubernetes.io/worker: ""
      ...
      
          3. Check the node-tuning operator status (see the example commands below).
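
      For reference, a minimal sketch of how step 3 can be done with standard oc commands, assuming the default openshift-cluster-node-tuning-operator namespace:

      # Overall ClusterOperator state (Available/Progressing/Degraded)
      oc get clusteroperator node-tuning

      # Inspect the Degraded condition and its message in detail
      oc get clusteroperator node-tuning \
        -o jsonpath='{.status.conditions[?(@.type=="Degraded")]}{"\n"}'

      # Per-node TuneD Profile objects managed by the operator
      oc get profiles.tuned.openshift.io -n openshift-cluster-node-tuning-operator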

      Actual results:

      The node-tuning operator is degraded, reporting a message such as "x/6 Profiles with bootcmdline conflict" (where x has been 1 or 2 in the failures detected so far).
      
      From the must-gather and the cluster logs, we could see that:
      
      - All cluster nodes were in Ready status
      - All Tuned Profile resources reported a healthy, non-degraded status with the message "TuneD profile applied"
      - There were no issues with MachineConfigPool (MCP) or MachineConfig (MC) resources
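
      These observations can be cross-checked on a live cluster with commands along these lines (a sketch, assuming default resource names):

      # All cluster nodes should be Ready
      oc get nodes

      # Per-node TuneD Profile conditions (Applied/Degraded) and messages
      oc describe profiles.tuned.openshift.io -n openshift-cluster-node-tuning-operator

      # MachineConfigPools and MachineConfigs should show no degraded or failed state
      oc get mcp
      oc get mc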
      
      These are the log lines we extracted from the Node Tuning Operator in one of the failing cases:
      
      2024-11-06T04:18:02.007091488Z E1106 04:18:02.007064       1 controller.go:788] not all 3 Nodes in MCP worker agree on bootcmdline: skew_tick=1 tsc=reliable rcupdate.rcu_normal_after_boot=1 nohz=on rcu_nocbs=6-23 tuned.non_isolcpus=0000003f systemd.cpu_affinity=0,1,2,3,4,5 intel_iommu=on iommu=pt isolcpus=managed_irq,6-23 nohz_full=6-23 tsc=reliable nosoftlockup nmi_watchdog=0 mce=off skew_tick=1 rcutree.kthread_prio=11 intel_pstate=active
      2024-11-06T04:18:02.029044869Z E1106 04:18:02.028479       1 controller.go:788] not all 3 Nodes in MCP worker agree on bootcmdline: >4096active
      2024-11-06T04:18:02.029631427Z I1106 04:18:02.029607       1 status.go:313] 1/6 Profiles with bootcmdline conflict
      2024-11-06T04:18:02.046243885Z I1106 04:18:02.046205       1 status.go:313] 1/6 Profiles with bootcmdline conflict
      2024-11-06T04:18:02.050944222Z E1106 04:18:02.050899       1 status.go:70] unable to update ClusterOperator: Operation cannot be fulfilled on clusteroperators.config.openshift.io "node-tuning": the object has been modified; please apply your changes to the latest version and try again
      2024-11-06T04:18:02.050944222Z E1106 04:18:02.050925       1 controller.go:198] unable to sync(profile/openshift-cluster-node-tuning-operator/dciokd-master-1) requeued (1): failed to sync Profile dciokd-master-1: failed to sync OperatorStatus: Operation cannot be fulfilled on clusteroperators.config.openshift.io "node-tuning": the object has been modified; please apply your changes to the latest version and try again
      2024-11-06T04:18:02.053784567Z I1106 04:18:02.052457       1 status.go:313] 1/6 Profiles with bootcmdline conflict
      2024-11-06T04:18:02.063353278Z I1106 04:18:02.062462       1 status.go:313] 1/6 Profiles with bootcmdline conflict
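
      The "not all 3 Nodes in MCP worker agree on bootcmdline" errors (including the truncated ">4096active" value) suggest the operator compared inconsistent kernel command-line data for the worker nodes. A hedged sketch for cross-checking this from the cluster side (node selection and names are illustrative):

      # Operator-side log lines where the conflict is reported
      oc logs -n openshift-cluster-node-tuning-operator deployment/cluster-node-tuning-operator \
        | grep -i bootcmdline

      # Compare the actual kernel command line on every worker node
      for node in $(oc get nodes -l node-role.kubernetes.io/worker -o name); do
        echo "== ${node} =="
        oc debug "${node}" -- chroot /host cat /proc/cmdline
      done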

      Expected results:

      The node-tuning operator should not become degraded when this PerformanceProfile is applied; the deployment should succeed 100% of the time.

      Additional info:

      Deployments were made using Distributed-CI; below are all the cases where we detected this issue. In each linked job, the must-gather of the affected cluster can be found in the Files section (see the note after the list for locating the relevant operator logs inside a must-gather).
      
      - OpenShift 4.18 nightly 2024-11-01 05:41 - https://www.distributed-ci.io/jobs/19b50f0b-9d67-4151-80fe-efe766d7c8eb/files
      - OpenShift 4.18 nightly 2024-11-05 16:40 - https://www.distributed-ci.io/jobs/cc8a28af-8468-4511-96e7-9e5a6b2ad7a1/files
      - OpenShift 4.18 nightly 2024-11-21 13:21 - https://www.distributed-ci.io/jobs/4883723f-73e4-47dc-be7a-04cd61dcf619/files
      - OpenShift 4.17 nightly 2024-12-19 07:52 - https://www.distributed-ci.io/jobs/21187a71-543d-4782-83f9-876fc106f2e6/files
      - OpenShift 4.19.0 ec.0 - https://www.distributed-ci.io/jobs/b428a278-906e-41f2-93e5-a7e3705472e4/files
      - OpenShift 4.19 nightly 2024-12-23 18:24 - https://www.distributed-ci.io/jobs/c529fc65-a5b6-44f8-9cd4-567ddb189974/files
      - OpenShift 4.17 nightly 2024-12-29 13:27 - https://www.distributed-ci.io/jobs/bf5b12c3-641d-4d43-b817-2650ebf2ddfc/files
      - OpenShift 4.19 nightly 2024-12-31 03:14 - https://www.distributed-ci.io/jobs/e5d20de6-6f4e-48a6-bd9c-4c733d5133cc/files
      - OpenShift 4.17 nightly 2024-12-31 04:58 - https://www.distributed-ci.io/jobs/9c022da7-aa23-4694-81b3-f529d9d05977/files
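
      Once a must-gather from one of these jobs is extracted, the conflict messages can usually be located with a search like the following (the exact directory layout may vary with the must-gather image version; <must-gather-dir> is a placeholder):

      # Search the extracted must-gather for the operator's conflict message
      grep -r "Profiles with bootcmdline conflict" \
        <must-gather-dir>/*/namespaces/openshift-cluster-node-tuning-operator/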

              Team NTO
              Ramon Perez (raperez@redhat.com)
              Liquan Cui