  1. OpenShift Bugs
  2. OCPBUGS-45472

Rendered machine config fails to apply when the performance profile contains a very long list of CPUs

      Cause: A very long list of isolated CPUs was specified in the Performance Profile.

      Issue: The long string (imagine 512 CPUs spelled out one by one) was passed to tuned, the MCO, and rpm-ostree, and could not be processed because of its length. The system looked fine and no errors were reported, but all of the kernel arguments were missing.

      Fix: The user input is now normalized and minimized to use ranges. For example, a "0,1, ... ,512" input is converted to "0-512" both internally and on the kernel command line.

      Result: It is still possible to construct an argument that triggers this issue (for example, by listing only the odd CPU IDs 1,3,5,...,511), but for most common cases the processing should now work and the kernel arguments should be applied.
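
      To illustrate the normalization described in the Fix and the limitation mentioned in the Result, here is a minimal Python sketch. It is not the operator's actual implementation; the function names `parse_cpu_list` and `compress_cpu_list` are hypothetical and exist only for this example.

```python
def parse_cpu_list(spec: str) -> list[int]:
    """Parse a CPU list such as "0,1,4-7" into sorted, unique CPU IDs."""
    cpus: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # A range like "4-7" expands to 4, 5, 6, 7.
            start, end = (int(x) for x in part.split("-", 1))
            cpus.update(range(start, end + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def compress_cpu_list(spec: str) -> str:
    """Collapse runs of consecutive CPU IDs into "start-end" ranges."""
    cpus = parse_cpu_list(spec)
    ranges: list[str] = []
    i = 0
    while i < len(cpus):
        j = i
        # Extend the run while the next ID is exactly one higher.
        while j + 1 < len(cpus) and cpus[j + 1] == cpus[j] + 1:
            j += 1
        ranges.append(str(cpus[i]) if i == j else f"{cpus[i]}-{cpus[j]}")
        i = j + 1
    return ",".join(ranges)


if __name__ == "__main__":
    every_cpu = ",".join(str(n) for n in range(0, 513))
    print(compress_cpu_list(every_cpu))            # 0-512

    odd_only = ",".join(str(n) for n in range(1, 512, 2))
    # No two odd IDs are consecutive, so nothing can be collapsed.
    print(compress_cpu_list(odd_only) == odd_only)  # True
```

      The second case shows why the fix cannot help every input: a list with no consecutive IDs stays exactly as long after normalization as it was before.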
    • Bug Fix
    • In Progress

      This is a clone of issue OCPBUGS-45264. The following is the description of the original issue:

      Description of problem:

          When applying a profile whose isolated field contains a very long CPU list, the profile is not applied and no error is reported.

      Version-Release number of selected component (if applicable):

          4.18.0-0.nightly-2024-11-26-075648

      How reproducible:

          Every time.

      Steps to Reproduce:

          1. Create a profile as specified below:
      apiVersion: performance.openshift.io/v2
      kind: PerformanceProfile
      metadata:
        annotations:
          kubeletconfig.experimental: '{"topologyManagerPolicy":"restricted"}'
        creationTimestamp: "2024-11-27T10:25:13Z"
        finalizers:
        - foreground-deletion
        generation: 61
        name: performance
        resourceVersion: "3001998"
        uid: 8534b3bf-7bf7-48e1-8413-6e728e89e745
      spec:
        cpu:
          isolated: 25,281,117,373,127,383,3,259,30,286,77,333,23,279,21,277,66,322,12,268,15,271,61,317,120,376,35,291,62,318,93,349,126,382,19,275,52,308,110,366,50,306,92,348,124,380,119,375,2,258,29,285,60,316,115,371,118,374,104,360,108,364,70,326,72,328,76,332,96,352,99,355,64,320,80,336,97,353,8,264,11,267,38,294,53,309,57,313,103,359,14,270,87,343,7,263,40,296,51,307,94,350,116,372,39,295,46,302,90,346,101,357,107,363,26,282,67,323,98,354,106,362,113,369,6,262,10,266,20,276,33,289,112,368,85,341,121,377,68,324,71,327,79,335,81,337,83,339,88,344,9,265,89,345,91,347,100,356,54,310,31,287,58,314,59,315,22,278,47,303,105,361,17,273,114,370,111,367,28,284,49,305,55,311,84,340,27,283,95,351,5,261,36,292,41,297,43,299,45,301,75,331,102,358,109,365,37,293,56,312,63,319,65,321,74,330,125,381,13,269,42,298,44,300,78,334,122,378,4,260,16,272,34,290,123,379,18,274,48,304,69,325,82,338,24,280,32,288,73,329,86,342,220,476,251,507,206,462,226,482,229,485,221,477,253,509,255,511,135,391,144,400,183,439,233,489,137,393,186,442,198,454,190,446,234,490,147,403,163,419,172,428,148,404,149,405,250,506,151,407,217,473,238,494,131,387,230,486,235,491,246,502,145,401,194,450,199,455,143,399,169,425,231,487,245,501,129,385,142,398,179,435,225,481,236,492,152,408,203,459,214,470,166,422,207,463,212,468,130,386,155,411,215,471,188,444,201,457,210,466,193,449,200,456,248,504,141,397,167,423,191,447,181,437,222,478,252,508,128,384,139,395,174,430,164,420,168,424,187,443,232,488,133,389,157,413,208,464,140,396,185,441,241,497,219,475,175,431,184,440,213,469,154,410,197,453,249,505,209,465,218,474,227,483,244,500,134,390,153,409,178,434,160,416,195,451,196,452,211,467,132,388,136,392,146,402,138,394,150,406,239,495,173,429,192,448,202,458,205,461,216,472,158,414,159,415,176,432,189,445,237,493,242,498,177,433,182,438,204,460,240,496,254,510,162,418,171,427,180,436,243,499,156,412,165,421,170,426,228,484,247,503,161,417,223,479,224,480
          reserved: 0,256,1,257
        hugepages:
          defaultHugepagesSize: 1G
          pages:
          - count: 20
            size: 2M
        machineConfigPoolSelector:
          machineconfiguration.openshift.io/role: worker-cnf
        net:
          userLevelNetworking: true
        nodeSelector:
          node-role.kubernetes.io/worker-cnf: ""
        numa:
          topologyPolicy: restricted
        realTimeKernel:
          enabled: false
        workloadHints:
          highPowerConsumption: true
          perPodPowerManagement: false
          realTime: true
      
          2. Observe that the worker-cnf node doesn't contain any kernel args associated with the above profile.

      Actual results:

          The system doesn't boot with the kernel args associated with the above profile.

      Expected results:

          The system should boot with the kernel args generated from the Performance Profile.

      Additional info:

      We can see that the MCO gets the details and creates the mc:
      
      Dec 02 08:59:43 cnfdd11.t5g-dev.eng.rdu2.dc.redhat.com root[18779]: machine-config-daemon[9550]: "Running rpm-ostree [kargs --delete=systemd.unified_cgroup_hierarchy=1 --delete=cgroup_no_v1=\"all\" --delete=psi=0 --delete=skew_tick=1 --delete=tsc=reliable --delete=rcupda>
      Dec 02 08:59:43 cnfdd11.t5g-dev.eng.rdu2.dc.redhat.com root[18779]: cbs=25,281,117,373,127,383,3,259,30,286,77,333,23,279,21,277,66,322,12,268,15,271,61,317,120,376,35,291,62,318,93,349,126,382,19,275,52,308,110,366,50,306,92,348,124,380,119,375,2,258,29,285,60,316,115,3>
      Dec 02 08:59:43 cnfdd11.t5g-dev.eng.rdu2.dc.redhat.com root[18779]: 4,190,446,234,490,147,403,163,419,172,428,148,404,149,405,250,506,151,407,217,473,238,494,131,387,230,486,235,491,246,502,145,401,194,450,199,455,143,399,169,425,231,487,245,501,129,385,142,398,179,435,2>
      Dec 02 08:59:43 cnfdd11.t5g-dev.eng.rdu2.dc.redhat.com root[18779]: systemd.cpu_affinity=0,1,256,257 --append=iommu=pt --append=amd_pstate=guided --append=tsc=reliable --append=nmi_watchdog=0 --append=mce=off --append=processor.max_cstate=1 --append=idle=poll --append=is>
      Dec 02 08:59:43 cnfdd11.t5g-dev.eng.rdu2.dc.redhat.com root[18779]: ,78,334,122,378,4,260,16,272,34,290,123,379,18,274,48,304,69,325,82,338,24,280,32,288,73,329,86,342,220,476,251,507,206,462,226,482,229,485,221,477,253,509,255,511,135,391,144,400,183,439,233,489,137,393>
      Dec 02 08:59:43 cnfdd11.t5g-dev.eng.rdu2.dc.redhat.com root[18779]: 510,162,418,171,427,180,436,243,499,156,412,165,421,170,426,228,484,247,503,161,417,223,479,224,480 --append=nohz_full=25,281,117,373,127,383,3,259,30,286,77,333,23,279,21,277,66,322,12,268,15,271,61,317>
      Dec 02 08:59:43 cnfdd11.t5g-dev.eng.rdu2.dc.redhat.com root[18779]: ,476,251,507,206,462,226,482,229,485,221,477,253,509,255,511,135,391,144,400,183,439,233,489,137,393,186,442,198,454,190,446,234,490,147,403,163,419,172,428,148,404,149,405,250,506,151,407,217,473,238,49>
      Dec 02 08:59:43 cnfdd11.t5g-dev.eng.rdu2.dc.redhat.com root[18779]: ppend=nosoftlockup --append=skew_tick=1 --append=rcutree.kthread_prio=11 --append=default_hugepagesz=1G --append=hugepagesz=2M --append=hugepages=20]"
      Dec 02 08:59:43 cnfdd11.t5g-dev.eng.rdu2.dc.redhat.com rpm-ostree[18750]: client(id:machine-config-operator dbus:1.336 unit:crio-36c845a9c9a58a79a0e09dab668f8b21b5e46e5734a527c269c6a5067faa423b.scope uid:0) added; new total=1
      Dec 02 08:59:43 cnfdd11.t5g-dev.eng.rdu2.dc.redhat.com rpm-ostree[18750]: Loaded sysroot
      
      Actual Kernel args:
      BOOT_IMAGE=(hd1,gpt3)/boot/ostree/rhcos-854dd632224b34d5f4df1884c4ba8c2f9527422b37744b83e7b1b98172586ff4/vmlinuz-5.14.0-427.44.1.el9_4.x86_64 rw ostree=/ostree/boot.0/rhcos/854dd632224b34d5f4df1884c4ba8c2f9527422b37744b83e7b1b98172586ff4/0 ignition.platform.id=metal ip=dhcp root=UUID=0068e804-432c-409d-aabc-260aa71e3669 rw rootflags=prjquota boot=UUID=7797d927-876e-426b-9a30-d1e600c1a382 systemd.unified_cgroup_hierarchy=1 cgroup_no_v1=all psi=0 skew_tick=1 tsc=reliable rcupdate.rcu_normal_after_boot=1 nohz=on
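
      The mismatch between the arguments passed to rpm-ostree and the actual kernel command line can be checked mechanically. The following Python sketch is only an illustration; `parse_kargs` and `missing_kargs` are hypothetical helper names, and the sample values are copied from the log and the command line above. On a node, the command line would be read from /proc/cmdline instead of a literal string.

```python
from typing import Optional


def parse_kargs(cmdline: str) -> dict[str, Optional[str]]:
    """Split a kernel command line into {key: value}; flags map to None.

    Simplification: tokens are split on whitespace, so quoted values
    containing spaces are not handled, and a repeated key keeps only
    its last value.
    """
    kargs: dict[str, Optional[str]] = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        kargs[key] = value if sep else None
    return kargs


def missing_kargs(expected: list[str], cmdline: str) -> list[str]:
    """Return the expected arguments that are absent or have another value."""
    actual = parse_kargs(cmdline)
    missing = []
    for arg in expected:
        key, sep, value = arg.partition("=")
        if key not in actual or (sep and actual[key] != value):
            missing.append(arg)
    return missing


if __name__ == "__main__":
    # A few of the arguments appended in the rpm-ostree call from the log.
    expected = ["skew_tick=1", "tsc=reliable", "iommu=pt",
                "nosoftlockup", "hugepages=20"]
    # Tail of the actual kernel command line from this report.
    cmdline = ("systemd.unified_cgroup_hierarchy=1 cgroup_no_v1=all psi=0 "
               "skew_tick=1 tsc=reliable rcupdate.rcu_normal_after_boot=1 "
               "nohz=on")
    print(missing_kargs(expected, cmdline))
    # ['iommu=pt', 'nosoftlockup', 'hugepages=20']
```

      With the values from this report, the arguments that were already on the command line before the profile was applied (skew_tick=1, tsc=reliable) are found, while the profile-specific ones are reported as missing.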
      
          

              msivak@redhat.com Martin Sivak
              openshift-crt-jira-prow OpenShift Prow Bot
              Gowrishankar Rajaiyan Gowrishankar Rajaiyan