PSAP-1037 (project: Performance and Scale for AI Platforms)

The warning message in co/node-tuning does not disappear when scaling down a machineset


    • Type: Bug
    • Resolution: Done
    • Priority: Undefined
    • July Release for PSAP
    • openshift-4.13
    • NTO

      When worker nodes with different CPU counts are added to the same MCP, the warning message "Profiles with bootcmdline conflict" appears in the output of <oc get co/node-tuning>. The message should disappear once one of the worker nodes is removed from the MCP, but sometimes it does not.
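
      A minimal sketch of how the check could be scripted, assuming the warning text appears verbatim in the command output. The function name has_bootcmdline_conflict is hypothetical and not part of NTO or oc; it only reads text on stdin, so it can be fed from <oc get co/node-tuning> or from the operator pod log.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of oc or NTO): exits 0 when the text on
# stdin contains the warning described in this issue, and 1 otherwise.
has_bootcmdline_conflict() {
  grep -q "Profiles with bootcmdline conflict"
}

# Example usage against a live cluster (not executed by this block):
#   oc get co/node-tuning | has_bootcmdline_conflict && echo "warning still present"
```

      Matching on the literal message keeps the check independent of the column layout of the oc output.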

      How to reproduce the issue:

      1. Label one existing worker node with <node-role.kubernetes.io/worker-diffcpus=>; this is the first node
      2. Add a new machineset:
        MACHINE_SET=$(oc get machineset -n openshift-machine-api | grep worker | awk '{print $1}' | sort | tail -1)
        oc get machineset "$MACHINE_SET" -n openshift-machine-api -o json > ./differnode.json
        grep "$MACHINE_SET" ./differnode.json
        echo "Replace name"
        sed -i "s/$MACHINE_SET/openshift-psap-qe-gpu-node01/g" ./differnode.json
        grep openshift-psap-qe-gpu-node01 ./differnode.json
        sed -i 's/"instanceType":.*/"instanceType": "m6i.2xlarge",/' ./differnode.json
        sed -i 's/"replicas": 2,/"replicas": 1,/' ./differnode.json
        grep instanceType ./differnode.json
        oc get machines -n openshift-machine-api | grep m6i.2xlarge
      3. oc create -f ./differnode.json
      4. Label the new worker node with <node-role.kubernetes.io/worker-diffcpus=>; this is the second worker node
      5. Create the MCP
      6. oc apply -f - <<EOF
        apiVersion: machineconfiguration.openshift.io/v1
        kind: MachineConfigPool
        metadata:
          name: worker-diffcpus
          labels:
            worker-diffcpus: ""
        spec:
          machineConfigSelector:
            matchExpressions:
              - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker,worker-diffcpus]}
          nodeSelector:
            matchLabels:
              node-role.kubernetes.io/worker-diffcpus: ""
        EOF
      7. Create the custom Tuned profile:
        apiVersion: tuned.openshift.io/v1
        kind: Tuned
        metadata:
          name: openshift-bootcmdline-cpu
          namespace: openshift-cluster-node-tuning-operator
        spec:
          profile:
          - data: |
              [main]
              summary=Custom OpenShift profile
              [bootloader]
              cmdline=+cpus=${f:exec:/usr/bin/bash:-c:nproc|tr -d '\n'}
            name: openshift-bootcmdline-cpu

          recommend:
          - machineConfigLabels:
              machineconfiguration.openshift.io/role: "worker-diffcpus"
            priority: 20
            profile: openshift-bootcmdline-cpu

      8. Check the NTO operator pod log and the output of <oc get co/node-tuning>; both show the warning message "Profiles with bootcmdline conflict"
      9. Unlabel the second worker node. It will be removed from the MCP
      10. Wait for the MCP status to become ready
      11. Unlabel the first worker node and wait for the MCP to become ready
      12. Delete the MCP
      13. Repeat steps 1-12. The warning message "Profiles with bootcmdline conflict" no longer disappears from the output of <oc get co/node-tuning>
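
      The unlabel, wait, and delete steps above could be scripted roughly as follows. This is a sketch under stated assumptions, not the reporter's actual procedure: the function names unlabel_and_wait and teardown, the OC override variable, and the 30m timeout are illustrative, and "MCP is ready" is interpreted here as the Updated condition being true on both the worker-diffcpus and worker pools.

```shell
#!/usr/bin/env bash
# Sketch only. OC can be overridden, e.g. with a stub for a dry run.
OC="${OC:-oc}"
LABEL="node-role.kubernetes.io/worker-diffcpus"
POOL="worker-diffcpus"

# Remove the role label from one node, then wait for both pools to report
# Updated. Assumption: Updated=True is what "MCP is ready" means here.
# Note: the wait can return early if the pool has not started updating yet.
unlabel_and_wait() {
  local node="$1"
  "$OC" label node "$node" "${LABEL}-" || return 1
  "$OC" wait "mcp/${POOL}" --for=condition=Updated --timeout=30m || return 1
  "$OC" wait mcp/worker --for=condition=Updated --timeout=30m
}

# Unlabel the second node, then the first, then delete the custom pool.
teardown() {
  local first="$1" second="$2"
  unlabel_and_wait "$second" || return 1
  unlabel_and_wait "$first" || return 1
  "$OC" delete mcp "$POOL"
}

# Example usage against a live cluster (not executed by this block):
#   teardown <first-node-name> <second-node-name>
```

      Running the teardown in a fixed order makes it easier to tell whether the stale warning depends on which node leaves the pool first.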

              jmencak Jiri Mencak
              rhn-support-liqcui Liquan Cui