OpenShift Bugs: OCPBUGS-62301

Evaluation of platform default kernel psi argument impact and Kube Descheduler Guidance


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • 4.19
    • descheduler
    • Quality / Stability / Reliability
    • False
    • Moderate

      Description of problem:

      The Kube Descheduler Operator has a new `DevKubeVirtRelieveAndMigrate` profile.  To use this profile, the documentation includes a MachineConfig that sets `psi=1` as a kernel argument.
      
      As the resolution of OCPBUGS-37271, we set `psi=0` as a platform default.  It is present by default in the generated MachineConfigs:
      97-master-generated-kubelet
      97-worker-generated-kubelet
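
      To see which MachineConfigs carry a `psi=` argument, something like the following could be run against the JSON from `oc get machineconfig -o json`. This is a minimal sketch, not part of the operator or its documentation: the helper name `psi_kernel_args` and the sample data are hypothetical, and it assumes the argument appears under `spec.kernelArguments`.

```python
import json


def psi_kernel_args(machine_configs):
    """Map each MachineConfig name to the psi=... kernel arguments it sets."""
    found = {}
    for mc in machine_configs:
        name = mc.get("metadata", {}).get("name", "<unnamed>")
        # kernelArguments may be absent or null on MachineConfigs that set none.
        kargs = mc.get("spec", {}).get("kernelArguments") or []
        psi = [arg for arg in kargs if arg.startswith("psi=")]
        if psi:
            found[name] = psi
    return found


# Hypothetical sample shaped like `oc get machineconfig -o json` output,
# trimmed to the fields this sketch reads. Real objects carry more fields.
sample = json.loads("""
{
  "items": [
    {"metadata": {"name": "97-worker-generated-kubelet"},
     "spec": {"kernelArguments": ["psi=0"]}},
    {"metadata": {"name": "99-master-kargs-psi"},
     "spec": {"kernelArguments": ["psi=1"]}},
    {"metadata": {"name": "00-worker"}, "spec": {}}
  ]
}
""")

for name, args in psi_kernel_args(sample["items"]).items():
    print(name, args)
```

      More than one entry in the result means that several MachineConfigs are contributing a `psi=` argument.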
      
      This causes a conflict: both arguments are present in `/proc/cmdline` (see attached cmdline-output.png).
      
      When the Descheduler is triggered, it raises the error alert `DeschedulerPSIDisabled` (see attached DeschedulerPSIDisabled.jpg).
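
      The duplicate argument in `/proc/cmdline` can be detected with a small check such as the one below. This is a rough sketch: the function names and the example command line are made up for illustration, and the order of the two arguments on a real node may differ. It only reports that conflicting values are present and makes no claim about which value the kernel ends up applying.

```python
def psi_values(cmdline):
    """Return every value passed to psi= on a kernel command line, in order."""
    return [
        token.split("=", 1)[1]
        for token in cmdline.split()
        if token.startswith("psi=")
    ]


def has_psi_conflict(cmdline):
    """True when psi= appears with more than one distinct value."""
    return len(set(psi_values(cmdline))) > 1


# Hypothetical command line; on a node, use the contents of /proc/cmdline.
example = "rw systemd.unified_cgroup_hierarchy=1 cgroup_no_v1=all psi=0 psi=1"

print(psi_values(example))
print(has_psi_conflict(example))
```

      On a node, the input would be the text of `/proc/cmdline` rather than the example string.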

      Version-Release number of selected component (if applicable):

          As of 4.17

      How reproducible:

          Consistently

      Steps to Reproduce:

          1. Create a MachineConfig for enabling PSI as per the documentation
      ---
      apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      metadata:
        labels:
          machineconfiguration.openshift.io/role: master
        name: 99-master-kargs-psi
      spec:
        kernelArguments:
        - psi=1

       

          2. Create the example KubeDescheduler CR
      ---
      apiVersion: operator.openshift.io/v1
      kind: KubeDescheduler
      metadata:
        name: cluster
        namespace: openshift-kube-descheduler-operator
      spec:
        managementState: Managed
        deschedulingIntervalSeconds: 30
        mode: "Automatic"
        profiles:
          - DevKubeVirtRelieveAndMigrate
        profileCustomizations:
          devEnableSoftTainter: true
          devDeviationThresholds: AsymmetricLow
          devActualUtilizationProfile: PrometheusCPUCombined
          3. Create one or more VMs.
          4. Trigger a descheduler action.  This can be done with a CPU load pod:
      
      ---
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: cpuload
      spec:
        selector:
          matchLabels:
            app: cpuload
        replicas: 1
        template:
          metadata:
            labels:
              app: cpuload
          spec:
            nodeSelector:
              # change to the hostname of the node running the VM
              kubernetes.io/hostname: worker-1
            containers:
              - name: container
                image: 'quay.io/simonkrenger/cpuload:latest'
                resources: # may need to change limits or replicas to trigger load shedding descheduler action
                  limits:
                    cpu: '8'
                    memory: 1Gi
                  requests:
                    cpu: '8'
                    memory: 1Gi
        strategy:
          type: Recreate    

      Actual results:

          The Descheduler fails and produces an error and an alert.

      Expected results:

          The VM moves to a more suitable host and the load is rebalanced.

      Additional info:

      I wonder whether `psi=0` is too broad as a platform default, and whether the original bug was caused by the use of the realtime kernel in Telco?
      
      If not, should our documentation for the `DevKubeVirtRelieveAndMigrate` profile in Kube Descheduler include a note about this default?
      
      Thoughts?

              rmarasch@redhat.com Ricardo Maraschini
              kmoini1@redhat.com Ken Moini
              Rama Kasturi Narra Rama Kasturi Narra