Bug
Resolution: Unresolved
Minor
4.19
Quality / Stability / Reliability
Moderate
Description of problem:
The Kube Descheduler Operator has a new `DevKubeVirtRelieveAndMigrate` profile. To use this profile, the documentation includes a MachineConfig that sets `psi=1` as a kernel argument.

As a resolution of OCPBUGS-37271, `psi=0` was set as a platform default. It is present by default in the generated MachineConfigs:
97-master-generated-kubelet
97-worker-generated-kubelet

This causes a conflict: both arguments are present when querying `/proc/cmdline` (see attached cmdline-output.png).

When the Descheduler is triggered, this results in the error alert "DeschedulerPSIDisabled" (see attached DeschedulerPSIDisabled.jpg).
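The conflict described above can be checked on a node with a small script. The following is a minimal sketch, not part of the operator or its documentation: the function names `psi_values` and `has_psi_conflict` are hypothetical, and the sketch only reports that conflicting values are present. It makes no claim about which value the kernel ends up honoring.

```python
def psi_values(cmdline: str) -> list[str]:
    """Return every value given to the psi= kernel argument, in order of appearance."""
    return [
        token.split("=", 1)[1]
        for token in cmdline.split()
        if token.startswith("psi=")
    ]


def has_psi_conflict(cmdline: str) -> bool:
    """True when psi= appears more than once with differing values."""
    return len(set(psi_values(cmdline))) > 1


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the kernel command line of the running system (Linux only)."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()
```

On an affected node, `has_psi_conflict(read_cmdline())` would be expected to return `True`, since both `psi=0` and `psi=1` appear in `/proc/cmdline`.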
Version-Release number of selected component (if applicable):
As of 4.17
How reproducible:
Consistently
Steps to Reproduce:
1. Create a MachineConfig for enabling PSI as per the documentation
---
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 99-master-kargs-psi
spec:
  kernelArguments:
    - psi=1
2. Create the example KubeDescheduler CR
---
apiVersion: operator.openshift.io/v1
kind: KubeDescheduler
metadata:
  name: cluster
  namespace: openshift-kube-descheduler-operator
spec:
  managementState: Managed
  deschedulingIntervalSeconds: 30
  mode: "Automatic"
  profiles:
    - DevKubeVirtRelieveAndMigrate
  profileCustomizations:
    devEnableSoftTainter: true
    devDeviationThresholds: AsymmetricLow
    devActualUtilizationProfile: PrometheusCPUCombined
3. Create one or more VMs.
4. Trigger an action on the descheduler. This can be done with a CPU Load pod:
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cpuload
spec:
  selector:
    matchLabels:
      app: cpuload
  replicas: 1
  template:
    metadata:
      labels:
        app: cpuload
    spec:
      nodeSelector:
        # change to the hostname of the node running the VM
        kubernetes.io/hostname: worker-1
      containers:
        - name: container
          image: 'quay.io/simonkrenger/cpuload:latest'
          resources:
            # may need to change limits or replicas to trigger load shedding descheduler action
            limits:
              cpu: '8'
              memory: 1Gi
            requests:
              cpu: '8'
              memory: 1Gi
  strategy:
    type: Recreate
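Before triggering the descheduler, it may help to confirm which MachineConfigs contribute a `psi=` argument. The sketch below assumes the output of `oc get machineconfig -o json` has been parsed into a dictionary, and that the platform default appears under `spec.kernelArguments` of the generated configs (this report states it is present there, but the exact field is an assumption). The function name `psi_sources` and the sample data are hypothetical and for illustration only.

```python
def psi_sources(machine_config_list: dict) -> dict[str, list[str]]:
    """Map each MachineConfig name to the psi= kernel arguments it declares."""
    sources: dict[str, list[str]] = {}
    for item in machine_config_list.get("items", []):
        name = item.get("metadata", {}).get("name", "<unnamed>")
        # kernelArguments may be absent or null on configs that set none.
        kernel_args = item.get("spec", {}).get("kernelArguments") or []
        psi_args = [arg for arg in kernel_args if arg.startswith("psi=")]
        if psi_args:
            sources[name] = psi_args
    return sources


# Illustrative sample shaped like a Kubernetes List object; not real cluster output.
sample = {
    "items": [
        {"metadata": {"name": "00-master"}, "spec": {}},
        {
            "metadata": {"name": "97-master-generated-kubelet"},
            "spec": {"kernelArguments": ["psi=0"]},
        },
        {
            "metadata": {"name": "99-master-kargs-psi"},
            "spec": {"kernelArguments": ["psi=1"]},
        },
    ]
}

for config_name, args in psi_sources(sample).items():
    print(config_name, args)
```

If more than one MachineConfig for the same role appears in the result with different values, the rendered configuration for that role carries both arguments, which matches the `/proc/cmdline` output attached to this report.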
Actual results:
The descheduler fails, producing an error and the "DeschedulerPSIDisabled" alert.
Expected results:
The VM is moved to a more suitable host and the load is rebalanced.
Additional info:
I wonder whether `psi=0` is too broad a platform default. Was the original bug due to the use of the realtime kernel in Telco? If not, should our documentation for the `DevKubeVirtRelieveAndMigrate` profile in Kube Descheduler include a note about the default? Thoughts?
Depends on: OCPNODE-3806 Node Team to reconsider enabling PSI Metrics to help CNV Descheduler (In Progress)