- Bug
- Resolution: Done
- Normal
- Quality / Stability / Reliability
- 0.42
- Moderate
- Customer Reported
Description of problem:
When both the OpenShift Virtualization and Kube Descheduler operators are installed, the HCOMisconfiguredDescheduler alert fires as expected. The runbook correctly instructs the administrator to add spec.profileCustomizations.devEnableEvictionsInBackground: true to the KubeDescheduler custom resource.
Runbook: https://github.com/openshift/runbooks/blob/master/alerts/openshift-virtualization-operator/HCOMisconfiguredDescheduler.md
After this configuration is applied, the descheduler pod restarts with the --feature-gates=EvictionsInBackground=true flag. However, the HCOMisconfiguredDescheduler alert does not clear. It resolves only after an administrator manually restarts the hco-operator pod in the openshift-cnv namespace. This suggests that the hco-operator does not re-reconcile the descheduler configuration after the change is applied, which leaves a persistent and misleading alert.
Version-Release number of selected component (if applicable):
OpenShift Version 4.19.7
Kube Descheduler Operator Version 5.2.0
OpenShift Virtualization Operator Version 4.19.1
How reproducible:
Always
Steps to Reproduce:
1. Deploy an OpenShift cluster, install the OpenShift Virtualization Operator and create the HCO instance, then install the Kube Descheduler Operator and create a KubeDescheduler instance.
2. Wait for the HCOMisconfiguredDescheduler alert to begin firing.
3. Follow the official runbook and edit the KubeDescheduler CR to add the required configuration:
spec:
  profileCustomizations:
    devEnableEvictionsInBackground: true
4. Verify that the descheduler pod in the openshift-kube-descheduler-operator namespace restarts.
5. Inspect the restarted descheduler pod's YAML and confirm it now includes the --feature-gates=EvictionsInBackground=true argument.
6. Observe the cluster's alerts in the monitoring dashboard. The HCOMisconfiguredDescheduler alert is still firing.
7. Workaround: manually delete the hco-operator pod to force a restart. The operator then re-evaluates the configuration and the alert clears.
$ oc delete pod -n openshift-cnv -l name=hyperconverged-cluster-operator
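Steps 3 to 5 above can be scripted roughly as follows. This is a minimal sketch, not a verified procedure: the function names are hypothetical, the KubeDescheduler CR is assumed to be named "cluster", and the check lists every pod in the descheduler namespace because the pod label is not given in this report.

```shell
#!/bin/sh
# Return 0 if the text on stdin contains the EvictionsInBackground feature
# gate set to true, alone or inside a comma-separated --feature-gates list.
has_evictions_gate() {
  grep -Eq -e '--feature-gates=([^[:space:]"]*,)?EvictionsInBackground=true'
}

# Step 3 (hypothetical helper): merge-patch the KubeDescheduler CR.
# ASSUMPTION: the CR is named "cluster"; adjust to the instance on your cluster.
enable_evictions_in_background() {
  oc patch kubedescheduler cluster \
    -n openshift-kube-descheduler-operator --type=merge \
    -p '{"spec":{"profileCustomizations":{"devEnableEvictionsInBackground":true}}}'
}

# Steps 4-5 (hypothetical helper): dump the container args of every pod in the
# descheduler namespace and look for the feature gate.
check_descheduler_pods() {
  oc get pods -n openshift-kube-descheduler-operator \
    -o jsonpath='{.items[*].spec.containers[*].args}' | has_evictions_gate
}
```

Running enable_evictions_in_background and then, once the pod has restarted, check_descheduler_pods && echo "feature gate present" would confirm the descheduler side of the configuration independently of the alert.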
Actual results:
The HCOMisconfiguredDescheduler alert continues to fire indefinitely, even though the descheduler itself is now correctly configured and running with the required feature gate.
Expected results:
The HCOMisconfiguredDescheduler alert should clear automatically within a few minutes after the KubeDescheduler custom resource is correctly configured and the descheduler pod has successfully restarted. A manual restart of the hco-operator pod should not be required.
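One way to check the expected behavior without watching the dashboard is to poll the in-cluster Prometheus for the firing alert. The sketch below makes several assumptions: the Prometheus pod is prometheus-k8s-0 in openshift-monitoring with a container named prometheus listening on localhost:9090, the response is compact JSON, and the helper names and the ten-minute default are illustrative only.

```shell
#!/bin/sh
# PromQL selector for the firing alert (ALERTS is Prometheus' built-in alert series).
ALERT_QUERY='ALERTS{alertname="HCOMisconfiguredDescheduler",alertstate="firing"}'

# Read an instant-query response on stdin.
# Returns 0 if the alert is firing, 1 if it is not, 2 if the query failed.
query_has_result() {
  response=$(cat)
  case "$response" in
    *'"status":"success"'*) ;;
    *) return 2 ;;
  esac
  case "$response" in
    *'"result":[]'*) return 1 ;;
    *) return 0 ;;
  esac
}

# ASSUMPTION: pod, container, and port names match a default OpenShift monitoring stack.
fetch_alert_query() {
  oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
    curl -s -G --data-urlencode "query=${ALERT_QUERY}" \
    http://localhost:9090/api/v1/query
}

# Poll once per minute for up to $1 attempts (default 10).
# Returns 0 as soon as the alert is no longer firing, 1 on timeout.
wait_for_alert_clear() {
  attempts=${1:-10}
  while [ "$attempts" -gt 0 ]; do
    rc=0
    fetch_alert_query | query_has_result || rc=$?
    if [ "$rc" -eq 1 ]; then
      return 0
    fi
    attempts=$((attempts - 1))
    sleep 60
  done
  return 1
}
```

With the behavior described in this report, wait_for_alert_clear would be expected to time out until the hco-operator pod is restarted.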
Additional info:
The hco-operator appears to be caching the metric or state that triggers the HCOMisconfiguredDescheduler alert. It does not seem to have a reconciliation loop that re-evaluates the descheduler's startup flags or status when the KubeDescheduler CR is updated. The stale, incorrect metric state is only cleared when the hco-operator pod is restarted, forcing it to re-read the configuration upon startup.