- Task
- Resolution: Done
- Major
- MON Sprint 246
In OCP 4.10, the PrometheusOperatorRejectedResources alert (along with other alerts related to Prometheus operator) has been extended to cover the openshift-user-workload-monitoring namespace.
The CCX team has seen that about 5% of 4.10 clusters have the alert firing for openshift-user-workload-monitoring. In practice, this means that some user-defined pod or service monitors are invalid, for example because of invalid scrape interval values or references to missing secrets for scrape authentication.
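As an illustration of the two failure modes named above, here is a minimal sketch of a pre-flight check on a ServiceMonitor-like dictionary. The function name `check_service_monitor` and the `existing_secrets` argument are hypothetical, and the duration pattern is an assumption modeled on Prometheus-style durations such as `30s` or `1h30m`; the operator's own checks may differ.

```python
import re

# Assumed pattern for Prometheus-style durations ("30s", "1h30m", "0").
DURATION_RE = re.compile(
    r"^(0|(([0-9]+)y)?(([0-9]+)w)?(([0-9]+)d)?(([0-9]+)h)?"
    r"(([0-9]+)m)?(([0-9]+)s)?(([0-9]+)ms)?)$"
)


def check_service_monitor(monitor, existing_secrets):
    """Return a list of human-readable problems found in a ServiceMonitor-like dict.

    `existing_secrets` is a hypothetical set of secret names present in the
    monitor's namespace; a real check would look them up through the API.
    """
    problems = []
    endpoints = (monitor.get("spec") or {}).get("endpoints") or []
    for i, endpoint in enumerate(endpoints):
        interval = endpoint.get("interval")
        # An empty string matches the pattern, so reject it explicitly.
        if interval is not None and (interval == "" or not DURATION_RE.match(interval)):
            problems.append(f"endpoints[{i}]: invalid scrape interval {interval!r}")
        secret = (endpoint.get("bearerTokenSecret") or {}).get("name")
        if secret and secret not in existing_secrets:
            problems.append(f"endpoints[{i}]: secret {secret!r} not found")
    return problems
```

Running this against a monitor before applying it would surface the same class of problems that otherwise only show up as a firing alert.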
Eventually we want the Prometheus operator to be more user-friendly and provide direct feedback to the users:
1. Do more validation through the OpenAPI spec.
2. Implement/configure validating webhooks for things that can't be modeled directly with OpenAPI.
3. Implement the status subresource for service/pod monitors.
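To make item 2 more concrete, the sketch below builds the `AdmissionReview` response a validating webhook would return to the Kubernetes API server. The helper name `admission_response` and the choice of HTTP code 422 are assumptions for illustration; the validation logic that produces `errors` is out of scope here.

```python
def admission_response(review, errors):
    """Build an AdmissionReview response that rejects the object when errors exist.

    `review` is the decoded AdmissionReview request body; `errors` is a list of
    validation messages (empty when the object is acceptable).
    """
    response = {
        # The response must echo the uid of the request it answers.
        "uid": review["request"]["uid"],
        "allowed": not errors,
    }
    if errors:
        # The message is shown to the user who submitted the object.
        response["status"] = {"code": 422, "message": "; ".join(errors)}
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": response,
    }
```

With a webhook of this shape, an invalid monitor is rejected at creation time and the user sees the message directly, instead of the resource being silently skipped.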
In the meantime, the alert description should be improved to include more details about the cause and how to mitigate the issue. We also need to add a runbook to github.com/openshift/runbooks and link it from the CMO alert.
[1] https://github.com/openshift/cluster-monitoring-operator/pull/1370
DoD
- Improved the description & summary annotations of the upstream alerts.
- Dedicated runbook in openshift/runbooks.
- Everything pulled together in CMO.
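The DoD items on annotations and the runbook link could be verified with a small lint over the alerting rule. This is a sketch only: the `missing_annotations` helper is hypothetical, the `runbook_url` annotation key is assumed, and the URL in the example is a placeholder rather than the real runbook location.

```python
# Annotation keys the DoD asks for; "runbook_url" is an assumed key name.
REQUIRED_ANNOTATIONS = ("summary", "description", "runbook_url")


def missing_annotations(rule):
    """Return the required annotation keys that are absent or empty in a rule dict."""
    annotations = rule.get("annotations") or {}
    return [key for key in REQUIRED_ANNOTATIONS if not annotations.get(key)]


# Example rule; annotation texts and the URL are placeholders, not the shipped values.
example_rule = {
    "alert": "PrometheusOperatorRejectedResources",
    "annotations": {
        "summary": "Resources rejected by Prometheus operator.",
        "description": "Placeholder text describing the cause and mitigation.",
        "runbook_url": "https://example.com/runbooks/PrometheusOperatorRejectedResources.md",
    },
}
```

Calling `missing_annotations(example_rule)` returns an empty list, while a rule without a runbook link would report `runbook_url`.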
- is documented by: OBSDOCS-390 Edit content for new PrometheusOperatorRejectedResources runbook (Closed)
- relates to: OCPBUGS-36406 PrometheusOperatorRejectedResources should link its runbook (Closed)
- links to