- Task
- Resolution: Done
- Normal
- None
- None
- None
Though the documentation warns against it, users sometimes deploy a third-party Prometheus operator that conflicts with the Prometheus operators running in the openshift-monitoring and/or openshift-user-workload-monitoring namespaces. Concretely, the additional operator may repeatedly update the Prometheus statefulset, and the pods never reach the ready state (https://prometheus-operator.dev/docs/operator/troubleshooting/#prometheusalertmanager-pods-stuck-in-terminating-loop-with-healthy-start-up-logs). Such a situation is hard to troubleshoot if you're not familiar with it, so it would be helpful to identify the root cause and report it to the cluster admins.
We can't expect much help from alerts because when this happens, Prometheus and/or Alertmanager are down anyway. One option would be to report the issue via CMO's ClusterOperator resource. CMO could watch the managed statefulsets and track the .metadata.generation field: if the value increases repeatedly over a short period of time, it's a sure sign that a rogue controller is conflicting with CMO. The reported condition could look like this:
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  name: monitoring
spec: {}
status:
  conditions:
  - lastTransitionTime: '2023-12-21T14:04:05Z'
    message: |-
      updating prometheus-k8s: waiting for Prometheus object changes failed:
      waiting for Prometheus openshift-monitoring/k8s: expected 2 replicas, got 1
      available replicas. The generation number of the "prometheus-k8s" statefulset
      has been updated X times in the last Y seconds. It could be an indication
      that an external actor (such as another Prometheus operator running in the
      cluster) conflicts with the OCP monitoring operator.
    reason: UpdatingPrometheusK8SFailed
    status: 'False'
    type: Available
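
A minimal sketch of how the generation-churn detection could work, assuming CMO feeds each StatefulSet update seen by its informer into a small tracker. The generationTracker type, the 5-bumps-in-5-minutes threshold, and the simulated event stream below are illustrative assumptions, not the actual CMO implementation.

package main

import (
	"fmt"
	"time"
)

// generationTracker records when the .metadata.generation of a watched
// StatefulSet increases and reports whether the churn within a sliding
// window exceeds a threshold. In CMO this would be fed from the
// StatefulSet informer's update handler.
type generationTracker struct {
	window    time.Duration // look-back period (the "Y seconds" in the condition message)
	threshold int           // number of bumps (the "X times") considered suspicious
	lastGen   int64
	bumps     []time.Time // timestamps of observed generation increases
}

func newGenerationTracker(window time.Duration, threshold int) *generationTracker {
	return &generationTracker{window: window, threshold: threshold}
}

// observe takes the current .metadata.generation of the StatefulSet and
// returns true when the generation has increased at least `threshold`
// times within `window`, i.e. when another controller is likely fighting
// over the object.
func (t *generationTracker) observe(generation int64, now time.Time) bool {
	if generation > t.lastGen {
		if t.lastGen != 0 { // ignore the very first observation
			t.bumps = append(t.bumps, now)
		}
		t.lastGen = generation
	}
	// Drop bumps that fell out of the sliding window.
	cutoff := now.Add(-t.window)
	kept := t.bumps[:0]
	for _, ts := range t.bumps {
		if ts.After(cutoff) {
			kept = append(kept, ts)
		}
	}
	t.bumps = kept
	return len(t.bumps) >= t.threshold
}

func main() {
	// Hypothetical values: flag 5 or more generation bumps within 5 minutes.
	tracker := newGenerationTracker(5*time.Minute, 5)

	// Simulated stream of generations, as an informer would deliver them.
	now := time.Now()
	for i, gen := range []int64{3, 4, 5, 6, 7, 8} {
		if tracker.observe(gen, now.Add(time.Duration(i)*time.Second)) {
			fmt.Printf("generation of prometheus-k8s bumped %d times in the last %s: possible conflicting operator\n",
				len(tracker.bumps), tracker.window)
		}
	}
}

When the tracker fires, CMO would only need to append the extra sentence to the message of the condition it already sets, so no new condition type or alert is required.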