OpenShift Monitoring / MON-3352

Detect multiple operators managing the same prometheus/alertmanager/thanosruler custom resource


    • Type: Task
    • Resolution: Done
    • Priority: Normal
    • Component: CMO
    • Sprint: MON Sprint 246, MON Sprint 247, MON Sprint 249

      Though the documentation warns against it, users sometimes deploy a third-party Prometheus operator which conflicts with the Prometheus operators running in the openshift-monitoring and/or openshift-user-workload-monitoring namespaces. Concretely, the additional operator may try to update the Prometheus statefulset, and the pods never reach the ready state (https://prometheus-operator.dev/docs/operator/troubleshooting/#prometheusalertmanager-pods-stuck-in-terminating-loop-with-healthy-start-up-logs). Such a situation is hard to troubleshoot for anyone unfamiliar with it, so it would be helpful to identify the root cause and report it to the cluster admins.

      We can expect little help from alerts because, when this happens, Prometheus and/or Alertmanager are down anyway. One option would be to report the issue via CMO's ClusterOperator resource. CMO could watch the managed statefulsets and track the .metadata.generation field: if the value increases several times over a short period, it's a sure sign that a rogue controller is conflicting with CMO (see the sketch after the example condition below).

      - apiVersion: config.openshift.io/v1
        kind: ClusterOperator
        metadata: 
          name: monitoring
        spec: {}
        status: 
          conditions: 
          - lastTransitionTime: '2023-12-21T14:04:05Z'
            message: |-
              updating prometheus-k8s: waiting for Prometheus object changes failed: waiting for Prometheus openshift-monitoring/k8s: expected 2 replicas, got 1 available replicas.
              The generation number of the "prometheus-k8s" statefulset has been updated X times in the last Y seconds. It could be an indication that an external actor (such as another Prometheus operator running in the cluster) conflicts with the OCP monitoring operator.
            reason: UpdatingPrometheusK8SFailed
            status: 'False'
            type: Available
      
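      A minimal sketch of the detection idea, assuming a client-go statefulset informer and hypothetical thresholds (more than 3 generation changes in 5 minutes); it only prints a message, whereas CMO would feed the result into its ClusterOperator status as shown above:

      package main

      import (
          "fmt"
          "time"

          appsv1 "k8s.io/api/apps/v1"
          "k8s.io/client-go/informers"
          "k8s.io/client-go/kubernetes"
          "k8s.io/client-go/tools/cache"
          "k8s.io/client-go/tools/clientcmd"
      )

      // bumps keeps, per statefulset, the timestamps of recent generation changes.
      var bumps = map[string][]time.Time{}

      // tooManyBumps records a generation change and reports whether more than
      // threshold changes happened within window.
      func tooManyBumps(name string, now time.Time, window time.Duration, threshold int) bool {
          recent := bumps[name][:0]
          for _, ts := range bumps[name] {
              if now.Sub(ts) <= window {
                  recent = append(recent, ts)
              }
          }
          recent = append(recent, now)
          bumps[name] = recent
          return len(recent) > threshold
      }

      func main() {
          cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
          if err != nil {
              panic(err)
          }
          client := kubernetes.NewForConfigOrDie(cfg)

          factory := informers.NewSharedInformerFactoryWithOptions(
              client, 0, informers.WithNamespace("openshift-monitoring"))
          factory.Apps().V1().StatefulSets().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
              UpdateFunc: func(oldObj, newObj interface{}) {
                  oldSts, newSts := oldObj.(*appsv1.StatefulSet), newObj.(*appsv1.StatefulSet)
                  if newSts.Generation == oldSts.Generation {
                      return // spec unchanged, nothing to track
                  }
                  // Hypothetical thresholds: more than 3 spec changes in 5 minutes.
                  if tooManyBumps(newSts.Name, time.Now(), 5*time.Minute, 3) {
                      fmt.Printf("statefulset %s: generation updated too often, possible conflicting operator\n", newSts.Name)
                  }
              },
          })

          stop := make(chan struct{})
          defer close(stop)
          factory.Start(stop)
          factory.WaitForCacheSync(stop)
          select {}
      }

      The generation check on UpdateFunc matters: .metadata.generation only increments on spec changes, so a burst of increments means some actor keeps rewriting the spec, which is exactly the conflicting-operator signature described above.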
