OpenShift Monitoring / MON-3352

Detect multiple operators managing the same prometheus/alertmanager/thanosruler custom resource


    • Type: Task
    • Resolution: Done
    • Priority: Normal
    • Component: CMO
    • Sprint: MON Sprint 246, MON Sprint 247, MON Sprint 249

      Though the documentation warns against it, users sometimes deploy a third-party Prometheus operator which conflicts with the Prometheus operators running in the openshift-monitoring and/or openshift-user-workload-monitoring namespaces. Concretely, the additional operator may try to update the Prometheus statefulset, and the pods never reach the ready state (https://prometheus-operator.dev/docs/operator/troubleshooting/#prometheusalertmanager-pods-stuck-in-terminating-loop-with-healthy-start-up-logs). Such a situation is hard to troubleshoot for anyone unfamiliar with it, so it would be helpful to identify the root cause and report it to the cluster admins.

      We can expect little help from alerts because, when this happens, Prometheus and/or Alertmanager are down anyway. One option would be to report the issue via CMO's ClusterOperator resource. CMO could watch the managed statefulsets and track the .metadata.generation field: if the value increases several times over a short period, it's a sure sign that a rogue controller is conflicting with CMO (see the sketch after the example condition below).

      - apiVersion: config.openshift.io/v1
        kind: ClusterOperator
        metadata: 
          name: monitoring
        spec: {}
        status: 
          conditions: 
          - lastTransitionTime: '2023-12-21T14:04:05Z'
            message: |-
              updating prometheus-k8s: waiting for Prometheus object changes failed: waiting for Prometheus openshift-monitoring/k8s: expected 2 replicas, got 1 available replicas.
              The generation number of the "prometheus-k8s" statefulset has been updated X times in the last Y seconds. It could be an indication that an external actor (such as another Prometheus operator running in the cluster) conflicts with the OCP monitoring operator.
            reason: UpdatingPrometheusK8SFailed
            status: 'False'
            type: Available
      
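      A minimal sketch of the detection idea, assuming a client-go statefulset informer and hypothetical thresholds (more than 3 generation changes in 5 minutes); it only prints a message, whereas CMO would feed the result into its ClusterOperator status as shown above:

      package main

      import (
          "fmt"
          "time"

          appsv1 "k8s.io/api/apps/v1"
          "k8s.io/client-go/informers"
          "k8s.io/client-go/kubernetes"
          "k8s.io/client-go/tools/cache"
          "k8s.io/client-go/tools/clientcmd"
      )

      // bumps keeps, per statefulset, the timestamps of recent generation changes.
      var bumps = map[string][]time.Time{}

      // tooManyBumps records a generation change and reports whether more than
      // threshold changes happened within window.
      func tooManyBumps(name string, now time.Time, window time.Duration, threshold int) bool {
          recent := bumps[name][:0]
          for _, ts := range bumps[name] {
              if now.Sub(ts) <= window {
                  recent = append(recent, ts)
              }
          }
          recent = append(recent, now)
          bumps[name] = recent
          return len(recent) > threshold
      }

      func main() {
          cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
          if err != nil {
              panic(err)
          }
          client := kubernetes.NewForConfigOrDie(cfg)

          factory := informers.NewSharedInformerFactoryWithOptions(
              client, 0, informers.WithNamespace("openshift-monitoring"))
          factory.Apps().V1().StatefulSets().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
              UpdateFunc: func(oldObj, newObj interface{}) {
                  oldSts, newSts := oldObj.(*appsv1.StatefulSet), newObj.(*appsv1.StatefulSet)
                  if newSts.Generation == oldSts.Generation {
                      return // spec unchanged, nothing to track
                  }
                  // Hypothetical thresholds: more than 3 spec changes in 5 minutes.
                  if tooManyBumps(newSts.Name, time.Now(), 5*time.Minute, 3) {
                      fmt.Printf("statefulset %s: generation updated too often, possible conflicting operator\n", newSts.Name)
                  }
              },
          })

          stop := make(chan struct{})
          defer close(stop)
          factory.Start(stop)
          factory.WaitForCacheSync(stop)
          select {}
      }

      The generation check on UpdateFunc matters: .metadata.generation only increments on spec changes, so a burst of increments means some actor keeps rewriting the spec, which is exactly the conflicting-operator signature described above.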
