- Bug
- Resolution: Won't Do
- Normal
- None
- 4.13.z
- Moderate
- No
- Rejected
- False
Description of problem:
Customer-installed Prometheus Operator-based monitoring stacks collide with, and attempt to control, resources belonging to the managed OpenShift Prometheus Operator monitoring stacks (cluster-monitoring-operator, User Workload Monitoring, RHOBS). In almost all cases the cause has been a customer Prometheus Operator installed without being restricted to a set of namespaces. The customer operator then discovers the `prometheus` custom resources owned by the managed operators and attempts to reconcile them. As the two operators fight over the same resources, the `prometheus-k8s-1` and `alertmanager-main-1` pods cycle through pending, init, running, and deletion. In practical terms there are stretches where the cluster is essentially unmonitored: SRE teams cannot receive alerts from the cluster, or the cluster appears to have disappeared entirely because DeadMansSnitch alerts fire. With no insight into what is happening on-cluster, SRE and the customer risk service degradation or outages.
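To illustrate the failure mode, here is a minimal, hypothetical sketch of such an unrestricted customer operator Deployment (the name, namespace, and image tag are assumptions, not taken from a real cluster). Because the upstream prometheus-operator `--namespaces`/`--deny-namespaces` flags are not set, it watches Prometheus and Alertmanager CRs cluster-wide, including the managed CRs in openshift-monitoring:

```yaml
# Sketch: customer-installed Prometheus Operator with an unrestricted watch scope.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: customer-prometheus-operator      # hypothetical name
  namespace: customer-monitoring          # hypothetical namespace
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: customer-prometheus-operator
  template:
    metadata:
      labels:
        app.kubernetes.io/name: customer-prometheus-operator
    spec:
      serviceAccountName: customer-prometheus-operator
      containers:
      - name: prometheus-operator
        image: quay.io/prometheus-operator/prometheus-operator:v0.65.1  # example version
        args:
        - --kubelet-service=kube-system/kubelet
        # No --namespaces or --deny-namespaces flag: the operator reconciles
        # Prometheus/Alertmanager CRs in every namespace its RBAC lets it read,
        # so it also tries to manage the CRs in openshift-monitoring.
```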
Version-Release number of selected component (if applicable):
4.x
How reproducible:
100%
Steps to Reproduce:
1. Install an unrestricted Prometheus Operator stack (example: the Prometheus Operator bundled with Cisco Service Mesh Manager - https://www.cisco.com/c/en/us/products/collateral/cloud-systems-management/intersight/nb-06-service-mesh-mgr-aag-cte-en.html).
Actual results:
The cluster-monitoring-operator, UWM, and RHOBS stacks are degraded or down, in addition to the customer-installed monitoring operator. The Prometheus and Alertmanager replicas are degraded as their pods are repeatedly created and killed. Alerts from the cluster are sometimes suppressed, and DeadMansSnitch check-in failures are frequent.
Expected results:
Managed monitoring operator stacks are unimpacted by most, if not all, customer-installed Prometheus operators.
Additional info:
I understand it is impossible to restrict the actions of customers who have been granted cluster-admin, and in an ideal world the policy and responsibility matrix would be enough to prevent these issues. However, in the interest of customer experience, it would be beneficial to prevent these operator collisions with some mix of hardening and obfuscation of the managed monitoring stacks, so that out-of-the-box Prometheus Operator installations do not interfere even when they are unconfigured or inexpertly configured. For example, the managed CR names could be extended or renamed to something like "RH-managed-prometheus" (or given a randomized suffix), and RBAC rules could be hardened to prevent other operators from reading or writing managed resources. Addressing even the lowest-hanging fruit here would probably solve 85% of the issues we see in this area.
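Until something along those lines lands, one workaround on the customer side is to scope their operator away from the managed namespaces, assuming the bundled operator exposes the upstream prometheus-operator `--namespaces`/`--deny-namespaces` flags. A hypothetical strategic-merge patch sketch (the deployment name, namespace, and file name are illustrative only):

```yaml
# scope-operator.yaml (sketch): restrict the customer operator's watch scope
# so it stops reconciling the managed CRs in openshift-monitoring.
# Apply with, for example:
#   oc -n customer-monitoring patch deployment customer-prometheus-operator \
#      --patch-file scope-operator.yaml
spec:
  template:
    spec:
      containers:
      - name: prometheus-operator
        args:
        - --kubelet-service=kube-system/kubelet
        - --namespaces=customer-monitoring
        # Alternatively, keep cluster-wide watching but exclude the managed stacks:
        # - --deny-namespaces=openshift-monitoring,openshift-user-workload-monitoring
```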
- is duplicated by: OCPBUGS-13139 thanos-ruler-user-workload in crashloop, missing value for eval-interval parameter (Closed)
- relates to: RFE-4733 Allow installation of Prometheus operators without touching built-in CRDs (Under Review)