- Epic
- Resolution: Done
- Priority: Critical
- Lean Monitoring Stack
- Status: Done
- TELCOSTRAT-87 - Single Core CPU CaaS Budget for DU Deployment w/ Single-Node OpenShift on Sapphire Rapids Platform
- 0% To Do, 0% In Progress, 100% Done
- Telco 5G RAN
Goals
- Expose a mechanism that allows the Monitoring stack to act as more of a "collect and forward" stack instead of a full E2E Monitoring solution.
- Expose a corresponding configuration to allow sending alerts to a remote Alertmanager in case a local Alertmanager is not needed.
- Support proxy environments, both via proxy settings in the Monitoring configuration and via cluster-wide proxy environment variables.
- The overall goal is to fit all platform components into 1 core (2 hyper-threads, i.e. 2 logical CPUs) for single-node OpenShift deployments. The Monitoring stack is one of the largest CPU consumers on single-node OpenShift, consuming ~200 millicores at steady state, primarily in Prometheus and node-exporter. This epic tracks optimizations to the Monitoring stack to reduce this usage as much as possible. Two items to be explored (see the sketch after this list):
- Reducing the scrape interval
- Reducing the number of series to be scraped
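Both levers map to standard Prometheus scrape configuration. Below is a minimal sketch of what they could look like; the job name, intervals, and dropped metric names are illustrative assumptions, not a recommended drop list.

```yaml
# Sketch: raw Prometheus scrape configuration showing both optimizations.
global:
  scrape_interval: 30s        # assumed: doubled from a 15s default, halving sample ingestion
scrape_configs:
  - job_name: kubelet         # hypothetical job, for illustration only
    scrape_interval: 60s      # per-job override for lower-value targets
    metric_relabel_configs:
      # Drop high-cardinality series before ingestion to shrink the TSDB workload.
      - source_labels: [__name__]
        regex: container_tasks_state|container_memory_failures_total
        action: drop
```

Dropping series via metric_relabel_configs avoids storing them at all, which is where most of the CPU savings would come from; raising the scrape interval trades alerting latency for proportionally fewer samples.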
Non-Goals
- Switching off all Monitoring components.
- Reducing metrics from any component not owned by the Monitoring team.
Motivation
Currently, OpenShift Monitoring is a full E2E solution for monitoring infrastructure and workloads locally inside a single cluster. It comes with everything an SRE needs, from configuring the scraping of metrics to configuring where alerts go.
With deployment models like Single Node OpenShift and/or resource-restricted environments, we now face the challenge that many of these functions are already available centrally, or are unnecessary given the nature of a specific cluster (e.g. Far Edge). Therefore, the components that expose these functions do not need to be deployed locally.
Also, Grafana is not FIPS compliant, because it uses PBKDF2 from x/crypto to derive a 32-byte key from a secret and salt, which is then used as the encryption key. Quoting https://bugzilla.redhat.com/show_bug.cgi?id=1931408#c10: "it may be a problem to sell Openshift into govt agencies if grafana is a required component."
Alternatives
We could make the Monitoring stack as-is completely optional and provide a more "agent-like" component. Unfortunately, that would probably take much more time and, in the end, just reproduce what we already have with fewer components. It would also not reduce the number of samples scraped, which has the most impact on CPU usage.
Acceptance Criteria
- Verify that all alerts fire against a remote Alertmanager when a user configures that option.
- Verify that Alertmanager is not deployed when a user configures that option in the cluster-monitoring-operator configmap.
- Verify that if a local Alertmanager is deployed and a user decides to also use a remote Alertmanager, the Monitoring stack sends alerts to both destinations.
- Verify that Grafana is not deployed when a user configures that option in the cluster-monitoring-operator configmap.
- Verify that Prometheus fires alerts against an external Alertmanager in proxy environments, with (1) proxy settings configured inside CMO and (2) cluster-wide proxy settings supplied through environment variables. (A configmap sketch covering the options above follows this list.)
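A minimal sketch of the cluster-monitoring-config ConfigMap exercising the options above, following its existing conventions; the exact field names (enabled, additionalAlertmanagerConfigs) and the endpoint are assumptions until the API lands.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    alertmanagerMain:
      enabled: false                  # assumed knob: skip deploying the local Alertmanager
    grafana:
      enabled: false                  # assumed knob: skip deploying Grafana
    prometheusK8s:
      additionalAlertmanagerConfigs:  # assumed knob: forward alerts to a remote Alertmanager
        - scheme: https
          apiVersion: v2
          staticConfigs:
            - external-alertmanager.example.com:9093   # placeholder endpoint
```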
Risk and Assumptions
Documentation Considerations
- Any additions to our ConfigMap API and their possible values.
Open Questions
- If we set a URL for a remote Alertmanager, how do we handle authentication?
  - Configuration of remote Alertmanagers would support whatever Prometheus supports (basic auth, client TLS auth, and bearer token).
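For reference, Prometheus's native alerting block already supports these authentication options, which bounds what the configmap API would need to expose; the endpoint and file paths below are placeholders.

```yaml
# Sketch: Prometheus alerting configuration authenticating against a remote Alertmanager.
alerting:
  alertmanagers:
    - scheme: https
      api_version: v2
      static_configs:
        - targets: ['external-alertmanager.example.com:9093']  # placeholder endpoint
      authorization:             # bearer-token auth (mutually exclusive with basic_auth)
        type: Bearer
        credentials_file: /etc/prometheus/secrets/am-token     # placeholder path
      tls_config:                # client TLS auth
        ca_file: /etc/prometheus/secrets/ca.crt
        cert_file: /etc/prometheus/secrets/client.crt
        key_file: /etc/prometheus/secrets/client.key
```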