Type: Epic
Resolution: Done
Priority: Critical
Fix Version/s: openshift-4.9
Affects Version/s: None
Component/s: None
Labels:
- cee-training
- doc-ack
- groomed
- needs-design
- pm-request
- px-ack
- qe-ack

Epic Name:
Lean Monitoring Stack
Blocked:
False
Ready:
False
Docs QE Status:
NEW
Epic Status:
Done
Feature Link:
TELCOSTRAT-87 - Single Core CPU CaaS Budget for DU Deployment w/ Single-Node OpenShift on Sapphire Rapids Platform
Hierarchy Progress Bar:

0% To Do, 0% In Progress, 100% Done
Release Note Text:
Undefined
Product Sponsor:
Telco 5G RAN

SFDC Cases Links:
SFDC Cases Counter:
SFDC Cases Open:

Market:

Goals

Expose a mechanism that allows the Monitoring stack to act more as a "collect and forward" stack than as a full E2E Monitoring solution.
Expose a corresponding configuration to allow sending alerts to a remote Alertmanager in case a local Alertmanager is not needed.
- Support proxy environments, whether the proxy is configured in the Monitoring stack itself or through cluster-wide proxy environment variables.
The overall goal is to fit all platform components into 1 core (2 HTs, 2 CPUs) for single-node OpenShift deployments. The monitoring stack is one of the largest CPU consumers on single-node OpenShift, consuming about 200 millicores at steady state, primarily in Prometheus and the node exporter. This epic tracks optimizations to the monitoring stack to reduce this usage as much as possible. Two items to be explored:
- Reducing the scrape frequency (that is, lengthening the scrape interval)
- Reducing the number of series to be scraped
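
Both items act on the same quantity: the number of samples Prometheus ingests per second, which is roughly the number of active series divided by the scrape interval. A minimal sketch of that arithmetic follows; the series counts and intervals are made-up illustrative figures, not measurements from a cluster.

```python
def samples_per_second(active_series: int, scrape_interval_s: float) -> float:
    """Approximate ingestion rate: each series yields one sample per scrape."""
    if scrape_interval_s <= 0:
        raise ValueError("scrape interval must be positive")
    return active_series / scrape_interval_s


# Hypothetical figures, chosen only to illustrate the arithmetic.
baseline = samples_per_second(active_series=150_000, scrape_interval_s=30)
longer_interval = samples_per_second(active_series=150_000, scrape_interval_s=60)
fewer_series = samples_per_second(active_series=75_000, scrape_interval_s=30)

print(f"baseline:        {baseline:.0f} samples/s")
print(f"longer interval: {longer_interval:.0f} samples/s")
print(f"fewer series:    {fewer_series:.0f} samples/s")
```

In this simplified model, doubling the interval and halving the series count cut the ingestion rate by the same factor, so the two items can be traded against each other.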

Non-Goals

Switching off all Monitoring components.
Reducing metrics from any component not owned by the Monitoring team.

Motivation

Currently, OpenShift Monitoring is a full E2E solution for monitoring infrastructure and workloads locally inside a single cluster. It comes with everything that an SRE needs, from configuring how metrics are scraped to configuring where alerts go.

With deployment models like Single Node OpenShift and other resource-restricted environments, many of these functions are either already available centrally or unnecessary given the nature of the specific cluster (e.g. Far Edge). In those cases, there is no need to deploy the components that expose these functions.

Also, Grafana is not FIPS compliant, because it uses PBKDF2 from x/crypto to derive a 32-byte key from a secret and salt, which is then used as the encryption key. Quoting https://bugzilla.redhat.com/show_bug.cgi?id=1931408#c10: "it may be a problem to sell Openshift into govt agencies if grafana is a required component."
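
For reference, a PBKDF2 derivation of a 32-byte key has the following general shape. This is a sketch using Python's standard library, not Grafana's Go code; the hash function (SHA-256) and the iteration count (10,000) are assumptions made for illustration, and the secret and salt are placeholders.

```python
import hashlib


def derive_key(secret: bytes, salt: bytes, iterations: int = 10_000) -> bytes:
    # PBKDF2-HMAC with an assumed SHA-256 PRF, producing a 32-byte key.
    return hashlib.pbkdf2_hmac("sha256", secret, salt, iterations, dklen=32)


# Placeholder inputs, for illustration only.
key = derive_key(b"example-secret", b"example-salt")
print(len(key))  # 32
```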

Alternatives

We could make the current Monitoring stack completely optional and provide a more "agent-like" component instead. Unfortunately, that would probably take much more time and, in the end, just reproduce what we already have with fewer components. It would also not reduce the number of samples scraped, which has the most impact on CPU usage.

Acceptance Criteria

Verify that all alerts fire against a remote Alertmanager when a user configures that option.
Verify that Alertmanager is not deployed when a user configures that option in the cluster-monitoring-operator configmap.
Verify that if you have a local Alertmanager deployed and a user decides to use a remote Alertmanager, the Monitoring stack sends alerts to both destinations.
Verify that Grafana is not deployed when a user configures that option in the cluster-monitoring-operator configmap.
Verify that Prometheus fires alerts against an external Alertmanager in proxy environments, both when (1) proxy settings are configured inside CMO and when (2) cluster-wide proxy settings are provided through ENV.
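
The deployment and routing criteria above can be summarized in a small model. This is only a sketch of the expected behavior: the option names (`local_alertmanager_enabled`, `grafana_enabled`, `remote_alertmanager_urls`) and the `local-alertmanager` label are hypothetical and do not reflect the actual ConfigMap keys, and proxy handling is left out.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical label for the in-cluster Alertmanager, used only in this sketch.
LOCAL_ALERTMANAGER = "local-alertmanager"


@dataclass
class MonitoringOptions:
    """Hypothetical stand-in for the relevant ConfigMap options."""
    local_alertmanager_enabled: bool = True
    grafana_enabled: bool = True
    remote_alertmanager_urls: List[str] = field(default_factory=list)


def deployed_components(opts: MonitoringOptions) -> List[str]:
    """Components expected to be deployed for the given options."""
    components = ["prometheus"]
    if opts.local_alertmanager_enabled:
        components.append("alertmanager")
    if opts.grafana_enabled:
        components.append("grafana")
    return components


def alert_destinations(opts: MonitoringOptions) -> List[str]:
    """Alerts go to the local Alertmanager (if deployed) and to every remote one."""
    destinations = []
    if opts.local_alertmanager_enabled:
        destinations.append(LOCAL_ALERTMANAGER)
    destinations.extend(opts.remote_alertmanager_urls)
    return destinations


remote = "https://alertmanager.example.com"  # placeholder URL
lean = MonitoringOptions(False, False, [remote])
both = MonitoringOptions(remote_alertmanager_urls=[remote])

print(deployed_components(lean))  # ['prometheus']
print(alert_destinations(lean))   # ['https://alertmanager.example.com']
print(alert_destinations(both))   # ['local-alertmanager', 'https://alertmanager.example.com']
```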

Risk and Assumptions

Documentation Considerations

Any additions to our ConfigMap API and their possible values.

Open Questions

~~If we set a URL for a remote Alertmanager, how do we handle authentication?~~
Configuration of remote Alertmanagers would support whatever Prometheus supports (basic auth, client TLS auth, and bearer token).
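
As an illustration of those three options, the sketch below builds one entry for the `alerting.alertmanagers` list of a Prometheus configuration file. The key names (`basic_auth`, `bearer_token_file`, `tls_config`, `static_configs`) follow Prometheus's own configuration format; how the cluster-monitoring-operator ConfigMap exposes them is not specified here, the helper `alertmanager_entry` is hypothetical, and the host name and file paths are placeholders.

```python
import json
from typing import Optional, Tuple


def alertmanager_entry(
    target: str,
    *,
    basic_auth: Optional[Tuple[str, str]] = None,   # (username, password file)
    bearer_token_file: Optional[str] = None,
    client_cert: Optional[Tuple[str, str]] = None,  # (cert file, key file)
) -> dict:
    """Build one `alerting.alertmanagers` entry in Prometheus's config format."""
    if basic_auth and bearer_token_file:
        # Basic auth and a bearer token are mutually exclusive on one entry.
        raise ValueError("choose either basic auth or a bearer token, not both")

    entry = {"scheme": "https", "static_configs": [{"targets": [target]}]}
    if basic_auth:
        username, password_file = basic_auth
        entry["basic_auth"] = {"username": username, "password_file": password_file}
    if bearer_token_file:
        entry["bearer_token_file"] = bearer_token_file
    if client_cert:
        # Client TLS auth can be combined with either of the options above.
        cert_file, key_file = client_cert
        entry["tls_config"] = {"cert_file": cert_file, "key_file": key_file}
    return entry


# Placeholder host and paths, for illustration only.
entry = alertmanager_entry(
    "alertmanager.example.com:9093",
    bearer_token_file="/etc/example/token",
    client_cert=("/etc/example/tls.crt", "/etc/example/tls.key"),
)
print(json.dumps({"alerting": {"alertmanagers": [entry]}}, indent=2))
```

The result is printed as JSON for inspection only; Prometheus itself reads its configuration as YAML.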

Additional Notes

Assignee: Simon Pasquier

Reporter: Christian Heidenreich (Inactive)

QA Contact: Junqi Zhao


Created: 2021/03/19 5:23 AM

Updated: 2022/08/26 2:27 PM

Resolved: 2021/09/08 3:03 PM
