This is the 4.11 targeted work to improve resilience.

This instance contains only two work items. Both improve the resilience of Prometheus instances when faced with overly large scrape results. This will be achieved by limiting the number of labels that can be ingested per scrape and by limiting the body size of a scrape response.
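The two limits can be illustrated with a minimal sketch. It is not how Prometheus implements them; it only mirrors the documented behavior of the `body_size_limit` and `label_limit` scrape settings, where exceeding either limit causes the whole scrape to be treated as failed. The names `check_scrape` and `ScrapeRejected` are hypothetical, and the parsing handles only a simplified form of the text exposition format.

```python
import re


class ScrapeRejected(Exception):
    """Hypothetical error: the scrape result exceeded a configured limit."""


# Matches one label pair such as method="GET" inside the braces of a sample line.
_LABEL_PAIR = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def check_scrape(body: bytes, body_size_limit: int, label_limit: int) -> int:
    """Return the number of accepted samples, or raise ScrapeRejected.

    Assumption: both limits reject the entire scrape rather than
    dropping individual samples.
    """
    # Limit 1: reject the scrape outright if the body is too large.
    if len(body) > body_size_limit:
        raise ScrapeRejected(
            f"body is {len(body)} bytes, limit is {body_size_limit}"
        )

    samples = 0
    for line in body.decode("utf-8").splitlines():
        line = line.strip()
        # Skip blank lines and HELP/TYPE comments.
        if not line or line.startswith("#"):
            continue
        start, end = line.find("{"), line.rfind("}")
        labels = _LABEL_PAIR.findall(line[start + 1:end]) if 0 <= start < end else []
        # Limit 2: reject the scrape if any sample carries too many labels.
        if len(labels) > label_limit:
            raise ScrapeRejected(
                f"sample has {len(labels)} labels, limit is {label_limit}"
            )
        samples += 1
    return samples
```

Rejecting the whole scrape, rather than silently truncating it, makes an oversized target visible as a failed scrape instead of letting it degrade the Prometheus instance.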

Epic Goal

Investigate and implement more resiliency mechanisms into OpenShift Monitoring to protect the stack from bad actors.

This epic collects all efforts that are not yet planned for a particular version. If any tasks are planned for a release, create a new epic and move the tasks accordingly. We previously had an epic that spanned multiple releases, but that turned out to be confusing for many stakeholders; see the original epic from which this one is cloned.

Why is this important?

More teams than ever leverage our infrastructure in different areas:

Telemetry - The number of metrics is increasing exponentially, and the plan is to add more next year, not only for business data but also to serve Subscription and Cost Management.
OSD/Addons - With the introduction of UWM, OSD now has the ability to provide its customers with a monitoring stack that the customers will mostly own. As we move to a managed services model with Red Hat, UWM can't be used for Addons (managed services) or by OSD for "custom monitoring" as they will not be able to control it (the customer will).
More teams adding Monitoring to their components - We already see a growing number of new rules, dashboards, and other UX inside the Console requesting data from our infrastructure.

Currently, we still "Watchdog" most requests as much as possible since we don't want to see too much impact on our stack. That obviously does not scale and should not be the only goal for the Engineering team.

Needless to say, whatever we do here will also be valuable to customers and UWM, since they will struggle with the same issues in huge multi-tenant environments with little room for "manual governance".

Scenarios

More data coming in from different components that may also introduce a higher level of cardinality.
More teams with their own Monitoring and a multi-tenant experience inside the Console, all requesting metrics or, with rules, processing queries.
Customers attaching their own Grafana.

Acceptance Criteria

Demonstrate improved performance or reliability.

Docs

TBD

Assignee:: Unassigned

Reporter:: Robert Krátký (Inactive)

Created:: 2022/03/30 1:20 PM

Updated:: 2022/06/27 2:45 PM

Resolved:: 2022/06/27 2:45 PM
