Type: Task
Resolution: Done
Priority: Normal
Fix Version/s: OpenShift 4.11 Freeze
Affects Version/s: None
Component/s: Monitoring
Labels: stretch-goal
Sprint: devex docs #222 Jul 21-Aug 11, devex docs #223 Aug 11-Sep 1
Story Points: 3
Affects: Documentation (Ref Guide, User Guide, etc.)
Release Note Text: undefined
[QE] How to address?: ---
[QE] Why QE missed?: ---

We need to document how to set the new CMO config map option to activate the body size limit on metrics scraping: prometheusK8s.enforceBodySizeLimit

This new content should also mention that, when the option is enabled, scrapes that hit the limit can trigger the alert PrometheusScrapeBodySizeLimitHit. More information about the alert is in its runbook: https://github.com/openshift/runbooks/pull/47.

When enabled, this option limits the impact that a malicious target can have on Prometheus and the cluster as a whole.

Dev background
The context behind this is briefly mentioned in https://issues.redhat.com/browse/MON-1837. The goal of the setting is to enforce a global body_size_limit for the platform Prometheus instances, sized according to the cluster, so that we can limit how many metrics Prometheus ingests. We noticed that `sample_limit` does not completely protect against targets that expose millions of series, which results in a scrape response of hundreds of megabytes. Prometheus does not have enough RAM available to fully ingest such a response, so it runs out of memory and the node goes down, even though the kernel and the kubelet have mechanisms in place to prevent that.

A heuristic that spasquie came up with is to take the estimated maximum number of samples exposed by the most expensive target (based on the data we collect from https://issues.redhat.com/browse/MON-1637, plus a certain margin) and multiply it by 200, which is the estimated size in bytes of a sample plus a certain margin of error.
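
The heuristic described above can be sketched as a small calculation. This is only an illustration, not CMO's implementation: the function name, the `margin_percent` parameter, and the sample counts used in the usage lines are hypothetical. The only value taken from this ticket is the factor of 200 bytes per sample.

```python
# Illustrative sketch of the heuristic: limit = 200 * max samples per target.
# Everything except the factor 200 is an assumption made for this example.

BYTES_PER_SAMPLE = 200  # from the ticket: estimated sample size, margin included


def body_size_limit_bytes(max_samples_per_target: int, margin_percent: int = 0) -> int:
    """Return a body size limit in bytes for the most expensive target.

    max_samples_per_target: estimated maximum number of samples that a
        single target exposes (would depend on cluster size).
    margin_percent: hypothetical safety margin added to the sample count,
        expressed as a whole percentage to avoid floating-point rounding.
    """
    if max_samples_per_target < 0:
        raise ValueError("max_samples_per_target must be non-negative")
    if margin_percent < 0:
        raise ValueError("margin_percent must be non-negative")
    # Round the padded sample count up so the limit is never underestimated.
    padded_samples = (max_samples_per_target * (100 + margin_percent) + 99) // 100
    return BYTES_PER_SAMPLE * padded_samples


if __name__ == "__main__":
    # Hypothetical inputs, not measurements from MON-1637.
    print(body_size_limit_bytes(100_000))
    print(body_size_limit_bytes(100_000, margin_percent=20))
```

Using integer arithmetic keeps the result exact, and rounding the sample count up errs on the side of a slightly larger limit, which matters because a limit that is too small would make legitimate scrapes fail.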

In addition, this is very sensitive: if we get the maths wrong, we might end up breaking clusters. It would therefore be great to add a field to CMO's config to disable the limit in case a cluster-admin runs into an unexpected issue and knows that their setup is correct. That would at least give them a way to recover, although it would leave them in a potentially dangerous situation.

DoD:

Configure enforce_body_size_limit in CMO based on the following heuristic:
- 200 * (max_number_of_samples_per_target | depending on cluster size)
Add a field in CMO's config to disable enforce_body_size_limit

documents

MON-1838 Enforce body_size_limit

Closed

links to

openshift/openshift-docs#49065: RHDEVDoCS-4036 - CMO config map option for body size limit for metrics scraping

openshift/openshift-docs#49182: [enterprise-4.11] RHDEVDoCS-4036 - CMO config map option for body size limit for metrics scraping

openshift/openshift-docs#49183: [enterprise-4.12] RHDEVDoCS-4036 - CMO config map option for body size limit for metrics scraping

Assignee: Brian Burt

Reporter: Brian Burt

QA Contact: Junqi Zhao

Votes: 0

Watchers: 1

Created: 2022/04/28 12:51 PM

Updated: 2022/08/16 7:50 PM

Resolved: 2022/08/16 7:50 PM
