Type: Task
Resolution: Done
Priority: Normal
Fix Version/s: openshift-4.10
Affects Version/s: None
Component/s: prometheus-operator
Labels:
- groomed
- no-qe
- upstream

Blocked:
False
Ready:
False
Epic Link:
MON-1277
Docs QE Status:
NEW
QE Status:
NEW
Release Note Text:
undefined
Market:

Sprint:
Monitoring - Sprint 207


Prometheus v2.28.0 added a feature that makes it possible to limit the size of the response body that a scrape can have. It is currently experimental upstream, but it can be really useful for resiliency purposes downstream.

The goal of the feature is to add a safety net that prevents Prometheus from killing a node because of a malicious target. During the rebase onto Kubernetes 1.22, we noticed that Kubernetes had added a namespace label to a metric, which caused a cardinality explosion. This one metric was responsible for more than a million series and caused Prometheus to run out of memory when scraping the target. Even if we had set a `sample_limit`, we would not have been able to prevent this, since that check happens after ingestion. A solution would be to cap the maximum ingestion with `body_size_limit`, with a value based on the size of the cluster.

Ref to the incident: https://coreos.slack.com/archives/C02989F3P0V/p1627557636209200

DoD:

- Add bodySizeLimit to the ServiceMonitor, PodMonitor, and Probe CRDs
- Add enforcedBodySizeLimit to the Prometheus CRD

blocks

MON-1838 Enforce body_size_limit

Closed

Assignee: Jayapriya Pai

Reporter: Damien Grisonnet

Votes: 0

Watchers: 3

Created: 2021/08/19 7:17 PM

Updated: 2022/09/09 6:23 AM

Resolved: 2021/09/23 5:39 AM
