-
Task
-
Resolution: Done
-
Normal
-
None
A new feature in Prometheus v2.28.0 adds the ability to limit the size of the response body that a scrape can have. It is currently experimental upstream, but it can be really useful for resiliency purposes downstream.
The goal of the feature is to add a safety net that prevents Prometheus from killing a node because of a malicious target. During the rebase onto Kubernetes 1.22, it was noticed that Kubernetes had added a namespace label to a metric, which caused a cardinality explosion. This one metric was responsible for more than a million series, and Prometheus ran out of memory when scraping the target. Even if we had set a `sample_limit`, it would not have prevented this, since that check happens after ingestion. A solution would be to cap the maximum ingestion with `body_size_limit`, sized according to the cluster.
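For illustration, a minimal sketch of how the upstream option could sit next to `sample_limit` in a plain Prometheus scrape configuration. The job name, target address, and both limit values are placeholders chosen for this example, not recommendations from the incident:

```yaml
scrape_configs:
  - job_name: example-target          # hypothetical job name
    # Fail the scrape if the uncompressed response body exceeds this size.
    # The value is illustrative; it would need to be sized per cluster.
    body_size_limit: 50MB
    # Cap on scraped samples per target; shown only for comparison.
    sample_limit: 100000              # illustrative value
    static_configs:
      - targets: ["example.local:8443"]   # hypothetical target
```

Leaving `body_size_limit` unset (or at 0) means no limit is applied, so the cap only takes effect where it is configured explicitly.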
Ref to the incident: https://coreos.slack.com/archives/C02989F3P0V/p1627557636209200
DoD:
- Add bodySizeLimit to ServiceMonitor/PodMonitor/Probes CRDs
- Add enforcedBodySizeLimit to Prometheus CRD
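As a rough sketch of what the two DoD items might look like once implemented: the field names come from the list above, but their placement in each spec, the resource names, and the values are assumptions made for this example.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example                # hypothetical name
spec:
  bodySizeLimit: 50MB          # proposed field; placement and value assumed
  selector:
    matchLabels:
      app: example             # hypothetical label
  endpoints:
    - port: metrics
---
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: example                # hypothetical name
spec:
  enforcedBodySizeLimit: 100MB # proposed field; value assumed
```

The intent, by analogy with the existing `enforcedSampleLimit` field, would be for the Prometheus-level value to act as an upper bound on whatever individual monitors request.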
- blocks
-
MON-1838 Enforce body_size_limit
- Closed