OpenShift Monitoring / MON-1837

[upstream] support body size limit in Prometheus operator

Details

    • Monitoring - Sprint 207

    Description

      Prometheus v2.28.0 added a feature that makes it possible to limit the size of the response body that a scrape can have. It is currently experimental upstream, but it can be really useful for resiliency purposes downstream.

      The goal of the feature is to add a safety net that prevents Prometheus from killing a node because of a malicious target. During the rebase onto Kubernetes 1.22, it was noticed that Kubernetes had added a namespace label to a metric, which caused a cardinality explosion. That one metric was responsible for more than a million series, and Prometheus ran out of memory when scraping the target. Even if we had set a `sample_limit`, we would not have been able to prevent this, since that check happens after ingestion. A solution would be to cap the maximum ingestion with `body_size_limit`, set based on the size of the cluster.
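
      To illustrate why a byte cap helps where a sample cap does not, here is a minimal Python sketch. It is not Prometheus code: `read_limited`, `count_samples`, `scrape` and `BodyTooLargeError` are hypothetical names, the sample counting is a naive line count, and treating a limit of 0 as "no limit" is an assumption of the sketch. The point is only that the byte check can reject a response before it is fully read, while a sample count needs the body to be read and parsed first.

```python
import io


class BodyTooLargeError(Exception):
    """Raised when a scrape response exceeds the configured byte limit."""


def read_limited(stream, limit_bytes):
    """Read a response body, failing once it exceeds limit_bytes.

    At most limit_bytes + 1 bytes are read, so memory use stays bounded
    regardless of how much the target sends. In this sketch, a limit of 0
    (or less) is assumed to mean "no limit".
    """
    if limit_bytes <= 0:
        return stream.read()
    chunks = []
    remaining = limit_bytes + 1
    while remaining > 0:
        chunk = stream.read(remaining)
        if not chunk:  # end of stream
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    body = b"".join(chunks)
    if len(body) > limit_bytes:
        raise BodyTooLargeError(f"body exceeds {limit_bytes} bytes")
    return body


def count_samples(body):
    """Naively count non-empty, non-comment lines of a text exposition body."""
    lines = body.decode("utf-8").splitlines()
    return sum(1 for line in lines if line and not line.startswith("#"))


def scrape(stream, body_limit_bytes=0, sample_limit=0):
    """Hypothetical scrape: byte cap before parsing, sample cap after."""
    body = read_limited(stream, body_limit_bytes)
    samples = count_samples(body)  # the whole body is already in memory here
    if sample_limit and samples > sample_limit:
        raise ValueError(f"{samples} samples exceed sample_limit={sample_limit}")
    return samples


if __name__ == "__main__":
    # A target exposing one series per namespace, as in the incident above.
    payload = b"".join(
        b'demo_metric{namespace="ns%d"} 1\n' % i for i in range(1000)
    )
    try:
        scrape(io.BytesIO(payload), body_limit_bytes=1024)
    except BodyTooLargeError as err:
        print("scrape rejected:", err)
```

      In this sketch the oversized response is dropped after reading at most 1025 bytes, whereas the `sample_limit` branch is only reached once the full body has been buffered and parsed.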

      Ref to the incident: https://coreos.slack.com/archives/C02989F3P0V/p1627557636209200

      DoD:

      • Add bodySizeLimit to the ServiceMonitor, PodMonitor and Probe CRDs
      • Add enforcedBodySizeLimit to the Prometheus CRD

            People

              janantha@redhat.com Jayapriya Pai
              dgrisonn@redhat.com Damien Grisonnet