  1. Docs for Red Hat Developers
  2. RHDEVDOCS-4036

Document CMO config map option to activate the body size limit on metrics scraping


    • devex docs #222 Jul 21-Aug 11, devex docs #223 Aug 11-Sep 1
    • 3
    • Documentation (Ref Guide, User Guide, etc.)

      We need to document how to set the new CMO config map option to activate the body size limit on metrics scraping: prometheusK8s.enforceBodySizeLimit
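      As a rough illustration of what the documented procedure might produce, the sketch below renders a cluster-monitoring-config ConfigMap that sets the option named above. The helper name render_cmo_config and the example value "40MB" are assumptions made for illustration; the value format the option accepts is not stated in this ticket.

```python
def render_cmo_config(limit: str) -> str:
    """Render a ConfigMap manifest that sets the body size limit option.

    `limit` is passed through verbatim; its accepted format is an
    assumption here and should be checked against the CMO documentation.
    """
    lines = [
        "apiVersion: v1",
        "kind: ConfigMap",
        "metadata:",
        "  name: cluster-monitoring-config",
        "  namespace: openshift-monitoring",
        "data:",
        "  config.yaml: |",
        "    prometheusK8s:",
        # Field name taken from this ticket; the value is illustrative.
        f"      enforceBodySizeLimit: {limit}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print the manifest so it can be reviewed before applying it.
    print(render_cmo_config("40MB"))
```

      The rendered manifest would then be applied to the cluster in the usual way for the cluster-monitoring-config ConfigMap.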

      This new content should also mention that, when the setting is enabled, the alert PrometheusScrapeBodySizeLimitHit can be triggered. More information about the alert is available in its runbook: https://github.com/openshift/runbooks/pull/47.

      When enabled, this option limits the impact that a malicious target can have on Prometheus and the cluster as a whole.

      Dev background
      The context behind this is briefly mentioned in https://issues.redhat.com/browse/MON-1837. The goal of the setting is to enforce a global body_size_limit for the platform Prometheus instances, scaled to the size of the cluster, so that we can limit how many metrics Prometheus ingests. We noticed that `sample_limit` does not completely protect against targets that expose millions of series, which results in a scrape response of hundreds of megabytes. Prometheus would not have enough RAM available to fully ingest such a response, so it would run out of memory and the node could go down, even though the kernel and kubelet have mechanisms in place to prevent that.
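      To make the idea of a body size limit concrete, here is a minimal sketch of reading a response body in chunks and aborting as soon as a byte cap is exceeded, so that an oversized body is never fully held in memory. This is not Prometheus code; read_limited and BodySizeLimitExceeded are hypothetical names used only for illustration.

```python
import io


class BodySizeLimitExceeded(Exception):
    """Raised when a body grows past the configured byte limit."""


def read_limited(stream, limit_bytes: int, chunk_size: int = 64 * 1024) -> bytes:
    """Read `stream` fully, failing once more than `limit_bytes` were read."""
    chunks = []
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break  # end of body
        total += len(chunk)
        if total > limit_bytes:
            # Stop early instead of buffering the rest of the body.
            raise BodySizeLimitExceeded(
                f"body exceeded limit of {limit_bytes} bytes"
            )
        chunks.append(chunk)
    return b"".join(chunks)


if __name__ == "__main__":
    small = io.BytesIO(b"x" * 1000)
    print(len(read_limited(small, limit_bytes=4096)))
```

      The point of the early abort is that memory use stays bounded by the limit plus one chunk, regardless of how much data the target tries to send.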

      A heuristic that spasquie came up with is to take the estimated maximum number of samples exposed by the most expensive target (based on the data we collect from https://issues.redhat.com/browse/MON-1637, plus a certain margin) and multiply it by 200, which is the estimated size in bytes of a sample plus a certain margin of error.
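      The heuristic above can be sketched as a small calculation. The factor of 200 bytes per sample comes from this ticket; the function name body_size_limit_bytes, the margin_percent parameter, and the sample counts in the example are assumptions, since the ticket does not give the actual margin or per-cluster-size numbers.

```python
BYTES_PER_SAMPLE = 200  # estimated sample size in bytes, per this ticket


def body_size_limit_bytes(max_samples_per_target: int, margin_percent: int = 0) -> int:
    """Estimate a body size limit from the largest expected target.

    `margin_percent` is a hypothetical safety margin added on top of the
    estimated sample count; the ticket leaves its value unspecified.
    """
    if max_samples_per_target < 0 or margin_percent < 0:
        raise ValueError("inputs must be non-negative")
    # Integer ceiling division keeps the arithmetic exact.
    padded = -(-max_samples_per_target * (100 + margin_percent) // 100)
    return BYTES_PER_SAMPLE * padded


if __name__ == "__main__":
    # Illustrative numbers only: 100,000 samples with a 20% margin.
    print(body_size_limit_bytes(100_000, margin_percent=20))
```

      In practice the sample count would depend on cluster size, as the definition of done below notes.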

      In addition, since this is very sensitive and getting the maths wrong might break clusters, it would be great to add a field to CMO's config to disable the limit in case a cluster-admin runs into an unexpected issue and knows that their setup is correct. That would at least give them a way to recover, although it would leave them in a potentially dangerous situation.

      DoD:

      • Configure enforce_body_size_limit in CMO based on the following heuristic:
        • 200 * (max_number_of_samples_per_target | depending on cluster size)
      • Add a field in CMO's config to disable enforce_body_size_limit

              rhn-support-bburt Brian Burt
              rhn-support-bburt Brian Burt
              Junqi Zhao Junqi Zhao