Red Hat 3scale API Management / THREESCALE-9934

Optimize system-app Prometheus metrics to not overload Prometheus

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Component: System

      Right now system-app pods provide two metrics endpoints:

      • /metrics: this was the initial metrics endpoint. It provides a few summary Sidekiq metrics that are already better covered by the system-sidekiq pods.
      • /yabeda-metrics: these should be the metrics reported at /metrics, containing Ruby and HTTP metrics (the important ones for system-app). However, it exposes so many metrics, producing so many Prometheus time series, that the resulting cardinality causes a huge increase in Prometheus memory usage.

      In the specific case of 3scale SaaS (although any customer with monitoring enabled is affected, and the more system-app pods there are, the worse it gets), we needed to raise the Prometheus memory limit from 4GB to 10GB, and it was still not enough: https://github.com/3scale/platform/pull/1195

      So we finally disabled the /yabeda-metrics scrape in saas-operator until this endpoint gets optimized at the code (porta) level and it is safe to enable it again without risking Prometheus health. With the scrape disabled, Prometheus memory decreased immediately:

      https://github.com/3scale-ops/saas-operator/pull/251

      These are the kinds of metrics that every scraped system-app pod adds at the /yabeda-metrics endpoint (each histogram is exported as a _sum series, a _count series, and one _bucket series per le boundary):

      ...
      rails_view_runtime_seconds_sum
      rails_view_runtime_seconds_count
      rails_view_runtime_seconds_bucket
      ...
      

      And then each metric fans out into independent time series:

      • For every controller (stats/api/services, stats/api/applications, provider/signups...)
      • For every controller, a series for every action (show, usage...)
      • For every action, a series for every status (200, 302...)
      • For every status, a series for every format (json, html, /...)
      • For every format, a series for every method (get, post...)
      • For every method, a series for every histogram bucket le (0.005, 0.01...)

      Short example:

      ...
      rails_view_runtime_seconds_bucket{controller="stats/data/services",action="top_applications",status="200",format="json",method="get",le="120"} 2
      rails_view_runtime_seconds_bucket{controller="stats/data/services",action="top_applications",status="200",format="json",method="get",le="2.5"} 2
      rails_view_runtime_seconds_bucket{controller="stats/data/services",action="top_applications",status="200",format="json",method="get",le="30"} 2
      rails_view_runtime_seconds_bucket{controller="stats/data/services",action="top_applications",status="200",format="json",method="get",le="300"} 2
      rails_view_runtime_seconds_bucket{controller="stats/data/services",action="top_applications",status="200",format="json",method="get",le="5"} 2
      rails_view_runtime_seconds_bucket{controller="stats/data/services",action="top_applications",status="200",format="json",method="get",le="60"} 2
      rails_view_runtime_seconds_bucket{controller="stats/data/services",action="top_applications",status="200",format="json",method="get",le="600"} 2
      rails_view_runtime_seconds_bucket{controller="stats/data/services",action="usage",status="200",format="json",method="get",le="+Inf"} 3
      rails_view_runtime_seconds_bucket{controller="stats/data/services",action="usage",status="200",format="json",method="get",le="0.005"} 3
      rails_view_runtime_seconds_bucket{controller="stats/data/services",action="usage",status="200",format="json",method="get",le="0.01"} 3
      rails_view_runtime_seconds_bucket{controller="stats/data/services",action="usage",status="200",format="json",method="get",le="0.025"} 3
      rails_view_runtime_seconds_bucket{controller="stats/data/services",action="usage",status="200",format="json",method="get",le="0.05"} 3
      rails_view_runtime_seconds_bucket{controller="stats/data/services",action="usage",status="200",format="json",method="get",le="0.1"} 3
      rails_view_runtime_seconds_bucket{controller="stats/data/services",action="usage",status="200",format="json",method="get",le="0.25"} 3
      rails_view_runtime_seconds_bucket{controller="stats/data/services",action="usage",status="200",format="json",method="get",le="0.5"} 3
      rails_view_runtime_seconds_bucket{controller="stats/data/services",action="usage",status="200",format="json",method="get",le="1"} 3
      rails_view_runtime_seconds_bucket{controller="stats/data/services",action="usage",status="200",format="json",method="get",le="10"} 3
      rails_view_runtime_seconds_bucket{controller="stats/data/services",action="usage",status="200",format="json",method="get",le="120"} 3
      rails_view_runtime_seconds_bucket{controller="stats/data/services",action="usage",status="200",format="json",method="get",le="2.5"} 3
      rails_view_runtime_seconds_bucket{controller="stats/data/services",action="usage",status="200",format="json",method="get",le="30"} 3
      ...
      
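      To see why this explodes, note that the series count is the product of all the label values: controllers × actions × statuses × formats × methods × buckets. A rough back-of-the-envelope sketch, with purely hypothetical counts (the real numbers depend on the deployment):

      # All counts below are made-up illustrations, not measured values.
      controllers = 100  # distinct Rails controllers
      actions     = 5    # actions per controller, on average
      statuses    = 4    # 200, 302, 404, 500...
      formats     = 3    # json, html, ...
      methods     = 3    # get, post, ...
      buckets     = 20   # histogram le boundaries, including +Inf

      series = controllers * actions * statuses * formats * methods * buckets
      puts series  # => 360000 time series for a single _bucket metric family

      And Prometheus keeps one such set per scraped system-app pod, so the total grows linearly with the number of system-app replicas.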

      The problem is that, being so detailed, these metrics do not add any value: the same level of detail is already available in the application logs.

      Besides making Prometheus use so much memory, this also breaks the Grafana dashboard: it has to compute over so many time series that Prometheus does not answer within the expected time, and Grafana returns HTTP 504 Gateway Timeout.

      Summary of the work to be done:

      • Remove the current system-app /metrics endpoint (the Sidekiq metrics it exposes are already correctly covered by the system-sidekiq pods)
      • Move system-app /yabeda-metrics to system-app /metrics (because /metrics is the standard metrics path)
      • Optimize the system-app /yabeda-metrics (the future system-app /metrics) so they do not produce such a huge number of time series with huge cardinality; see the sketch after this list. You can check the current backend-listener metrics as a reference: backend-listener is quite well optimized, and although it produces a lot of metrics too, it is nothing compared to the system-app yabeda-metrics.
      • Contact the 3scale-operator team to remove the references to yabeda-metrics, mainly to stop scraping /yabeda-metrics, because the metrics will be published on the standard /metrics endpoint. Some of the required changes will be here: https://github.com/3scale/3scale-operator/blob/abb493cc34b7e03e473560be78e824f6ee3a255f/pkg/3scale/amp/component/system_monitoring.go#L35-L79
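
      For the optimization itself, a minimal sketch of the direction, assuming the plain Yabeda configuration DSL (the exact hooks in porta / yabeda-rails may differ): drop low-value labels such as status and format from the HTTP histograms, and use far fewer buckets. All names below are illustrative, not the final implementation:

      # Sketch only: a reduced-cardinality replacement for the default
      # yabeda-rails request histogram. Metric and tag names are hypothetical.
      Yabeda.configure do
        group :rails do
          histogram :request_duration_seconds,
                    comment: "HTTP request duration",
                    tags:    %i[controller action method], # no status/format labels
                    buckets: [0.1, 0.5, 1, 5, 30]          # 5 buckets instead of ~20
        end
      end

      # Rails already emits this notification for every request; measure it
      # with the reduced label set (event.duration is in milliseconds).
      ActiveSupport::Notifications.subscribe("process_action.action_controller") do |event|
        Yabeda.rails.request_duration_seconds.measure(
          {
            controller: event.payload[:controller],
            action:     event.payload[:action],
            method:     event.payload[:method].to_s.downcase,
          },
          event.duration / 1000.0
        )
      end

      With 100 controllers, 5 actions, 3 methods and 5 buckets, the hypothetical example above drops from 360,000 to 7,500 _bucket series per pod.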
