Red Hat 3scale API Management / THREESCALE-9934

Optimize system-app Prometheus metrics to not overload Prometheus

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Component: System

      Right now system-app pods provide two metrics endpoints:

      • /metrics: this was the initial metrics endpoint. It provides a few summary Sidekiq metrics that are already better covered by the system-sidekiq pods.
      • /yabeda-metrics: these should be the metrics reported at /metrics, containing Ruby and HTTP metrics (the important ones for system-app). However, it exposes so many metrics, producing so many Prometheus time series, that the resulting cardinality causes a huge increase in Prometheus memory usage.

      In the specific case of 3scale SaaS (although any customer with monitoring enabled is affected, and the more system-app pods there are, the worse it gets), we needed to raise the Prometheus memory limit from 4GB to 10GB, and it was still not enough: https://github.com/3scale/platform/pull/1195

      So we finally disabled the /yabeda-metrics scrape in saas-operator until this endpoint gets optimized at the code (porta) level and it is safe to enable it again without risking Prometheus health. With the scrape disabled, Prometheus memory decreased immediately:

      https://github.com/3scale-ops/saas-operator/pull/251

      These are the kinds of metrics that every scraped system-app pod adds at the /yabeda-metrics endpoint (each histogram is exported as a _sum series, a _count series, and one _bucket series per le boundary):

      ...
      rails_view_runtime_seconds_sum
      rails_view_runtime_seconds_count
      rails_view_runtime_seconds_bucket
      ...
      

      And then each metric fans out into independent time series:

      • For every controller (stats/api/services, stats/api/applications, provider/signups...)
      • For every controller, a series for every action (show, usage...)
      • For every action, a series for every status (200, 302...)
      • For every status, a series for every format (json, html, /...)
      • For every format, a series for every method (get, post...)
      • For every method, a series for every histogram bucket le (0.005, 0.01...)

      Short example:

      ...
      rails_view_runtime_seconds_bucket{controller="stats/data/services",action="top_applications",status="200",format="json",method="get",le="120"} 2
      rails_view_runtime_seconds_bucket{controller="stats/data/services",action="top_applications",status="200",format="json",method="get",le="2.5"} 2
      rails_view_runtime_seconds_bucket{controller="stats/data/services",action="top_applications",status="200",format="json",method="get",le="30"} 2
      rails_view_runtime_seconds_bucket{controller="stats/data/services",action="top_applications",status="200",format="json",method="get",le="300"} 2
      rails_view_runtime_seconds_bucket{controller="stats/data/services",action="top_applications",status="200",format="json",method="get",le="5"} 2
      rails_view_runtime_seconds_bucket{controller="stats/data/services",action="top_applications",status="200",format="json",method="get",le="60"} 2
      rails_view_runtime_seconds_bucket{controller="stats/data/services",action="top_applications",status="200",format="json",method="get",le="600"} 2
      rails_view_runtime_seconds_bucket{controller="stats/data/services",action="usage",status="200",format="json",method="get",le="+Inf"} 3
      rails_view_runtime_seconds_bucket{controller="stats/data/services",action="usage",status="200",format="json",method="get",le="0.005"} 3
      rails_view_runtime_seconds_bucket{controller="stats/data/services",action="usage",status="200",format="json",method="get",le="0.01"} 3
      rails_view_runtime_seconds_bucket{controller="stats/data/services",action="usage",status="200",format="json",method="get",le="0.025"} 3
      rails_view_runtime_seconds_bucket{controller="stats/data/services",action="usage",status="200",format="json",method="get",le="0.05"} 3
      rails_view_runtime_seconds_bucket{controller="stats/data/services",action="usage",status="200",format="json",method="get",le="0.1"} 3
      rails_view_runtime_seconds_bucket{controller="stats/data/services",action="usage",status="200",format="json",method="get",le="0.25"} 3
      rails_view_runtime_seconds_bucket{controller="stats/data/services",action="usage",status="200",format="json",method="get",le="0.5"} 3
      rails_view_runtime_seconds_bucket{controller="stats/data/services",action="usage",status="200",format="json",method="get",le="1"} 3
      rails_view_runtime_seconds_bucket{controller="stats/data/services",action="usage",status="200",format="json",method="get",le="10"} 3
      rails_view_runtime_seconds_bucket{controller="stats/data/services",action="usage",status="200",format="json",method="get",le="120"} 3
      rails_view_runtime_seconds_bucket{controller="stats/data/services",action="usage",status="200",format="json",method="get",le="2.5"} 3
      rails_view_runtime_seconds_bucket{controller="stats/data/services",action="usage",status="200",format="json",method="get",le="30"} 3
      ...
      
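      To see why this explodes, note that the series count is the product of all the label values: controllers × actions × statuses × formats × methods × buckets. A rough back-of-the-envelope sketch, with purely hypothetical counts (the real numbers depend on the deployment):

      # All counts below are made-up illustrations, not measured values.
      controllers = 100  # distinct Rails controllers
      actions     = 5    # actions per controller, on average
      statuses    = 4    # 200, 302, 404, 500...
      formats     = 3    # json, html, ...
      methods     = 3    # get, post, ...
      buckets     = 20   # histogram le boundaries, including +Inf

      series = controllers * actions * statuses * formats * methods * buckets
      puts series  # => 360000 time series for a single _bucket metric family

      And Prometheus keeps one such set per scraped system-app pod, so the total grows linearly with the number of system-app replicas.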

      The problem is that, being so detailed, these metrics do not add any value: the same level of detail is already available in the application logs.

      Besides making Prometheus use so much memory, this also breaks the Grafana dashboard: it has to compute over so many time series that Prometheus does not answer within the expected time, and Grafana returns HTTP 504 Gateway Timeout.

      Summary of the work to be done:

      • Remove the current system-app /metrics endpoint (the Sidekiq metrics it exposes are already correctly covered by the system-sidekiq pods)
      • Move system-app /yabeda-metrics to system-app /metrics (because /metrics is the standard metrics path)
      • Optimize the system-app /yabeda-metrics (the future system-app /metrics) so they do not produce such a huge number of time series with huge cardinality; see the sketch after this list. You can check the current backend-listener metrics as a reference: backend-listener is quite well optimized, and although it produces a lot of metrics too, it is nothing compared to the system-app yabeda-metrics.
      • Contact the 3scale-operator team to remove the references to yabeda-metrics, mainly to stop scraping /yabeda-metrics, because the metrics will be published on the standard /metrics endpoint. Some of the required changes will be here: https://github.com/3scale/3scale-operator/blob/abb493cc34b7e03e473560be78e824f6ee3a255f/pkg/3scale/amp/component/system_monitoring.go#L35-L79
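
      For the optimization itself, a minimal sketch of the direction, assuming the plain Yabeda configuration DSL (the exact hooks in porta / yabeda-rails may differ): drop low-value labels such as status and format from the HTTP histograms, and use far fewer buckets. All names below are illustrative, not the final implementation:

      # Sketch only: a reduced-cardinality replacement for the default
      # yabeda-rails request histogram. Metric and tag names are hypothetical.
      Yabeda.configure do
        group :rails do
          histogram :request_duration_seconds,
                    comment: "HTTP request duration",
                    tags:    %i[controller action method], # no status/format labels
                    buckets: [0.1, 0.5, 1, 5, 30]          # 5 buckets instead of ~20
        end
      end

      # Rails already emits this notification for every request; measure it
      # with the reduced label set (event.duration is in milliseconds).
      ActiveSupport::Notifications.subscribe("process_action.action_controller") do |event|
        Yabeda.rails.request_duration_seconds.measure(
          {
            controller: event.payload[:controller],
            action:     event.payload[:action],
            method:     event.payload[:method].to_s.downcase,
          },
          event.duration / 1000.0
        )
      end

      With 100 controllers, 5 actions, 3 methods and 5 buckets, the hypothetical example above drops from 360,000 to 7,500 _bucket series per pod.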
