Type: Bug
Resolution: Done
Priority: Major
Affects Versions: Pipelines 1.4, Pipelines 1.4.1
Sprint: Pipelines Sprint 207
We found that the cluster's Prometheus instance was under heavy load and tracked it down to the two heaviest queries in the cluster. These were:
- tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds_bucket
- tekton_pipelines_controller_pipelinerun_duration_seconds_bucket
We trigger a lot of pipelines, so within a few days we hit ~8k PipelineRun CRs on a single cluster. For tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds_bucket we currently have ~200k series published, and ~100k for tekton_pipelines_controller_pipelinerun_duration_seconds_bucket.
It looks like some of the labels in use are causing a cardinality explosion: pipelinerun and taskrun.
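For reference, the per-metric series counts and the contribution of the suspect labels can be checked directly in Prometheus. This is a generic PromQL sketch against the metric names above, not queries taken from the incident itself:

```promql
# Active series for each of the two histogram metrics
count(tekton_pipelines_controller_pipelinerun_duration_seconds_bucket)
count(tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds_bucket)

# Number of distinct values of the suspected high-cardinality label:
# the inner count groups series per pipelinerun, the outer one counts groups
count(count by (pipelinerun) (tekton_pipelines_controller_pipelinerun_duration_seconds_bucket))
```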
Is there anything that can be done about these metrics? I think they may make Tekton unusable at our scale. Take that statement with a grain of salt, because we have "worked around" the problem by deleting the ServiceMonitor.
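A less drastic workaround than deleting the ServiceMonitor could be to drop the offending labels at scrape time via the Prometheus Operator's metricRelabelings. This is only a sketch: the object name, port, and selector below are placeholders, not the real manifest from the cluster:

```yaml
# Sketch: strip the per-run labels at scrape time instead of removing the
# ServiceMonitor entirely (Prometheus Operator relabeling syntax).
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: tekton-pipelines-controller   # placeholder name
spec:
  endpoints:
    - port: http-metrics              # placeholder port name
      metricRelabelings:
        # labeldrop removes labels whose NAME matches the regex, so each
        # per-PipelineRun/TaskRun series collapses into a per-pipeline one.
        - action: labeldrop
          regex: pipelinerun|taskrun
```

Note that dropping labels can make previously distinct series collide on scrape, and newer Tekton releases also expose metrics granularity settings in the config-observability ConfigMap, so version support should be checked before relying on either approach.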
This was discovered during an incident on an OSD cluster where Tekton is widely used. The incident's RCA doc: https://docs.google.com/document/d/1U_xJtIBDABCEbJhVdJK7Tw6ftTARkdI39SLayjt56YU
Thanks a lot!