Type: Bug
Resolution: Done
Priority: Major
Affects Versions: Pipelines 1.4, Pipelines 1.4.1
Sprint: Pipelines Sprint 207
We found that the cluster's Prometheus instance was under heavy load and tracked it down to the two heaviest queries in the cluster. These were:
- tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds_bucket
- tekton_pipelines_controller_pipelinerun_duration_seconds_bucket
We trigger a lot of pipelines, so within a few days we hit ~8k PipelineRun CRs on a single cluster. For tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds_bucket we currently have ~200k series published, and ~100k for tekton_pipelines_controller_pipelinerun_duration_seconds_bucket.
It looks like some of the labels in use are causing a cardinality explosion: pipelinerun and taskrun.
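For reference, the per-metric series counts and the contribution of the suspect labels can be checked directly in Prometheus. This is a generic PromQL sketch against the metric names above, not queries taken from the incident itself:

```promql
# Active series for each of the two histogram metrics
count(tekton_pipelines_controller_pipelinerun_duration_seconds_bucket)
count(tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds_bucket)

# Number of distinct values of the suspected high-cardinality label:
# the inner count groups series per pipelinerun, the outer one counts groups
count(count by (pipelinerun) (tekton_pipelines_controller_pipelinerun_duration_seconds_bucket))
```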
Is there anything that can be done about these metrics? I think they may make Tekton unusable at our scale. Take that statement with a grain of salt, because we have "worked around" the problem by deleting the ServiceMonitor.
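A less drastic workaround than deleting the ServiceMonitor could be to drop the offending labels at scrape time via the Prometheus Operator's metricRelabelings. This is only a sketch: the object name, port, and selector below are placeholders, not the real manifest from the cluster:

```yaml
# Sketch: strip the per-run labels at scrape time instead of removing the
# ServiceMonitor entirely (Prometheus Operator relabeling syntax).
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: tekton-pipelines-controller   # placeholder name
spec:
  endpoints:
    - port: http-metrics              # placeholder port name
      metricRelabelings:
        # labeldrop removes labels whose NAME matches the regex, so each
        # per-PipelineRun/TaskRun series collapses into a per-pipeline one.
        - action: labeldrop
          regex: pipelinerun|taskrun
```

Note that dropping labels can make previously distinct series collide on scrape, and newer Tekton releases also expose metrics granularity settings in the config-observability ConfigMap, so version support should be checked before relying on either approach.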
This was discovered during an incident on an OSD cluster where Tekton is widely used. The incident's RCA doc: https://docs.google.com/document/d/1U_xJtIBDABCEbJhVdJK7Tw6ftTARkdI39SLayjt56YU
Thanks a lot!