OpenShift Pipelines · SRVKP-1528

Possible cardinality issue with Tekton Pipelines metrics


    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • Fix Version/s: Pipelines 1.6
    • Affects Version/s: Pipelines 1.4, Pipelines 1.4.1
    • Component/s: Tekton Pipelines
    • Labels: None
    • Sprint: Pipelines Sprint 207

      We found that the cluster's Prometheus instance was under heavy load and tracked it down to the two heaviest queries in the cluster. These were:

      • tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds_bucket
      • tekton_pipelines_controller_pipelinerun_duration_seconds_bucket

      We trigger a lot of pipelines, so within a few days we hit ~8k PipelineRun CRs on a single cluster.

      For tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds_bucket we currently have ~200k series published, and ~100k for tekton_pipelines_controller_pipelinerun_duration_seconds_bucket.

      It looks like some of the labels in use are causing a cardinality explosion: pipelinerun and taskrun.
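      For intuition, per-run label values make cardinality grow with the number of runs, not with the number of pipelines defined. A back-of-the-envelope sketch; the TaskRuns-per-PipelineRun average and the bucket count are assumptions for illustration, not measured values:

```python
# Estimate time-series cardinality for a histogram metric whose labels
# include per-run identifiers (pipelinerun, taskrun).
pipelineruns = 8_000      # ~8k PipelineRun CRs observed on the cluster
taskruns_per_run = 2.5    # assumed average TaskRuns per PipelineRun
buckets = 11              # assumed duration buckets per histogram

# Every unique (pipelinerun, taskrun) pair mints its own full set of
# *_bucket series, so the totals scale linearly with completed runs.
taskrun_series = int(pipelineruns * taskruns_per_run * buckets)
pipelinerun_series = pipelineruns * buckets

print(taskrun_series)      # on the order of the ~200k series observed
print(pipelinerun_series)  # on the order of the ~100k series observed
```

      With these assumed inputs the estimate lands at roughly 220k and 88k series, the same order of magnitude as the counts reported above.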

       

      Is there anything that can be done about these metrics? I think they may make Tekton unusable at our scale; take that statement with a grain of salt, though, because we have "worked around" the issue by deleting the ServiceMonitor.
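      A less drastic workaround than deleting the ServiceMonitor entirely might be to drop just the heavy series at scrape time with Prometheus metric_relabel_configs. A sketch, assuming the metrics are scraped under a job like the one below (the job name is hypothetical):

```yaml
scrape_configs:
  - job_name: tekton-pipelines-controller   # hypothetical job name
    metric_relabel_configs:
      # Drop only the two high-cardinality duration histograms
      # (their _bucket, _sum, and _count series); everything else
      # from the controller is still ingested.
      - source_labels: [__name__]
        regex: tekton_pipelines_controller_pipelinerun(_taskrun)?_duration_seconds_.*
        action: drop
```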

      This was discovered during an incident on an OSD cluster where Tekton is widely used. The incident's RCA doc: https://docs.google.com/document/d/1U_xJtIBDABCEbJhVdJK7Tw6ftTARkdI39SLayjt56YU

      Thanks a lot!

            Assignee: vdemeest Vincent Demeester
            Reporter: mafriedm Maor Friedman
            Votes: 0
            Watchers: 10
