  1. OpenShift Pipelines
  2. SRVKP-4529

Build or expose metrics to determine whether the chains controller is deadlocked or its performance is severely degraded


    • Story
    • Resolution: Done
    • Normal
    • None
    • None
    • Tekton Chains
    • 3
    • False
    • None
    • False
    • KONFLUX-123 - Konflux Availability SLO phase 1
    • Release Note Not Required
    • Pipelines Sprint Pioneers 10

      Story (Required)

      As a maintainer of Konflux trying to monitor Tekton health, I want to know when Tekton Chains is deadlocked or suffering from severe performance degradation.

      <Describes high level purpose and goal for this story. Answers the questions: Who is impacted, what is it and why do we need it? How does it improve the customer’s experience?>

      Background (Required)

      <Describes the context or background related to this story>

      For the chains performance degradation, whose fix turned out to be tuning the number of threads and the k8s client QPS/burst away from development settings, I used

      sum(watcher_workqueue_depth{container='tekton-chains-controller',app='tekton-chains-controller'})

      to first demonstrate the bad numbers (queue depth in the 1000s, double-digit-second latency), and then to show that after the tuning the queue depth was single digit at worst and latency was typically sub-second.

      Establishing safe baselines in prod from historical data, plus an alert query that fires when the metric stays above a reasonable threshold, meets the minimum for this story.
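
      As a rough illustration of the baseline idea, the sketch below derives an alert threshold from historical queue-depth samples and checks whether recent samples breach it for several consecutive readings. The function names, the percentile-times-headroom rule, and the default values are assumptions made for this example; they are not existing chains or Konflux tooling, and the samples would have to be exported from Prometheus separately.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], q: float) -> float:
    """Nearest-rank percentile of a non-empty sample list, with q in (0, 1]."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    ordered = sorted(samples)
    rank = math.ceil(q * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]


def baseline_threshold(history: Sequence[float], q: float = 0.99,
                       headroom: float = 2.0) -> float:
    """Hypothetical rule: alert threshold = historical percentile * headroom."""
    return percentile(history, q) * headroom


def sustained_breach(recent: Sequence[float], threshold: float,
                     min_consecutive: int = 3) -> bool:
    """True if at least min_consecutive samples in a row exceed the threshold."""
    run = 0
    for value in recent:
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False
```

      Requiring several consecutive breaches is one way to avoid alerting on a brief spike; in an actual Prometheus alert rule the same effect would normally come from the rule's `for:` duration.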

      Out of scope

      <Defines what is not included in this story>

       

      Approach (Required)

      <Description of the general technical path on how to achieve the goal of the story. Include details like json schema, class definitions>

      Use the workqueue depth and latency from the background as the core health metric, since that is what I used first to confirm that the chains perf issue needed thread count and k8s QPS/burst tuning, and then to prove the new tuning helped. Beyond that, work with lucarval@redhat.com and the Konflux EC team to see whether the existing count metrics in chains, or known metric features still to be implemented, would be good signals for chains health.
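
      To make the depth-and-latency signal concrete, here is a minimal sketch that classifies a window of samples. The "never drains" rule for suspecting a deadlock is an assumption of this example, not something chains implements, and a busy but healthy controller can also hold a steady non-empty queue, so it would need to be combined with other signals. The limits in the usage note are only loosely based on the post-tuning numbers described in the background.

```python
from typing import Sequence


def classify_queue_health(depths: Sequence[float], latencies_s: Sequence[float],
                          max_depth: float, max_latency_s: float) -> str:
    """Classify a window of (queue depth, latency in seconds) samples, oldest first.

    Hypothetical heuristic:
      - "deadlock-suspected": the queue is non-empty at every sample and never shrinks
      - "degraded": the latest depth or latency exceeds its limit
      - "healthy": anything else
    """
    if len(depths) < 2 or len(depths) != len(latencies_s):
        raise ValueError("need at least two samples and equal-length series")

    always_non_empty = all(d > 0 for d in depths)
    never_shrinks = all(later >= earlier
                        for earlier, later in zip(depths, depths[1:]))
    if always_non_empty and never_shrinks:
        return "deadlock-suspected"

    if depths[-1] > max_depth or latencies_s[-1] > max_latency_s:
        return "degraded"
    return "healthy"
```

      For example, with `max_depth=10` and `max_latency_s=1.0`, a window whose depth sits in the 1000s but is still draining would be reported as "degraded" rather than "deadlock-suspected".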

       

      Then, investigate the chains reconciler to see if there are certain labels, annotations, etc. that it always sets, and build a metric that confirms those are set.  This validation is extra credit.
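
      One possible shape for that validation is sketched below: count finished objects that are missing an expected annotation, so the count can be exposed as a gauge. The annotation key `chains.tekton.dev/signed` is used here as an assumption for illustration; which keys the reconciler always sets is exactly what the investigation would need to confirm. The objects are plain dicts, as they would appear when decoded from Kubernetes API JSON, and the function names are hypothetical.

```python
from typing import Any, Iterable, Mapping, Sequence

# Assumption: illustrative key only; confirm against the chains reconciler.
EXPECTED_ANNOTATIONS = ("chains.tekton.dev/signed",)


def is_finished(obj: Mapping[str, Any]) -> bool:
    """True if the object has a 'Succeeded' condition whose status is True or False."""
    status = obj.get("status") or {}
    for cond in status.get("conditions") or []:
        if cond.get("type") == "Succeeded" and cond.get("status") in ("True", "False"):
            return True
    return False


def count_missing_annotations(objects: Iterable[Mapping[str, Any]],
                              expected: Sequence[str] = EXPECTED_ANNOTATIONS) -> int:
    """Count finished objects that lack at least one expected annotation key."""
    missing = 0
    for obj in objects:
        if not is_finished(obj):
            continue  # still running, so chains is not expected to have acted yet
        annotations = (obj.get("metadata") or {}).get("annotations") or {}
        if any(key not in annotations for key in expected):
            missing += 1
    return missing
```

      A count that stays above zero for longer than normal processing time would suggest the controller is not reconciling; the acceptable delay would have to come from the same prod baselines.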

      Dependencies

      <Describes what this story depends on. Dependent Stories and EPICs should be linked to the story.>

       

      Acceptance Criteria  (Mandatory)

      <Describe edge cases to consider when implementing the story and defining tests>

      <Provides a required and minimum list of acceptance tests for this story. More is expected as the engineer implements this story>

       

      Done Checklist

      • Code is completed, reviewed, documented and checked in
      • Unit and integration test automation has been delivered and is running cleanly in the continuous integration/staging/canary environment
      • Continuous Delivery pipeline(s) is able to proceed with new code included
      • Customer facing documentation, API docs etc. are produced/updated, reviewed and published
      • Acceptance criteria are met

              gmontero@redhat.com Gabe Montero
              gmontero@redhat.com Gabe Montero
              Votes: 0
              Watchers: 2
