OpenShift Pipelines · SRVKP-4522

build metric to determine if the core tekton controller is not creating pods for pipelines, determine if it is deadlocked


    • Sprint 258, Pipelines Sprint Pioneers 3, Pipelines Sprint Pioneers 4

      Story (Required)

      As a maintainer of Konflux trying to monitor Tekton health, I want to know when Tekton is not even attempting to create Pods for validly formatted PipelineRuns.

      <Describes high level purpose and goal for this story. Answers the questions: Who is impacted, what is it and why do we need it? How does it improve the customer’s experience?>

      Background (Required)

      <Describes the context or background related to this story>

      While our existing alert metrics will signal when the core controller is struggling, performance-wise, to create the Pods required for PipelineRuns, it has been pointed out that if the controller locks up entirely and does not even attempt to create the Pods, that level of outage could be missed.

      Also, while an increasing workqueue depth past a sufficient threshold could conceivably serve as an indicator that the core controller is deadlocked, that signal is very coarse-grained and gives no insight into the specific projects exhibiting the deadlock. That insight could be very important: after examining specific namespaces we might find the alert firing is in fact a false positive, with the namespace or the node it is on simply being severely constrained at the Kubernetes level.

      Out of scope

      <Defines what is not included in this story>

      Building metrics for pod create failures is a different use case: the resolution there is not "why is the tekton controller frozen", but rather that the OCP platform is misconfigured, user usage surpasses quota, or the pipeline was misconfigured.  Resolving those outcomes is outside the scope of core tekton and those responsible for supporting it.

       

      When time permits, separate scenarios can be negotiated, with new epics opened.

      Approach (Required)

      <Description of the general technical path on how to achieve the goal of the story. Include details like json schema, class definitions>

       

      We'll update our downstream metric exporter.  The experimental nature of the metric, and the need to get this out sooner rather than later, preempt starting upstream for this one.

      We can go upstream afterward, using downstream as a reference, if/when external customers express a desire.

      We'll add gauge metrics that increment when a taskrun is created but the pod creation attempt has not yet occurred.  Once it occurs, we'll decrement the gauge metric.

      If the pod name is set, decrement.  If the pod create fails and that failure is recorded in the succeeded condition, decrement.

      Dependencies

      <Describes what this story depends on. Dependent Stories and EPICs should be linked to the story.>

       

      Acceptance Criteria  (Mandatory)

      <Describe edge cases to consider when implementing the story and defining tests>

      <Provides a required and minimum list of acceptance tests for this story. More is expected as the engineer implements this story>

       

      Done Checklist

      • Code is completed, reviewed, documented and checked in
      • Unit and integration test automation have been delivered and running cleanly in continuous integration/staging/canary environment
      • Continuous Delivery pipeline(s) is able to proceed with new code included
      • Customer facing documentation, API docs etc. are produced/updated, reviewed and published
      • Acceptance criteria are met

            gmontero@redhat.com Gabe Montero
