OpenShift Pipelines / SRVKP-8078

Create metrics to measure latency to store a run after completion


    • Release Note Not Required
    • Done
    • Pipelines Sprint Crookshank 32, Pipelines Sprint Crookshank 33, Pipelines Sprint Crookshank 34

      Story (Required)

      As a <PERSONA> trying to <ACTION> I want <THIS OUTCOME>

      <Describes high level purpose and goal for this story. Answers the questions: Who is impacted, what is it and why do we need it? How does it improve the customer’s experience?>

      Background (Required)


      A few details:

      • stored_latency_seconds: distribution (histogram), facets: {type (TaskRun/PipelineRun), namespace (maybe optional), success: boolean} - records the time between when the Run finished and when Results was able to mark it as Stored. Since this is a histogram we can derive a lot from it: `sum/count` gives the average latency, and `count` gives the number of runs stored, faceted on storage success, namespace, type, etc. A sketch of such a metric definition follows the note below.

      > In all of the above, the metrics should be per unique Run. That is to say, if a PipelineRun is upserted 12 times over its lifetime, it's useful to know it was stored 12 times, but that isn't the purpose of these metrics; the discrete number of PipelineRuns and TaskRuns is what matters. All of those are helpful to understand per namespace as well.
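      As a rough illustration only (bucket boundaries, tag names, and the package layout are assumptions, not decided values), the metric and its facets could be declared with OpenCensus, which Tekton components typically expose through knative.dev/pkg/metrics, along these lines:

```go
package metrics

import (
	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
	"go.opencensus.io/tag"
)

var (
	// stored_latency_seconds: time between the Run finishing and Results
	// marking it as Stored.
	storedLatency = stats.Float64(
		"stored_latency_seconds",
		"Time from Run completion until the Run is marked Stored by Results",
		"s",
	)

	// Facets from the description above; key names are illustrative.
	typeKey      = tag.MustNewKey("type")      // TaskRun or PipelineRun
	namespaceKey = tag.MustNewKey("namespace") // maybe optional
	successKey   = tag.MustNewKey("success")   // "true" / "false"

	storedLatencyView = &view.View{
		Name:        storedLatency.Name(),
		Description: storedLatency.Description(),
		Measure:     storedLatency,
		// Bucket boundaries are a guess; tune to observed storage latencies.
		Aggregation: view.Distribution(0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30, 60, 300),
		TagKeys:     []tag.Key{typeKey, namespaceKey, successKey},
	}
)

func init() {
	if err := view.Register(storedLatencyView); err != nil {
		panic(err)
	}
}
```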

       

      To clarify the above: most of what these metrics can tell us about Results will be inaccurate if they're emitted on every reconciliation. For example, in the case of "latency between PipelineRun being Done and being Stored", I want to know how long after completion a PLR was stored in the database. However, if we emit the metric on every reconciliation and a PLR is reconciled 5 times after completion (maybe 0.1s, 0.4s, 0.5s, 1s, and 5m after completion), I want the metric to report `0.1`, but the metric is going to include the other data points, which will skew the actual values. This can be solved by emitting the metric(s) only when we detect a transition from one state to another: instead of emitting the metric if plr.IsDone(), we emit it if plr.CompletionTime > plr.LastStoredTime, i.e. on the reconciliation that first stores the completed run. A sketch of this check follows.
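      A minimal sketch of that transition check, reusing the measure and tag keys from the earlier snippet and assuming it is called right after Results marks the Run as Stored. The completionTime/lastStoredTime parameters are placeholders for however the controller actually tracks them, not existing Results fields:

```go
import (
	"context"
	"strconv"
	"time"

	"go.opencensus.io/stats"
	"go.opencensus.io/tag"
)

// RecordStoredLatency emits stored_latency_seconds at most once per completion:
// only on the reconciliation where a completed Run transitions to Stored,
// i.e. when its completion time is newer than the last recorded store.
func RecordStoredLatency(ctx context.Context, runType, namespace string,
	completionTime, lastStoredTime time.Time, storedSuccessfully bool) error {

	// Not done yet, or this completion was already recorded on an earlier
	// reconciliation: do nothing, so repeated reconciles don't skew the data.
	if completionTime.IsZero() || !completionTime.After(lastStoredTime) {
		return nil
	}

	ctx, err := tag.New(ctx,
		tag.Insert(typeKey, runType),
		tag.Insert(namespaceKey, namespace),
		tag.Insert(successKey, strconv.FormatBool(storedSuccessfully)),
	)
	if err != nil {
		return err
	}

	stats.Record(ctx, storedLatency.M(time.Since(completionTime).Seconds()))
	return nil
}
```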

      Out of scope

      <Defines what is not included in this story>

      Approach (Required)

      <Description of the general technical path on how to achieve the goal of the story. Include details like json schema, class definitions>

      Dependencies

      <Describes what this story depends on. Dependent Stories and EPICs should be linked to the story.>

      Acceptance Criteria (Mandatory)

      <Describe edge cases to consider when implementing the story and defining tests>

      <Provides a required and minimum list of acceptance tests for this story. More is expected as the engineer implements this story>

      INVEST Checklist

      Dependencies identified

      Blockers noted and expected delivery timelines set

      Design is implementable

      Acceptance criteria agreed upon

      Story estimated

      Legend

      Unknown

      Verified

      Unsatisfied

      Done Checklist

      • Code is completed, reviewed, documented and checked in
      • Unit and integration test automation have been delivered and are running cleanly in the continuous integration/staging/canary environment
      • Continuous Delivery pipeline(s) is able to proceed with new code included
      • Customer facing documentation, API docs etc. are produced/updated, reviewed and published
      • Acceptance criteria are met

              diagrawa Divyanshu Agrawal