OpenShift Pipelines · SRVKP-4522

build metric to determine if the core tekton controller is not creating pods for pipelines, determine if it is deadlocked


    • Sprint 258, Pipelines Sprint Pioneers 3, Pipelines Sprint Pioneers 4

      Story (Required)

      As a maintainer of Konflux trying to monitor Tekton health, I want to know when Tekton is not even attempting to create Pods for validly formatted PipelineRuns.

      <Describes high level purpose and goal for this story. Answers the questions: Who is impacted, what is it and why do we need it? How does it improve the customer’s experience?>

      Background (Required)

      <Describes the context or background related to this story>

      While our existing alert metrics will signal when the core controller is struggling, performance-wise, to create the Pods required for PipelineRuns, it has been pointed out that if the controller locks up entirely and does not even attempt to create the Pods, that level of outage could be missed.

      Also, while an increasing workqueue depth past a sufficient threshold could conceivably serve as an indicator that the core controller is deadlocked, that signal is very coarse-grained and gives no insight into the specific projects exhibiting the deadlock. That insight could be very important: after examining specific namespaces we might find the alert firing is in fact a false positive, with the namespace or the node it is on simply being severely constrained at the Kubernetes level.

      Out of scope

      <Defines what is not included in this story>

      Building metrics for pod create failures is a different use case: the resolution there is not "why is the tekton controller frozen", but rather that the OCP platform is misconfigured, user usage surpasses quota, or the pipeline was misconfigured.  Resolving those outcomes is outside the scope of core tekton and those responsible for supporting it.

       

      When time permits, separate scenarios can be negotiated, with new epics opened.

      Approach (Required)

      <Description of the general technical path on how to achieve the goal of the story. Include details like json schema, class definitions>

       

      We'll update our downstream metric exporter.  The experimental nature of the metric, and the need to get this out sooner rather than later, preempt starting upstream for this one.

      We can go upstream afterward, using downstream as a reference, if/when external customers express a desire.

      We'll add gauge metrics that increment when a taskrun is created but the pod creation attempt has not yet occurred.  Once it occurs, we'll decrement the gauge metric.

      If the pod name is set, decrement.  If the pod create fails and that failure is recorded in the succeeded condition, decrement.

      Dependencies

      <Describes what this story depends on. Dependent Stories and EPICs should be linked to the story.>

       

      Acceptance Criteria  (Mandatory)

      <Describe edge cases to consider when implementing the story and defining tests>

      <Provides a required and minimum list of acceptance tests for this story. More is expected as the engineer implements this story>

       

      Done Checklist

      • Code is completed, reviewed, documented and checked in
      • Unit and integration test automation have been delivered and running cleanly in continuous integration/staging/canary environment
      • Continuous Delivery pipeline(s) is able to proceed with new code included
      • Customer facing documentation, API docs etc. are produced/updated, reviewed and published
      • Acceptance criteria are met

            gmontero@redhat.com Gabe Montero
