Story
Resolution: Done
Major
None
None
None
Story (Required)
As a DevOps engineer debugging a failed Tekton PipelineRun
I want an API endpoint (/pipelinerun/explainFailure) that either lists failed TaskRuns for deeper inspection or directly analyzes the PipelineRun failure
So that I can quickly determine whether the failure lies inside the TaskRuns or at the PipelineRun level, and troubleshoot efficiently.
Background (Required)
Currently, investigating a failed PipelineRun requires:
- Checking the PipelineRun status.
- Inspecting associated TaskRuns.
- Drilling down into failing TaskRuns individually.
This is manual and error-prone. The story streamlines the workflow:
- If TaskRuns exist → return the list of failed TaskRuns and prompt the user to diagnose each via /taskrun/explainFailure.
- If no TaskRuns exist → analyze the PipelineRun failure directly with an LLM.
This improves developer productivity and reduces time to resolution.
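The branching described above can be sketched as a small pure function. This is an illustrative sketch only, not the actual service implementation; the function name, the "next" field, and the wording of the analysis string are assumptions.

````python
def triage_pipelinerun(failed_taskruns, pipelinerun_status_message):
    """Decide how to explain a failed PipelineRun.

    failed_taskruns: names of TaskRuns that failed (may be empty).
    pipelinerun_status_message: message from the PipelineRun's Succeeded condition.
    """
    if failed_taskruns:
        # TaskRuns exist: return the list and defer per-task diagnosis
        # to the /taskrun/explainFailure endpoint.
        return {
            "failedTaskRuns": failed_taskruns,
            "analysis": None,
            "next": "Diagnose each TaskRun via /taskrun/explainFailure",
        }
    # No TaskRuns: the failure happened before any TaskRun was scheduled,
    # so analyze the PipelineRun's own status message directly.
    return {
        "failedTaskRuns": [],
        "analysis": "No TaskRuns were created. " + pipelinerun_status_message,
        "next": None,
    }
````

With no TaskRuns, the PipelineRun's condition message is surfaced directly; otherwise the caller gets the failed-TaskRun list and a pointer to the per-TaskRun endpoint.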
Out of scope
- Automatic diagnosis of all failed TaskRuns in a PipelineRun (only listing is included).
- Multi-pipeline correlation.
- Automatic retries or self-healing.
Approach (Required)
- Check PipelineRun status
  - Fetch the PipelineRun object from Kubernetes.
  - Inspect .status.conditions.
- Check TaskRuns
  - Query associated TaskRuns using the tekton.dev/pipelineRun=<name> label.
  - If failed TaskRuns exist → return a structured list.
  - If no TaskRuns exist → analyze the PipelineRun's status message.
- Expose API
  - GET /pipelinerun/explainFailure?name=<pipelinerun>&namespace=<ns>
- API response schema
````
{
  "pipelineRun": {
    "name": "pipelinerun-go-golangci-lint",
    "namespace": "default",
    "uid": "a1b2c3d4",
    "labels": {},
    "annotations": {}
  },
  "status": {
    "phase": "Failed",
    "startTime": "2025-09-15T06:34:58Z",
    "completionTime": "2025-09-15T06:35:00Z",
    "durationSeconds": 2,
    "conditions": [
      {
        "type": "Succeeded",
        "status": "False",
        "reason": "CouldntGetTask",
        "message": "pipeline validation failed: task not found",
        "lastTransitionTime": "2025-09-15T06:35:00Z"
      }
    ]
  },
  "failedTaskRuns": [],
  "analysis": "No TaskRuns were created. PipelineRun failed during validation or scheduling."
}
````
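Assembling the "status" block of this response from a PipelineRun's .status can be sketched as follows. This is a minimal sketch under stated assumptions (plain dicts rather than client objects, ISO-8601 timestamps with a trailing Z); the function name is illustrative, not part of the actual service.

````python
from datetime import datetime

def build_status_block(status: dict) -> dict:
    """Derive the response's "status" block from a PipelineRun .status dict."""
    # Kubernetes timestamps use a trailing "Z"; normalize for fromisoformat.
    start = datetime.fromisoformat(status["startTime"].replace("Z", "+00:00"))
    end = datetime.fromisoformat(status["completionTime"].replace("Z", "+00:00"))
    # The Succeeded condition carries the failure reason and message.
    succeeded = next(
        (c for c in status.get("conditions", []) if c["type"] == "Succeeded"), None
    )
    failed = succeeded is not None and succeeded["status"] == "False"
    return {
        "phase": "Failed" if failed else "Succeeded",
        "startTime": status["startTime"],
        "completionTime": status["completionTime"],
        "durationSeconds": int((end - start).total_seconds()),
        "conditions": status.get("conditions", []),
    }
````

For the example above (start 06:34:58, completion 06:35:00, Succeeded=False), this yields phase "Failed" and durationSeconds 2, matching the schema.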
Dependencies
<Describes what this story depends on. Dependent Stories and EPICs should be linked to the story.>
Acceptance Criteria (Mandatory)
<Describe edge cases to consider when implementing the story and defining tests>
<Provides a required and minimum list of acceptance tests for this story. More is expected as the engineer implements this story>
INVEST Checklist
Dependencies identified
Blockers noted and expected delivery timelines set
Design is implementable
Acceptance criteria agreed upon
Story estimated
Legend
Unknown
Verified
Unsatisfied
Done Checklist
- Code is completed, reviewed, documented and checked in
- Unit and integration test automation has been delivered and is running cleanly in the continuous integration/staging/canary environment
- Continuous Delivery pipeline(s) is able to proceed with new code included
- Customer facing documentation, API docs etc. are produced/updated, reviewed and published
- Acceptance criteria are met