Epic
Resolution: Unresolved
Enhanced Observability for Tekton
To Do
Note: This epic was created in response to the comment from rh-ee-csalinas on https://issues.redhat.com/browse/SRVKP-7692
Epic Goal:
As a Product Owner, I propose the addition of enhanced metrics and metadata collection to increase visibility and granularity into Pipeline Runs. This will improve the ability to support customers, debug issues faster, monitor feature adoption, and drive product decisions based on data.
Why is this Important?
- Improves product observability for internal and external stakeholders.
- Enables Red Hat Support and SRE teams to better diagnose problems with detailed insights.
- Helps Product Management track feature adoption (e.g., IPAC, TTL vs History settings).
- Facilitates understanding of user workloads and pipeline composition for product improvements.
- Aligns with customer feedback asking for deeper introspection into their CI/CD operations.
Key Scenarios:
- A pipeline fails and support needs to know what Task images were involved.
- Product team wants to measure how many customers are using IPAC vs static YAML.
- SRE wants to track if TTL or History-based pruning causes more cleanup failures.
- Engineering wants to identify common upstream vs custom Task usage patterns.
- A customer files a bug where the trigger source (manual vs webhook) is relevant to debugging.
Proposed Metrics and Metadata to Capture:
| Category | Metric/Metadata | Purpose |
| --- | --- | --- |
| Pruner Configuration | TTL-based vs History-based usage | Adoption insight and failure correlation |
| Adoption Tracking | IPAC-enabled (yes/no flag per PipelineRun) | Adoption of Pipelines as Code |
| Pipeline Composition | Task IDs used per PipelineRun | Determine task source (upstream, Red Hat, custom) |
| Step Execution Details | Step-level image references | Identify base images in use, support diagnostics |
| Run Metadata | Triggering source | Understand usage patterns and potential automation gaps |
| Run Results | Execution time and status | Performance benchmarking and error/success insights |
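The exact metric names and label sets are still open; purely as a hedged illustration of the kind of instrumentation this epic proposes (not the actual Tekton or OpenShift Pipelines implementation), the Go sketch below registers placeholder Prometheus metrics covering the categories in the table, using the label conventions mentioned in the acceptance criteria. All metric names, label names, and values here (e.g., pipelinerun_duration_seconds, ipac_enabled, trigger_source) are hypothetical.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Placeholder metrics illustrating the categories from the table above.
// Names and labels are hypothetical, not a final design.
var (
	// Run Results / Run Metadata: execution time per PipelineRun, labelled
	// with status, trigger source, and IPAC flag for benchmarking and adoption queries.
	pipelineRunDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "pipelinerun_duration_seconds",
			Help:    "Duration of PipelineRun executions.",
			Buckets: prometheus.ExponentialBuckets(10, 2, 10),
		},
		[]string{"namespace", "pipeline_name", "status", "trigger_source", "ipac_enabled"},
	)

	// Pipeline Composition / Step Execution Details: which Tasks and step images ran.
	taskStepImages = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "pipelinerun_step_image_total",
			Help: "Count of step executions by Task and container image.",
		},
		[]string{"namespace", "pipeline_name", "task_id", "image"},
	)

	// Pruner Configuration: cleanup outcomes split by TTL- vs history-based pruning.
	prunerCleanups = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "pruner_cleanup_total",
			Help: "PipelineRun cleanups by pruning strategy and result.",
		},
		[]string{"namespace", "strategy", "result"},
	)
)

func main() {
	prometheus.MustRegister(pipelineRunDuration, taskStepImages, prunerCleanups)

	// Example observations for one finished run (values are illustrative only).
	pipelineRunDuration.
		WithLabelValues("team-a", "build-and-deploy", "Succeeded", "webhook", "true").
		Observe(421.0)
	taskStepImages.
		WithLabelValues("team-a", "build-and-deploy", "git-clone", "registry.example.com/ubi9/ubi:latest").
		Inc()
	prunerCleanups.WithLabelValues("team-a", "ttl", "success").Inc()

	// Expose the endpoint Prometheus would scrape.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9090", nil))
}
```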
Acceptance Criteria (Mandatory):
- CI pipelines run successfully with new metrics collected and tests automated.
- Metric data is exposed via Prometheus endpoints and validated.
- Metric labels follow existing conventions (e.g., namespace, pipeline_name, task_id).
- Documentation is updated with new metric definitions and intended usage.
- Technical enablement provided to stakeholders (Support, SRE, Docs, etc.).
- Adoption and error trends are reviewable via dashboards or queries (see the query sketch after this list).
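As one hedged illustration of how those trends could be reviewed programmatically (not a committed design), the Go sketch below runs two placeholder PromQL queries through the Prometheus HTTP API client. The metric names reuse the hypothetical ones from the earlier sketch, and the Prometheus address is an assumption to be replaced with the real monitoring endpoint.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Hypothetical in-cluster Prometheus address; adjust for the real deployment.
	client, err := api.NewClient(api.Config{Address: "http://prometheus.monitoring.svc:9090"})
	if err != nil {
		log.Fatal(err)
	}
	promAPI := v1.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Placeholder queries against the hypothetical metric names sketched earlier:
	// share of PipelineRuns flagged as IPAC-enabled, and failure rate by pipeline.
	queries := []string{
		`sum(rate(pipelinerun_duration_seconds_count{ipac_enabled="true"}[1d])) / sum(rate(pipelinerun_duration_seconds_count[1d]))`,
		`sum by (pipeline_name) (rate(pipelinerun_duration_seconds_count{status="Failed"}[1d]))`,
	}

	for _, q := range queries {
		result, warnings, err := promAPI.Query(ctx, q, time.Now())
		if err != nil {
			log.Fatal(err)
		}
		if len(warnings) > 0 {
			log.Printf("warnings: %v", warnings)
		}
		fmt.Printf("%s =>\n%v\n\n", q, result)
	}
}
```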
Done Checklist:
- Acceptance criteria are met.
- Non-functional requirements validated (performance, security, privacy).
- User journey automation is tested and validated.
- Release enablement materials are created.
- Support and SRE teams trained on interpreting and using the new data.
Dependencies:
- Coordination with Pipelines team to expose metadata in PipelineRun CRD
- Prometheus integration and dashboarding (internal platform team)
- Docs team for updating metric reference documentation
- Support/SRE teams for enablement
Previous Work (Optional):
- Existing metrics in tektoncd_pruner_*
- Previous enablement of Pipelines as Code metadata capture
Open Questions:
- Should step image tracking be full image reference (with tag/digest) or anonymized?
- Do we want to track exact Task source (Hub, Red Hat, custom), or just ID?
- Is there customer sensitivity around publishing IPAC usage metrics?
- Will adoption dashboards be internal-only or exposed to customers?
Is blocked by:
- SRVKP-8808 Testing for the epic (To Do)