Epic
Resolution: Unresolved
Major
None
Distributed tracing documentation and enhancement for OpenShift Pipelines
False
False
To Do
SRVKP-7109 - Distributed tracing for Tasks and Pipelines
97% To Do, 3% In Progress, 0% Done
Epic Goal
Provide distributed tracing across Tekton Pipeline controllers and resource paths (PipelineRun/TaskRun, ResolutionRequest, CustomRun, pod/entrypoint), so that developers can optimize reconcilers and users can observe their pipelines with Jaeger/OTLP backends.
Why is this important?
- Customer request for an end-to-end (e2e) tracing implementation
Scenarios
- To be added
Acceptance Criteria (Mandatory)
Coverage & propagation
- Tracing present in PipelineRun, TaskRun, ResolutionRequest, CustomRun controllers; key resource functions (parameters, workspaces, results, when-expressions); and pod/entrypoint flows.
- W3C trace context propagated via HTTP headers and resource annotations; preserved through webhooks and downstream work.
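A minimal sketch of how this propagation could look with the OpenTelemetry Go SDK, carrying the W3C `traceparent`/`tracestate` keys in resource annotations. The helper names are illustrative assumptions, not the existing Tekton implementation (which, per the Previous Work section, currently stores a JSON-encoded span context in an annotation).
```go
package tracing

import (
	"context"

	"go.opentelemetry.io/otel/propagation"
)

// propagator implements the W3C trace context format (traceparent/tracestate).
var propagator = propagation.TraceContext{}

// InjectIntoAnnotations writes the current span context into a resource's
// annotations so later reconciles and child resources can resume the trace.
func InjectIntoAnnotations(ctx context.Context, annotations map[string]string) {
	propagator.Inject(ctx, propagation.MapCarrier(annotations))
}

// ExtractFromAnnotations restores a span context stored in annotations,
// returning a context under which new spans join the original trace.
func ExtractFromAnnotations(ctx context.Context, annotations map[string]string) context.Context {
	return propagator.Extract(ctx, propagation.MapCarrier(annotations))
}
```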
Correlation & observability
- Trace and span IDs are attachable to Prometheus metrics and Kubernetes Events to allow click-through from metrics/events to traces (see the exemplar sketch after this list).
- Users can view a complete path from PipelineRun to pod creation in the tracing UI.
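For the first bullet above, a hedged sketch of attaching the current trace ID to a Prometheus metric as an exemplar, assuming `prometheus/client_golang` and the OpenTelemetry Go API; the metric name and wiring are illustrative.
```go
package metrics

import (
	"context"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"go.opentelemetry.io/otel/trace"
)

// reconcileDuration is an illustrative histogram for reconcile latency.
var reconcileDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name: "pipelinerun_reconcile_duration_seconds",
	Help: "Duration of PipelineRun reconcile calls.",
})

// ObserveWithTrace records a duration and, when a valid span is in the
// context, attaches its trace ID as an exemplar so dashboards can link
// straight from the metric to the trace.
func ObserveWithTrace(ctx context.Context, d time.Duration) {
	sc := trace.SpanContextFromContext(ctx)
	if eo, ok := reconcileDuration.(prometheus.ExemplarObserver); ok && sc.IsValid() {
		eo.ObserveWithExemplar(d.Seconds(), prometheus.Labels{"trace_id": sc.TraceID().String()})
		return
	}
	reconcileDuration.Observe(d.Seconds())
}
```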
Error/timeout semantics
- Errors recorded consistently with type/category (user vs system), messages, and (where unexpected) stack traces. Timeouts and cancellations traced with reason metadata.
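A sketch of the intended error/timeout recording, assuming the OpenTelemetry Go API; the attribute keys (`tekton.error.category`, `tekton.timeout.reason`) are illustrative placeholders, not an agreed convention.
```go
package tracing

import (
	"context"
	"errors"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
	"go.opentelemetry.io/otel/trace"
)

// RecordReconcileError stamps the current span with a consistent error
// shape: a user/system category, the message, and a timeout reason when
// the context deadline was exceeded.
func RecordReconcileError(ctx context.Context, err error, userError bool) {
	span := trace.SpanFromContext(ctx)
	category := "system"
	if userError {
		category = "user"
	}
	span.SetAttributes(attribute.String("tekton.error.category", category))
	if errors.Is(err, context.DeadlineExceeded) {
		span.SetAttributes(attribute.String("tekton.timeout.reason", "deadline_exceeded"))
	}
	span.RecordError(err)
	span.SetStatus(codes.Error, err.Error())
}
```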
Performance & configuration
- Head/rate-limited sampling strategies are configurable (service name/env, Jaeger HTTP/gRPC exporters), with at most ~5% overhead verified by benchmarks; attribute collection uses lazy/size-bounded patterns (see the configuration sketch after this list).
- Configuration is validated; misconfigurations produce actionable errors; docs and examples updated.
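A hedged sketch of the sampling and exporter configuration described above, assuming the OpenTelemetry Go SDK with an OTLP/HTTP exporter (which Jaeger and Tempo both accept). The endpoint, service name, and sampling ratio would come from config-tracing-style configuration and are illustrative here.
```go
package tracing

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// NewTracerProvider wires an OTLP/HTTP exporter with a parent-based,
// ratio-limited sampler and registers it as the global provider.
func NewTracerProvider(ctx context.Context, endpoint, serviceName string, ratio float64) (*sdktrace.TracerProvider, error) {
	exp, err := otlptracehttp.New(ctx,
		otlptracehttp.WithEndpoint(endpoint), // e.g. "jaeger-collector.observability.svc:4318"
	)
	if err != nil {
		return nil, err
	}
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exp),
		sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(ratio))),
		sdktrace.WithResource(resource.NewSchemaless(attribute.String("service.name", serviceName))),
	)
	otel.SetTracerProvider(tp)
	return tp, nil
}
```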
Quality bar
- Unit + integration tests for propagation, controller spans, resource-path spans, correlation hooks; reusable test utilities/mocks; 70+% coverage on tracing code.
OpenShift-specific
- Works out-of-the-box on OpenShift with: OTEL Collector + Jaeger/Tempo Operator backends; OpenShift Pipelines Operator supported versions.
- Kustomize overlays for OpenShift install (config-tracing, RBAC, NetworkPolicy, TLS).
- Supports cluster proxy/TLS, namespace multi-tenancy, and Red Hat image/FIPS constraints (see the CA-bundle sketch after this list).
- Red Hat docs include OpenShift “day-2” guidance: sampling knobs, backend options, troubleshooting.
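For the cluster proxy/TLS bullet, a hedged sketch of pointing the OTLP/HTTP exporter at a cluster-internal collector while trusting an injected CA bundle; the mount path and the helper name are illustrative assumptions.
```go
package tracing

import (
	"context"
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"os"

	"go.opentelemetry.io/otel/exporters/otlp/otlptrace"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
)

// newClusterExporter builds an OTLP/HTTP exporter that trusts a cluster CA
// bundle, e.g. one injected via the OpenShift trusted-CA ConfigMap.
func newClusterExporter(ctx context.Context, endpoint string) (*otlptrace.Exporter, error) {
	const caPath = "/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem" // illustrative mount path
	pem, err := os.ReadFile(caPath)
	if err != nil {
		return nil, fmt.Errorf("reading CA bundle: %w", err)
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(pem) {
		return nil, fmt.Errorf("no certificates parsed from %s", caPath)
	}
	return otlptracehttp.New(ctx,
		otlptracehttp.WithEndpoint(endpoint),
		otlptracehttp.WithTLSClientConfig(&tls.Config{RootCAs: pool}),
	)
}
```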
Dependencies (internal and external)
Sequence (inter-story)
Phases:
1: Foundation and Configuration
2: ResolutionRequest Controller Instrumentation
3: CustomRun Controller Instrumentation
4: PipelineRun Resource Functions Instrumentation
5: TaskRun Resource Functions Instrumentation
6: Pod Operations Instrumentation
7: Timeout Instrumentation
8: Metrics and Events Correlation
9: Performance Optimization
- Foundation (Phase-1: config, test infra, propagation) completes first; Phases 2–6 (controllers/resources/pod) can proceed in parallel; correlation (Phase-8) and enhanced error/timeout (Phase-7) depend on those; performance work (Phase-9) can run alongside 7–8.
Foundational tech
- Config plumbing for service name/env, Jaeger exporter (HTTP+gRPC), sampling; config validation & docs.
- Test harness: mock tracer, span verifiers, in-memory exporter; trace-propagation tests available for all stories.
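A sketch of the in-memory-exporter test pattern the shared harness could expose, using the OpenTelemetry Go SDK's `tracetest` package; the span name and test shape are illustrative.
```go
package tracing_test

import (
	"context"
	"testing"

	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	"go.opentelemetry.io/otel/sdk/trace/tracetest"
)

// TestReconcileEmitsSpan verifies that a span is recorded and named as
// expected, using an in-memory exporter instead of a real backend.
func TestReconcileEmitsSpan(t *testing.T) {
	exporter := tracetest.NewInMemoryExporter()
	tp := sdktrace.NewTracerProvider(sdktrace.WithSyncer(exporter))
	defer func() { _ = tp.Shutdown(context.Background()) }()

	// In a real test the code under test would receive this provider;
	// here we start a span directly to keep the example self-contained.
	_, span := tp.Tracer("pipelinerun").Start(context.Background(), "ReconcileKind")
	span.End()

	spans := exporter.GetSpans()
	if len(spans) != 1 || spans[0].Name != "ReconcileKind" {
		t.Fatalf("expected one ReconcileKind span, got %+v", spans)
	}
}
```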
Downstream touchpoints
- Resource-path work (params/workspaces/results/when) depends on base tracer in controllers and propagation in place.
- Correlation hooks depend on metrics/event emission code paths being trace-ID aware.
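A hedged sketch of making event emission trace-ID aware, assuming client-go's `EventRecorder` and the OpenTelemetry Go API; the annotation keys are illustrative, not an existing Tekton convention.
```go
package events

import (
	"context"

	"go.opentelemetry.io/otel/trace"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/tools/record"
)

// EmitTracedEvent emits a Kubernetes Event carrying the current trace and
// span IDs as annotations so events can be correlated back to traces.
func EmitTracedEvent(ctx context.Context, rec record.EventRecorder, obj runtime.Object, reason, msg string) {
	sc := trace.SpanContextFromContext(ctx)
	if !sc.IsValid() {
		rec.Event(obj, corev1.EventTypeNormal, reason, msg)
		return
	}
	rec.AnnotatedEventf(obj, map[string]string{
		"tekton.dev/trace-id": sc.TraceID().String(),
		"tekton.dev/span-id":  sc.SpanID().String(),
	}, corev1.EventTypeNormal, reason, "%s", msg)
}
```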
OpenShift integration (for the newly added stories)
- Cluster components: OpenTelemetry Collector deployment, Jaeger or Tempo (via Operator), and OpenShift Pipelines Operator at required minima.
- Platform constraints: cluster proxy settings, CA bundles, OAuth-backed UI routing where applicable, and namespace isolation (multi-tenant).
- Packaging: OpenShift overlays/kustomizations; RBAC/NetworkPolicy updates; optional ServiceMesh/ingress specifics if traces egress via mesh.
- Release/docs: Red Hat-specific guides, example ConfigMaps/Secrets, and support matrix noted in product docs.
Previous Work (Optional):
The following is the current state of this feature, which I am adding as a note here.
Currently, we have controller-level distributed tracing with function spans for PipelineRun/TaskRun reconciliation cycles, Jaeger backend support via OTLP, and trace propagation between resources.
Our areas of further work include Task execution tracing (the actual pod/step execution traces), OTEL environment variable configuration support, configurable tracing levels, and user-injectable custom traces.
From what I understand, the current OTLP HTTP tracing implementation in Tekton is already compatible with the OpenShift observability platform because OpenShift natively supports OTLP ingestion. What needs work is configuration updates (pointing to Cluster Observability Operator endpoints instead of direct Jaeger) and documentation updates (showing how to set up with the new operator instead of a standalone Jaeger installation).
This would require an epic covering, first, initial documentation and initial integration with OpenShift observability (which the customer can then test), and then implementation of the remaining traces, prioritizing whichever ones the customer requires. Of the requirements, we currently support the following, which probably need better documentation of their integration with cluster observability.
```
Support Jaeger backend - Fully implemented via OTLP HTTP exporter that works with Jaeger
Propagate traces so that subsequent reconciles of the same resource belong to the same trace - Implemented via SpanContext storage in resource status and annotation-based propagation
Propagate traces so that reconciles of a resource owned by a parent resource belong to parent span from the parent resource - Implemented: PipelineRun creates spans and propagates context to TaskRuns via tekton.dev/taskrunSpanContext annotation
Reconcile of different resources must belong to separate traces - Implemented: Each resource creates its own root span when no parent context exists
```
The following need more work and are only partially implemented:
```
Trace all functions in the PipelineRun controller - Approximately 70% complete. Currently traced functions include ReconcileKind, durationAndCountMetrics, finishReconcileUpdateEmitEvents, resolvePipelineState, reconcile, runNextSchedulableTask, createTaskRuns, createTaskRun, createCustomRuns, createCustomRun, updateLabelsAndAnnotations, updatePipelineRunStatusFromInformer. Missing spans for pipeline resolution helpers, parameter validation, workspace processing, timeout handling, and cancellation functions.
Trace all functions in the TaskRun controller - Approximately 70% complete. Currently traced functions include ReconcileKind, durationAndCountMetrics, stopSidecars, finishReconcileUpdateEmitEvents, prepare, reconcile, updateTaskRunWithDefaultWorkspaces, updateLabelsAndAnnotations, failTaskRun, createPod. Missing spans for task resolution, validation functions, pod creation helpers, resource processing, step processing, and timeout handling.
```
Four out of six requirements are fully implemented, and the tracing architecture and core functionality are working. What remains is function coverage: adding spans to the roughly 30% of controller functions that don't currently have tracing (a sketch of the span-wrapping pattern follows).
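A sketch of that span-wrapping pattern, assuming the OpenTelemetry Go API; the helper name stands in for any of the currently untraced functions listed above and is not the actual Tekton code.
```go
package pipelinerun

import (
	"context"

	"go.opentelemetry.io/otel"
)

var tracer = otel.Tracer("pipelinerun-reconciler")

// resolvePipelineParams stands in for any currently untraced helper
// (parameter validation, workspace processing, timeout handling, ...);
// the pattern is identical: open a child span, pass ctx onward, end it.
func resolvePipelineParams(ctx context.Context) error {
	ctx, span := tracer.Start(ctx, "resolvePipelineParams")
	defer span.End()

	// ... existing helper logic goes here, receiving ctx so that nested
	// calls become children of this span ...
	_ = ctx
	return nil
}
```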
Open questions:
- None
Done Checklist
- Acceptance criteria are met
- Non-functional properties of the Feature have been validated (such as performance, resource, UX, security or privacy aspects)
- User Journey automation is delivered
- Support and SRE teams are provided with enough skills to support the feature in a production environment
- is blocked by: SRVKP-8712 Testing for the epic (To Do)
- is incorporated by: SRVKP-7109 Distributed tracing for Tasks and Pipelines (New)