Epic
Resolution: Unresolved
Major
None
Distributed tracing documentation and enhancement for OpenShift Pipelines
False
False
To Do
SRVKP-7109 - Distributed tracing for Tasks and Pipelines
97% To Do, 3% In Progress, 0% Done
Epic Goal
Provide distributed tracing across Tekton Pipeline controllers and resource paths (PipelineRun/TaskRun, ResolutionRequest, CustomRun, pod/entrypoint), so that developers can optimize reconcilers and users can observe their pipelines with Jaeger/OTLP backends.
Why is this important?
- Customer request for an end-to-end (e2e) tracing implementation
Scenarios
- To be added
Acceptance Criteria (Mandatory)
Coverage & propagation
- Tracing present in PipelineRun, TaskRun, ResolutionRequest, CustomRun controllers; key resource functions (parameters, workspaces, results, when-expressions); and pod/entrypoint flows.
- W3C trace context propagated via HTTP headers and resource annotations; preserved through webhooks and downstream work.
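A minimal sketch of how this propagation could look with the OpenTelemetry Go SDK, carrying the W3C `traceparent`/`tracestate` keys in resource annotations. The helper names are illustrative assumptions, not the existing Tekton implementation (which, per the Previous Work section, currently stores a JSON-encoded span context in an annotation).
```go
package tracing

import (
	"context"

	"go.opentelemetry.io/otel/propagation"
)

// propagator implements the W3C trace context format (traceparent/tracestate).
var propagator = propagation.TraceContext{}

// InjectIntoAnnotations writes the current span context into a resource's
// annotations so later reconciles and child resources can resume the trace.
func InjectIntoAnnotations(ctx context.Context, annotations map[string]string) {
	propagator.Inject(ctx, propagation.MapCarrier(annotations))
}

// ExtractFromAnnotations restores a span context stored in annotations,
// returning a context under which new spans join the original trace.
func ExtractFromAnnotations(ctx context.Context, annotations map[string]string) context.Context {
	return propagator.Extract(ctx, propagation.MapCarrier(annotations))
}
```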
Correlation & observability
- Trace and span IDs are attachable to Prometheus metrics and Kubernetes Events to allow click-through from metrics/events to traces (see the exemplar sketch after this list).
- Users can view a complete path from PipelineRun to pod creation in the tracing UI.
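For the first bullet above, a hedged sketch of attaching the current trace ID to a Prometheus metric as an exemplar, assuming `prometheus/client_golang` and the OpenTelemetry Go API; the metric name and wiring are illustrative.
```go
package metrics

import (
	"context"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"go.opentelemetry.io/otel/trace"
)

// reconcileDuration is an illustrative histogram for reconcile latency.
var reconcileDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name: "pipelinerun_reconcile_duration_seconds",
	Help: "Duration of PipelineRun reconcile calls.",
})

// ObserveWithTrace records a duration and, when a valid span is in the
// context, attaches its trace ID as an exemplar so dashboards can link
// straight from the metric to the trace.
func ObserveWithTrace(ctx context.Context, d time.Duration) {
	sc := trace.SpanContextFromContext(ctx)
	if eo, ok := reconcileDuration.(prometheus.ExemplarObserver); ok && sc.IsValid() {
		eo.ObserveWithExemplar(d.Seconds(), prometheus.Labels{"trace_id": sc.TraceID().String()})
		return
	}
	reconcileDuration.Observe(d.Seconds())
}
```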
Error/timeout semantics
- Errors recorded consistently with type/category (user vs system), messages, and (where unexpected) stack traces. Timeouts and cancellations traced with reason metadata.
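A sketch of the intended error/timeout recording, assuming the OpenTelemetry Go API; the attribute keys (`tekton.error.category`, `tekton.timeout.reason`) are illustrative placeholders, not an agreed convention.
```go
package tracing

import (
	"context"
	"errors"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
	"go.opentelemetry.io/otel/trace"
)

// RecordReconcileError stamps the current span with a consistent error
// shape: a user/system category, the message, and a timeout reason when
// the context deadline was exceeded.
func RecordReconcileError(ctx context.Context, err error, userError bool) {
	span := trace.SpanFromContext(ctx)
	category := "system"
	if userError {
		category = "user"
	}
	span.SetAttributes(attribute.String("tekton.error.category", category))
	if errors.Is(err, context.DeadlineExceeded) {
		span.SetAttributes(attribute.String("tekton.timeout.reason", "deadline_exceeded"))
	}
	span.RecordError(err)
	span.SetStatus(codes.Error, err.Error())
}
```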
Performance & configuration
- Head/rate-limited sampling strategies are configurable (service name/env, Jaeger HTTP/gRPC exporters), with at most ~5% overhead verified by benchmarks; attribute collection uses lazy/size-bounded patterns (see the configuration sketch after this list).
- Configuration is validated; misconfigurations produce actionable errors; docs and examples updated.
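A hedged sketch of the sampling and exporter configuration described above, assuming the OpenTelemetry Go SDK with an OTLP/HTTP exporter (which Jaeger and Tempo both accept). The endpoint, service name, and sampling ratio would come from config-tracing-style configuration and are illustrative here.
```go
package tracing

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// NewTracerProvider wires an OTLP/HTTP exporter with a parent-based,
// ratio-limited sampler and registers it as the global provider.
func NewTracerProvider(ctx context.Context, endpoint, serviceName string, ratio float64) (*sdktrace.TracerProvider, error) {
	exp, err := otlptracehttp.New(ctx,
		otlptracehttp.WithEndpoint(endpoint), // e.g. "jaeger-collector.observability.svc:4318"
	)
	if err != nil {
		return nil, err
	}
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exp),
		sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(ratio))),
		sdktrace.WithResource(resource.NewSchemaless(attribute.String("service.name", serviceName))),
	)
	otel.SetTracerProvider(tp)
	return tp, nil
}
```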
Quality bar
- Unit + integration tests for propagation, controller spans, resource-path spans, correlation hooks; reusable test utilities/mocks; 70+% coverage on tracing code.
OpenShift-specific
- Works out-of-the-box on OpenShift with: OTEL Collector + Jaeger/Tempo Operator backends; OpenShift Pipelines Operator supported versions.
- Kustomize overlays for OpenShift install (config-tracing, RBAC, NetworkPolicy, TLS).
- Supports cluster proxy/TLS, namespace multi-tenancy, and Red Hat image/FIPS constraints (see the CA-bundle sketch after this list).
- Red Hat docs include OpenShift “day-2” guidance: sampling knobs, backend options, troubleshooting.
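For the cluster proxy/TLS bullet, a hedged sketch of pointing the OTLP/HTTP exporter at a cluster-internal collector while trusting an injected CA bundle; the mount path and the helper name are illustrative assumptions.
```go
package tracing

import (
	"context"
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"os"

	"go.opentelemetry.io/otel/exporters/otlp/otlptrace"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
)

// newClusterExporter builds an OTLP/HTTP exporter that trusts a cluster CA
// bundle, e.g. one injected via the OpenShift trusted-CA ConfigMap.
func newClusterExporter(ctx context.Context, endpoint string) (*otlptrace.Exporter, error) {
	const caPath = "/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem" // illustrative mount path
	pem, err := os.ReadFile(caPath)
	if err != nil {
		return nil, fmt.Errorf("reading CA bundle: %w", err)
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(pem) {
		return nil, fmt.Errorf("no certificates parsed from %s", caPath)
	}
	return otlptracehttp.New(ctx,
		otlptracehttp.WithEndpoint(endpoint),
		otlptracehttp.WithTLSClientConfig(&tls.Config{RootCAs: pool}),
	)
}
```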
Dependencies (internal and external)
Sequence (inter-story)
Phases:
1: Foundation and Configuration
2: ResolutionRequest Controller Instrumentation
3: CustomRun Controller Instrumentation
4: PipelineRun Resource Functions Instrumentation
5: TaskRun Resource Functions Instrumentation
6: Pod Operations Instrumentation
7: Timeout Instrumentation
8: Metrics and Events Correlation
9: Performance Optimization
- Foundation (Phase-1: config, test infra, propagation) completes first; Phases 2–6 (controllers/resources/pod) can proceed in parallel; correlation (Phase-8) and enhanced error/timeout (Phase-7) depend on those; performance work (Phase-9) can run alongside 7–8.
Foundational tech
- Config plumbing for service name/env, Jaeger exporter (HTTP+gRPC), sampling; config validation & docs.
- Test harness: mock tracer, span verifiers, in-memory exporter; trace-propagation tests available for all stories.
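A sketch of the in-memory-exporter test pattern the shared harness could expose, using the OpenTelemetry Go SDK's `tracetest` package; the span name and test shape are illustrative.
```go
package tracing_test

import (
	"context"
	"testing"

	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	"go.opentelemetry.io/otel/sdk/trace/tracetest"
)

// TestReconcileEmitsSpan verifies that a span is recorded and named as
// expected, using an in-memory exporter instead of a real backend.
func TestReconcileEmitsSpan(t *testing.T) {
	exporter := tracetest.NewInMemoryExporter()
	tp := sdktrace.NewTracerProvider(sdktrace.WithSyncer(exporter))
	defer func() { _ = tp.Shutdown(context.Background()) }()

	// In a real test the code under test would receive this provider;
	// here we start a span directly to keep the example self-contained.
	_, span := tp.Tracer("pipelinerun").Start(context.Background(), "ReconcileKind")
	span.End()

	spans := exporter.GetSpans()
	if len(spans) != 1 || spans[0].Name != "ReconcileKind" {
		t.Fatalf("expected one ReconcileKind span, got %+v", spans)
	}
}
```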
Downstream touchpoints
- Resource-path work (params/workspaces/results/when) depends on base tracer in controllers and propagation in place.
- Correlation hooks depend on metrics/event emission code paths being trace-ID aware.
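A hedged sketch of making event emission trace-ID aware, assuming client-go's `EventRecorder` and the OpenTelemetry Go API; the annotation keys are illustrative, not an existing Tekton convention.
```go
package events

import (
	"context"

	"go.opentelemetry.io/otel/trace"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/tools/record"
)

// EmitTracedEvent emits a Kubernetes Event carrying the current trace and
// span IDs as annotations so events can be correlated back to traces.
func EmitTracedEvent(ctx context.Context, rec record.EventRecorder, obj runtime.Object, reason, msg string) {
	sc := trace.SpanContextFromContext(ctx)
	if !sc.IsValid() {
		rec.Event(obj, corev1.EventTypeNormal, reason, msg)
		return
	}
	rec.AnnotatedEventf(obj, map[string]string{
		"tekton.dev/trace-id": sc.TraceID().String(),
		"tekton.dev/span-id":  sc.SpanID().String(),
	}, corev1.EventTypeNormal, reason, "%s", msg)
}
```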
OpenShift integration (for the newly added stories)
- Cluster components: OpenTelemetry Collector deployment, Jaeger or Tempo (via Operator), and OpenShift Pipelines Operator at required minima.
- Platform constraints: cluster proxy settings, CA bundles, OAuth-backed UI routing where applicable, and namespace isolation (multi-tenant).
- Packaging: OpenShift overlays/kustomizations; RBAC/NetworkPolicy updates; optional ServiceMesh/ingress specifics if traces egress via mesh.
- Release/docs: Red Hat-specific guides, example ConfigMaps/Secrets, and support matrix noted in product docs.
Previous Work (Optional):
The following is the current state of this feature, which I am adding as a note here.
Currently, we have controller-level distributed tracing with function spans for PipelineRun/TaskRun reconciliation cycles, Jaeger backend support via OTLP, and trace propagation between resources.
Our areas of further work include Task execution tracing (the actual pod/step execution traces), OTEL environment variable configuration support, configurable tracing levels, and user-injectable custom traces.
From what I understand, the current OTLP HTTP tracing implementation in Tekton is already compatible with the OpenShift observability platform because OpenShift natively supports OTLP ingestion. What needs work is configuration updates (pointing to Cluster Observability Operator endpoints instead of direct Jaeger) and documentation updates (showing how to set up with the new operator instead of a standalone Jaeger installation).
This would require an epic covering, first, initial documentation and initial integration with OpenShift observability (which the customer can then test), and then implementation of the remaining traces, prioritizing whichever ones the customer requires. Of the requirements, we currently support the following, which probably need better documentation of their integration with cluster observability.
```
Support Jaeger backend - Fully implemented via OTLP HTTP exporter that works with Jaeger
Propagate traces so that subsequent reconciles of the same resource belong to the same trace - Implemented via SpanContext storage in resource status and annotation-based propagation
Propagate traces so that reconciles of a resource owned by a parent resource belong to parent span from the parent resource - Implemented: PipelineRun creates spans and propagates context to TaskRuns via tekton.dev/taskrunSpanContext annotation
Reconcile of different resources must belong to separate traces - Implemented: Each resource creates its own root span when no parent context exists
```
The following need more work and are only partially implemented:
```
Trace all functions in the PipelineRun controller - Approximately 70% complete. Currently traced functions include ReconcileKind, durationAndCountMetrics, finishReconcileUpdateEmitEvents, resolvePipelineState, reconcile, runNextSchedulableTask, createTaskRuns, createTaskRun, createCustomRuns, createCustomRun, updateLabelsAndAnnotations, updatePipelineRunStatusFromInformer. Missing spans for pipeline resolution helpers, parameter validation, workspace processing, timeout handling, and cancellation functions.
Trace all functions in the TaskRun controller - Approximately 70% complete. Currently traced functions include ReconcileKind, durationAndCountMetrics, stopSidecars, finishReconcileUpdateEmitEvents, prepare, reconcile, updateTaskRunWithDefaultWorkspaces, updateLabelsAndAnnotations, failTaskRun, createPod. Missing spans for task resolution, validation functions, pod creation helpers, resource processing, step processing, and timeout handling.
```
Four out of six requirements are fully implemented, and the tracing architecture and core functionality are working. What remains is function coverage: adding spans to the roughly 30% of controller functions that don't currently have tracing (a sketch of the span-wrapping pattern follows).
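A sketch of that span-wrapping pattern, assuming the OpenTelemetry Go API; the helper name stands in for any of the currently untraced functions listed above and is not the actual Tekton code.
```go
package pipelinerun

import (
	"context"

	"go.opentelemetry.io/otel"
)

var tracer = otel.Tracer("pipelinerun-reconciler")

// resolvePipelineParams stands in for any currently untraced helper
// (parameter validation, workspace processing, timeout handling, ...);
// the pattern is identical: open a child span, pass ctx onward, end it.
func resolvePipelineParams(ctx context.Context) error {
	ctx, span := tracer.Start(ctx, "resolvePipelineParams")
	defer span.End()

	// ... existing helper logic goes here, receiving ctx so that nested
	// calls become children of this span ...
	_ = ctx
	return nil
}
```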
Open questions:
- None
Done Checklist
- Acceptance criteria are met
- Non-functional properties of the Feature have been validated (such as performance, resource, UX, security or privacy aspects)
- User Journey automation is delivered
- Support and SRE teams are provided with enough skills to support the feature in a production environment
- is blocked by: SRVKP-8712 Testing for the epic (To Do)
- is incorporated by: SRVKP-7109 Distributed tracing for Tasks and Pipelines (New)