Feature Request
Resolution: Unresolved
Priority: Normal
Affects Version: all
Work Type: Product / Portfolio Work
Problem Statement
When upgrading OpenShift Pipelines Operator across versions that introduce new features (e.g., Tekton Results, Tekton Chains), these features automatically add finalizers to all existing PipelineRun resources in the cluster. This automatic behavior creates significant operational challenges for large-scale production deployments:
- Automatic Finalizer Injection: Features like Tekton Results are enabled by default during operator upgrades (e.g., 1.15 → 1.19), causing finalizers to be automatically added to all existing PipelineRuns
- Deletion Blockage: PipelineRuns with finalizers cannot be deleted normally, requiring manual finalizer cleanup (see the inspection sketch after this list)
- Persistent State: Even after manual finalizer removal, data persists in associated PVCs
- Forced Cluster Evacuation: Organizations must empty entire lower environment clusters before upgrades, disable the new features, perform the upgrade, then manually re-enable features to ensure new PipelineRuns start clean
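For illustration, the injected finalizers can be inspected directly on any completed PipelineRun; a minimal sketch (the finalizer names in the example output are illustrative, verify the exact strings present in the cluster):
# Inspect finalizers left on a PipelineRun (name and namespace are placeholders)
kubectl get pipelinerun <name> -n <namespace> -o jsonpath='{.metadata.finalizers}'
# Illustrative output: ["results.tekton.dev/pipelinerun","chains.tekton.dev"]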
Requested Enhancement
Provide a configuration option at the operator level to control finalizer behavior during upgrades:
- Default Behavior (unchanged): New features enabled by default with automatic finalizer injection (preserves current behavior for most users)
- Opt-Out Configuration: Allow operators to disable automatic finalizer injection during upgrade, enabling manual feature enablement post-upgrade
This would allow large-scale deployments to:
- Upgrade the operator with new features initially disabled
- Allow existing PipelineRuns to complete and clean up naturally
- Manually enable new features (Tekton Results, Tekton Chains) after upgrade stabilization
- Ensure only new PipelineRuns created after manual enablement receive finalizers
Proposed Implementation Options
Option 1: Operator-Level Configuration
apiVersion: operator.tekton.dev/v1alpha1
kind: TektonConfig
metadata:
  name: config
spec:
  pipeline:
    enable-new-features-on-upgrade: false  # Default: true for backward compatibility
Option 2: Feature-Specific Opt-In
apiVersion: operator.tekton.dev/v1alpha1
kind: TektonConfig
metadata:
  name: config
spec:
  pipeline:
    results:
      auto-enable-on-upgrade: false  # Explicit control per feature
    chains:
      auto-enable-on-upgrade: false
Option 3: Upgrade Annotation
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: openshift-pipelines-operator
  annotations:
    pipelines.openshift.io/disable-auto-features: "true"
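If Option 3 were adopted, the annotation could be applied to the existing Subscription before the upgrade is approved; a sketch assuming the default openshift-operators namespace and the annotation key proposed above (not an existing API):
# Apply the proposed opt-out annotation ahead of the operator upgrade
kubectl annotate subscription openshift-pipelines-operator -n openshift-operators \
  pipelines.openshift.io/disable-auto-features="true"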
Business Requirements and Justification
Industry Impact
This issue affects any large-scale OpenShift Pipelines deployment with:
- High PipelineRun volume (thousands per day)
- Long-lived clusters with accumulated PipelineRuns
- Strict change management requirements
- Production environments with zero-downtime expectations
Estimated Affected Customer Profile:
- Enterprise CI/CD platforms (similar to Citi's scale)
- Multi-tenant build platforms
- Organizations with regulatory/compliance constraints on data cleanup
- Customers running Pipelines Operator across major version upgrades
Strategic Value
- Reduces Upgrade Friction: Simpler upgrade path encourages regular updates, improving security posture
- Enterprise Adoption: Demonstrates Red Hat's understanding of enterprise operational requirements
- Competitive Differentiation: Mature operational controls vs. upstream Tekton community edition
- Customer Retention: Addresses pain point for strategic customers at scale
- Support Load Reduction: Fewer support cases related to upgrade complications
Affected Packages and Components
Primary Components
- OpenShift Pipelines Operator (openshift-pipelines-operator)
  - Subscription and lifecycle management
  - Feature enablement logic
  - Upgrade orchestration
- Tekton Pipeline Controller (tekton-pipelines)
  - PipelineRun reconciliation
  - Finalizer injection logic
  - Resource lifecycle management
- Tekton Results (tekton-results)
  - Results storage and API
  - Finalizer behavior on PipelineRuns
  - PVC data persistence
- Tekton Chains (tekton-chains)
  - Supply chain security attestation
  - Finalizer behavior on PipelineRuns
  - Attestation storage
Secondary Components
- TektonConfig CRD
  - Configuration schema updates
  - Feature flag definitions
  - Validation logic
- Operator Webhook
  - Admission control for configuration
  - Validation of feature enablement settings
  - Migration path handling
Documentation Updates Required
- Upgrade guides with new configuration options
- Migration documentation for existing deployments
- Best practices for large-scale deployments
- Troubleshooting guide for finalizer issues
Additional Technical Details
Current Behavior Analysis
Version 1.15 → 1.19 Upgrade Example:
- Operator upgraded via OLM subscription
- Tekton Results enabled by default in 1.19
- Results controller adds finalizers to all existing PipelineRuns (retroactive)
- Existing PipelineRuns cannot be deleted via normal TTL/cleanup
- Manual intervention required:
# Current workaround (per PipelineRun)
kubectl patch pipelinerun <name> -p '{"metadata":{"finalizers":null}}' --type=merge
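For clusters with a large backlog of blocked PipelineRuns, the same patch can be applied in a loop; an illustrative sketch (namespace is a placeholder):
# Strip finalizers from every PipelineRun in one namespace
for pr in $(kubectl get pipelinerun -n <namespace> -o name); do
  kubectl patch "$pr" -n <namespace> -p '{"metadata":{"finalizers":null}}' --type=merge
done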
Data Persistence Issue: Even after finalizer removal, Tekton Results data persists in:
- Results API database (if configured)
- PVCs created for results storage
- Requires separate cleanup of storage resources
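A companion cleanup sketch for the storage side, assuming the Results PVCs carry a recognizable label (the selector below is an assumption; confirm the labels used by the installation before deleting anything):
# Review candidate PVCs before removal
kubectl get pvc -A -l app.kubernetes.io/part-of=tekton-results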
Desired Behavior
Upgrade Path with Opt-Out:
- Operator upgraded via OLM subscription
- Configuration flag prevents auto-enablement of Tekton Results
- Existing PipelineRuns unaffected, complete naturally
- Cluster administrator manually enables Tekton Results after upgrade validation (see the sketch after this list)
- Only new PipelineRuns (post-enablement) receive finalizers
- Clean separation between pre-upgrade and post-upgrade workloads
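Under the Option 2 schema proposed above, the manual enablement step could look like the following sketch; the field names are the ones proposed in this request, not an existing TektonConfig API:
# Enable Tekton Results once the upgrade has been validated (hypothetical field)
kubectl patch tektonconfig config --type=merge \
  -p '{"spec":{"pipeline":{"results":{"auto-enable-on-upgrade":true}}}}'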
Edge Cases to Consider
- Mid-Upgrade PipelineRuns
  - PipelineRuns created during upgrade window
  - Behavior should be deterministic based on configuration state
- Feature Re-Enablement
  - Clear documentation on enabling features post-upgrade
  - Validation that configuration changes don't affect existing PipelineRuns
- Rollback Scenarios
  - Operator rollback behavior with feature flags
  - State consistency after rollback
- Multi-Namespace Impact
  - Cluster-scoped operator affecting all namespaces
  - Consistent behavior across all PipelineRuns cluster-wide
Success Criteria
Functional Requirements
- [ ] Configuration option available to disable automatic feature enablement during upgrades
- [ ] Existing PipelineRuns unaffected by disabled features
- [ ] Manual feature enablement works post-upgrade
- [ ] Only new PipelineRuns receive finalizers after manual enablement
- [ ] Backward compatible (default behavior unchanged)
Operational Requirements
- [ ] No cluster evacuation required before upgrades
- [ ] Upgrade process completable within standard maintenance window
- [ ] Clear upgrade documentation for both default and opt-out paths
- [ ] Validation tooling to verify configuration before upgrade
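As a starting point for such validation tooling, a pre-upgrade check could read the flag back before the OLM InstallPlan is approved; a sketch assuming the Option 1 flag proposed above:
kubectl get tektonconfig config -o jsonpath='{.spec.pipeline.enable-new-features-on-upgrade}'
# Expected output on the opt-out path: false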
Performance Requirements
- [ ] No performance degradation compared to current upgrade process
- [ ] Minimal additional configuration complexity
- [ ] Clear error messages if misconfigured
Workarounds and Current State
Current Workaround Process
- Pre-Upgrade Phase
  - Identify all lower environment clusters requiring upgrade
  - Schedule extended maintenance window (4-6 hours)
  - Notify developers of build platform outage
- Cluster Evacuation Phase
  - Disable new PipelineRun creation
  - Wait for in-flight PipelineRuns to complete
  - Delete all completed PipelineRuns
  - Verify the cluster is empty of PipelineRuns (see the verification sketch after this list)
  - Back up PVC data (optional)
- Upgrade Phase
  - Perform operator upgrade via OLM
  - Disable Tekton Results and Tekton Chains immediately post-upgrade
  - Validate operator health
- Post-Upgrade Phase
  - Re-enable PipelineRun creation
  - Monitor for issues
  - Manually enable features on a case-by-case basis for new workloads
- Cleanup Phase
  - Identify and remove orphaned PVCs
  - Document finalizer issues for next upgrade cycle
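The evacuation and cleanup phases above are currently driven by ad-hoc commands; an illustrative sketch of the verification steps (selectors and namespaces vary per cluster):
# Confirm no PipelineRuns remain before the upgrade proceeds (expect 0)
kubectl get pipelinerun -A --no-headers | wc -l
# Review Results-related PVCs before backup or removal (name filter is a heuristic)
kubectl get pvc -A | grep -i result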
Problems with Current Workaround:
- Labor-intensive (requires 2-3 engineers for each cluster)
- High-risk manual process
- Extended outage windows unacceptable for production
- Not scalable across multiple clusters
- Does not address root cause
Alternative Workarounds Considered
Option A: Stay on Older Versions
- Status: Not viable long-term
- Issues: Security vulnerabilities, missing features, limited support
Option B: Automated Finalizer Cleanup
- Status: Implemented as stopgap
- Issues: Treats the symptom rather than the root cause, fragile scripts, PVC data still orphaned
Option C: Separate Clusters for New Features
- Status: Not practical
- Issues: Resource multiplication, operational complexity, cost prohibitive
References and Supporting Information
Related Documentation