Feature Request
Resolution: Unresolved
Priority: Normal
Affects Version: all
Work Type: Product / Portfolio Work
Problem Statement
When upgrading OpenShift Pipelines Operator across versions that introduce new features (e.g., Tekton Results, Tekton Chains), these features automatically add finalizers to all existing PipelineRun resources in the cluster. This automatic behavior creates significant operational challenges for large-scale production deployments:
- Automatic Finalizer Injection: Features like Tekton Results are enabled by default during operator upgrades (e.g., 1.15 → 1.19), causing finalizers to be automatically added to all existing PipelineRuns
- Deletion Blockage: PipelineRuns with finalizers cannot be deleted normally, requiring manual finalizer cleanup (see the inspection sketch after this list)
- Persistent State: Even after manual finalizer removal, data persists in associated PVCs
- Forced Cluster Evacuation: Organizations must empty entire lower environment clusters before upgrades, disable the new features, perform the upgrade, then manually re-enable features to ensure new PipelineRuns start clean
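For illustration, the injected finalizers can be inspected directly on any completed PipelineRun; a minimal sketch (the finalizer names in the example output are illustrative, verify the exact strings present in the cluster):
# Inspect finalizers left on a PipelineRun (name and namespace are placeholders)
kubectl get pipelinerun <name> -n <namespace> -o jsonpath='{.metadata.finalizers}'
# Illustrative output: ["results.tekton.dev/pipelinerun","chains.tekton.dev"]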
Requested Enhancement
Provide a configuration option at the operator level to control finalizer behavior during upgrades:
- Default Behavior (unchanged): New features enabled by default with automatic finalizer injection (preserves current behavior for most users)
- Opt-Out Configuration: Allow operators to disable automatic finalizer injection during upgrade, enabling manual feature enablement post-upgrade
This would allow large-scale deployments to:
- Upgrade the operator with new features initially disabled
- Allow existing PipelineRuns to complete and clean up naturally
- Manually enable new features (Tekton Results, Tekton Chains) after upgrade stabilization
- Ensure only new PipelineRuns created after manual enablement receive finalizers
Proposed Implementation Options
Option 1: Operator-Level Configuration
apiVersion: operator.tekton.dev/v1alpha1
kind: TektonConfig
metadata:
  name: config
spec:
  pipeline:
    enable-new-features-on-upgrade: false  # Default: true for backward compatibility
Option 2: Feature-Specific Opt-In
apiVersion: operator.tekton.dev/v1alpha1
kind: TektonConfig
metadata:
  name: config
spec:
  pipeline:
    results:
      auto-enable-on-upgrade: false  # Explicit control per feature
    chains:
      auto-enable-on-upgrade: false
Option 3: Upgrade Annotation
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: openshift-pipelines-operator
  annotations:
    pipelines.openshift.io/disable-auto-features: "true"
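If Option 3 were adopted, the annotation could be applied to the existing Subscription before the upgrade is approved; a sketch assuming the default openshift-operators namespace and the annotation key proposed above (not an existing API):
# Apply the proposed opt-out annotation ahead of the operator upgrade
kubectl annotate subscription openshift-pipelines-operator -n openshift-operators \
  pipelines.openshift.io/disable-auto-features="true"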
Business Requirements and Justification
Industry Impact
This issue affects any large-scale OpenShift Pipelines deployment with:
- High PipelineRun volume (thousands per day)
- Long-lived clusters with accumulated PipelineRuns
- Strict change management requirements
- Production environments with zero-downtime expectations
Estimated Affected Customer Profile:
- Enterprise CI/CD platforms (similar to Citi's scale)
- Multi-tenant build platforms
- Organizations with regulatory/compliance constraints on data cleanup
- Customers running Pipelines Operator across major version upgrades
Strategic Value
- Reduces Upgrade Friction: Simpler upgrade path encourages regular updates, improving security posture
- Enterprise Adoption: Demonstrates Red Hat's understanding of enterprise operational requirements
- Competitive Differentiation: Mature operational controls vs. upstream Tekton community edition
- Customer Retention: Addresses pain point for strategic customers at scale
- Support Load Reduction: Fewer support cases related to upgrade complications
Affected Packages and Components
Primary Components
- OpenShift Pipelines Operator (openshift-pipelines-operator)
  - Subscription and lifecycle management
  - Feature enablement logic
  - Upgrade orchestration
- Tekton Pipeline Controller (tekton-pipelines)
  - PipelineRun reconciliation
  - Finalizer injection logic
  - Resource lifecycle management
- Tekton Results (tekton-results)
  - Results storage and API
  - Finalizer behavior on PipelineRuns
  - PVC data persistence
- Tekton Chains (tekton-chains)
  - Supply chain security attestation
  - Finalizer behavior on PipelineRuns
  - Attestation storage
Secondary Components
- TektonConfig CRD
  - Configuration schema updates
  - Feature flag definitions
  - Validation logic
- Operator Webhook
  - Admission control for configuration
  - Validation of feature enablement settings
  - Migration path handling
Documentation Updates Required
- Upgrade guides with new configuration options
- Migration documentation for existing deployments
- Best practices for large-scale deployments
- Troubleshooting guide for finalizer issues
Additional Technical Details
Current Behavior Analysis
Version 1.15 → 1.19 Upgrade Example:
- Operator upgraded via OLM subscription
- Tekton Results enabled by default in 1.19
- Results controller adds finalizers to all existing PipelineRuns (retroactive)
- Existing PipelineRuns cannot be deleted via normal TTL/cleanup
- Manual intervention required:
# Current workaround (per PipelineRun)
kubectl patch pipelinerun <name> -p '{"metadata":{"finalizers":null}}' --type=merge
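For clusters with a large backlog of blocked PipelineRuns, the same patch can be applied in a loop; an illustrative sketch (namespace is a placeholder):
# Strip finalizers from every PipelineRun in one namespace
for pr in $(kubectl get pipelinerun -n <namespace> -o name); do
  kubectl patch "$pr" -n <namespace> -p '{"metadata":{"finalizers":null}}' --type=merge
done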
Data Persistence Issue: Even after finalizer removal, Tekton Results data persists in:
- Results API database (if configured)
- PVCs created for results storage
- Requires separate cleanup of storage resources
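A companion cleanup sketch for the storage side, assuming the Results PVCs carry a recognizable label (the selector below is an assumption; confirm the labels used by the installation before deleting anything):
# Review candidate PVCs before removal
kubectl get pvc -A -l app.kubernetes.io/part-of=tekton-results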
Desired Behavior
Upgrade Path with Opt-Out:
- Operator upgraded via OLM subscription
- Configuration flag prevents auto-enablement of Tekton Results
- Existing PipelineRuns unaffected, complete naturally
- Cluster administrator manually enables Tekton Results after upgrade validation (see the sketch after this list)
- Only new PipelineRuns (post-enablement) receive finalizers
- Clean separation between pre-upgrade and post-upgrade workloads
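Under the Option 2 schema proposed above, the manual enablement step could look like the following sketch; the field names are the ones proposed in this request, not an existing TektonConfig API:
# Enable Tekton Results once the upgrade has been validated (hypothetical field)
kubectl patch tektonconfig config --type=merge \
  -p '{"spec":{"pipeline":{"results":{"auto-enable-on-upgrade":true}}}}'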
Edge Cases to Consider
- Mid-Upgrade PipelineRuns
  - PipelineRuns created during upgrade window
  - Behavior should be deterministic based on configuration state
- Feature Re-Enablement
  - Clear documentation on enabling features post-upgrade
  - Validation that configuration changes don't affect existing PipelineRuns
- Rollback Scenarios
  - Operator rollback behavior with feature flags
  - State consistency after rollback
- Multi-Namespace Impact
  - Cluster-scoped operator affecting all namespaces
  - Consistent behavior across all PipelineRuns cluster-wide
Success Criteria
Functional Requirements
- [ ] Configuration option available to disable automatic feature enablement during upgrades
- [ ] Existing PipelineRuns unaffected by disabled features
- [ ] Manual feature enablement works post-upgrade
- [ ] Only new PipelineRuns receive finalizers after manual enablement
- [ ] Backward compatible (default behavior unchanged)
Operational Requirements
- [ ] No cluster evacuation required before upgrades
- [ ] Upgrade process completable within standard maintenance window
- [ ] Clear upgrade documentation for both default and opt-out paths
- [ ] Validation tooling to verify configuration before upgrade
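As a starting point for such validation tooling, a pre-upgrade check could read the flag back before the OLM InstallPlan is approved; a sketch assuming the Option 1 flag proposed above:
kubectl get tektonconfig config -o jsonpath='{.spec.pipeline.enable-new-features-on-upgrade}'
# Expected output on the opt-out path: false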
Performance Requirements
- [ ] No performance degradation compared to current upgrade process
- [ ] Minimal additional configuration complexity
- [ ] Clear error messages if misconfigured
Workarounds and Current State
Current Workaround Process
- Pre-Upgrade Phase
  - Identify all lower environment clusters requiring upgrade
  - Schedule extended maintenance window (4-6 hours)
  - Notify developers of build platform outage
- Cluster Evacuation Phase
  - Disable new PipelineRun creation
  - Wait for in-flight PipelineRuns to complete
  - Delete all completed PipelineRuns
  - Verify the cluster is empty of PipelineRuns (see the verification sketch after this list)
  - Back up PVC data (optional)
- Upgrade Phase
  - Perform operator upgrade via OLM
  - Disable Tekton Results and Tekton Chains immediately post-upgrade
  - Validate operator health
- Post-Upgrade Phase
  - Re-enable PipelineRun creation
  - Monitor for issues
  - Manually enable features on a case-by-case basis for new workloads
- Cleanup Phase
  - Identify and remove orphaned PVCs
  - Document finalizer issues for next upgrade cycle
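The evacuation and cleanup phases above are currently driven by ad-hoc commands; an illustrative sketch of the verification steps (selectors and namespaces vary per cluster):
# Confirm no PipelineRuns remain before the upgrade proceeds (expect 0)
kubectl get pipelinerun -A --no-headers | wc -l
# Review Results-related PVCs before backup or removal (name filter is a heuristic)
kubectl get pvc -A | grep -i result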
Problems with Current Workaround:
- Labor-intensive (requires 2-3 engineers for each cluster)
- High-risk manual process
- Extended outage windows unacceptable for production
- Not scalable across multiple clusters
- Does not address root cause
Alternative Workarounds Considered
Option A: Stay on Older Versions
- Status: Not viable long-term
- Issues: Security vulnerabilities, missing features, limited support
Option B: Automated Finalizer Cleanup
- Status: Implemented as stopgap
- Issues: Treats the symptom rather than the root cause, fragile scripts, PVC data still orphaned
Option C: Separate Clusters for New Features
- Status: Not practical
- Issues: Resource multiplication, operational complexity, cost prohibitive
References and Supporting Information
Related Documentation