OpenShift Request For Enhancement / RFE-8411

Provide configuration option to disable automatic finalizer injection during OpenShift Pipelines Operator upgrades


    • Type: Feature Request
    • Resolution: Unresolved
    • Priority: Normal

      Problem Statement

      When the OpenShift Pipelines Operator is upgraded across versions that introduce new features (e.g., Tekton Results, Tekton Chains), those features automatically add finalizers to all existing PipelineRun resources in the cluster. This behavior creates significant operational challenges for large-scale production deployments:

      1. Automatic Finalizer Injection: Features like Tekton Results are enabled by default during operator upgrades (e.g., 1.15 → 1.19), causing finalizers to be automatically added to all existing PipelineRuns
      2. Deletion Blockage: PipelineRuns with finalizers cannot be deleted normally, requiring manual finalizer cleanup
      3. Persistent State: Even after manual finalizer removal, data persists in associated PVCs
      4. Forced Cluster Evacuation: Organizations must empty entire lower environment clusters before upgrades, disable the new features, perform the upgrade, then manually re-enable features to ensure new PipelineRuns start clean
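
      To gauge how many PipelineRuns are affected, one could tally finalizers from a saved dump of `kubectl get pipelinerun -A -o json`. The sketch below is illustrative only: the function name and the sample finalizer strings are placeholders, not values taken from the operator.

```python
from collections import Counter
from typing import Any, Dict, List


def count_finalizers(items: List[Dict[str, Any]]) -> Counter:
    """Count how many PipelineRuns carry each finalizer.

    `items` is the "items" array from `kubectl get pipelinerun -A -o json`.
    """
    counts: Counter = Counter()
    for item in items:
        # metadata.finalizers is absent (or null) when no finalizer is set
        for name in item.get("metadata", {}).get("finalizers") or []:
            counts[name] += 1
    return counts


# Sample data; the finalizer names are placeholders, not real Tekton values.
sample = [
    {"metadata": {"name": "run-a", "finalizers": ["example.dev/results"]}},
    {"metadata": {"name": "run-b",
                  "finalizers": ["example.dev/results", "example.dev/chains"]}},
    {"metadata": {"name": "run-c"}},
]
print(count_finalizers(sample))
```

      Running this before and after an upgrade in a test cluster would show which finalizers were added retroactively and to how many resources.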

      Requested Enhancement

      Provide a configuration option at the operator level to control finalizer behavior during upgrades:

      • Default Behavior (unchanged): New features enabled by default with automatic finalizer injection (preserves current behavior for most users)
      • Opt-Out Configuration: Allow cluster administrators to disable automatic finalizer injection during upgrades, so that features can be enabled manually after the upgrade

      This would allow large-scale deployments to:

      1. Upgrade the operator with new features initially disabled
      2. Allow existing PipelineRuns to complete and clean up naturally
      3. Manually enable new features (Tekton Results, Tekton Chains) after upgrade stabilization
      4. Ensure only new PipelineRuns created after manual enablement receive finalizers

      Proposed Implementation Options

      Option 1: Operator-Level Configuration

       

      apiVersion: operator.tekton.dev/v1alpha1
      kind: TektonConfig
      metadata:
        name: config
      spec:
        pipeline:
          enable-new-features-on-upgrade: false # Default: true for backward compatibility

      Option 2: Feature-Specific Opt-In

       

      apiVersion: operator.tekton.dev/v1alpha1
      kind: TektonConfig
      metadata:
        name: config
      spec:
        pipeline:
          results:
            auto-enable-on-upgrade: false # Explicit control per feature
          chains:
            auto-enable-on-upgrade: false

      Option 3: Upgrade Annotation

       

      apiVersion: operators.coreos.com/v1alpha1
      kind: Subscription
      metadata:
        name: openshift-pipelines-operator
        annotations:
          pipelines.openshift.io/disable-auto-features: "true"
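
      If Options 1 and 2 were both adopted, the operator would need a rule for combining them. The sketch below shows one possible rule, under the assumption (not stated in this request) that a per-feature setting overrides the global flag and that everything defaults to true. All field names are the proposed ones from the options above; none exist in TektonConfig today.

```python
from typing import Any, Dict


def auto_enable_on_upgrade(spec: Dict[str, Any], feature: str) -> bool:
    """Resolve whether `feature` should be auto-enabled during an upgrade.

    Assumed precedence: per-feature flag (Option 2), then the global flag
    (Option 1), then the backward-compatible default of True.
    """
    pipeline = spec.get("pipeline") or {}
    per_feature = (pipeline.get(feature) or {}).get("auto-enable-on-upgrade")
    if per_feature is not None:
        return bool(per_feature)
    return bool(pipeline.get("enable-new-features-on-upgrade", True))


# Global opt-out, with Chains explicitly opted back in.
spec = {
    "pipeline": {
        "enable-new-features-on-upgrade": False,
        "chains": {"auto-enable-on-upgrade": True},
    }
}
print(auto_enable_on_upgrade(spec, "results"))
print(auto_enable_on_upgrade(spec, "chains"))
```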


      Business Requirements and Justification

      Industry Impact

      This issue affects any large-scale OpenShift Pipelines deployment with:

      • High PipelineRun volume (thousands per day)
      • Long-lived clusters with accumulated PipelineRuns
      • Strict change management requirements
      • Production environments with zero-downtime expectations

      Estimated Affected Customer Profile:

      • Enterprise CI/CD platforms (similar to Citi's scale)
      • Multi-tenant build platforms
      • Organizations with regulatory/compliance constraints on data cleanup
      • Customers running Pipelines Operator across major version upgrades

      Strategic Value

      1. Reduces Upgrade Friction: Simpler upgrade path encourages regular updates, improving security posture
      2. Enterprise Adoption: Demonstrates Red Hat's understanding of enterprise operational requirements
      3. Competitive Differentiation: Mature operational controls vs. upstream Tekton community edition
      4. Customer Retention: Addresses pain point for strategic customers at scale
      5. Support Load Reduction: Fewer support cases related to upgrade complications

      Affected Packages and Components

      Primary Components

      • OpenShift Pipelines Operator (openshift-pipelines-operator)
        • Subscription and lifecycle management
        • Feature enablement logic
        • Upgrade orchestration
      • Tekton Pipeline Controller (tekton-pipelines)
        • PipelineRun reconciliation
        • Finalizer injection logic
        • Resource lifecycle management
      • Tekton Results (tekton-results)
        • Results storage and API
        • Finalizer behavior on PipelineRuns
        • PVC data persistence
      • Tekton Chains (tekton-chains)
        • Supply chain security attestation
        • Finalizer behavior on PipelineRuns
        • Attestation storage

      Secondary Components

      • TektonConfig CRD
        • Configuration schema updates
        • Feature flag definitions
        • Validation logic
      • Operator Webhook
        • Admission control for configuration
        • Validation of feature enablement settings
        • Migration path handling

      Documentation Updates Required

      • Upgrade guides with new configuration options
      • Migration documentation for existing deployments
      • Best practices for large-scale deployments
      • Troubleshooting guide for finalizer issues

      Additional Technical Details

      Current Behavior Analysis

      Version 1.15 → 1.19 Upgrade Example:

      1. Operator upgraded via OLM subscription
      2. Tekton Results enabled by default in 1.19
      3. Results controller adds finalizers to all existing PipelineRuns (retroactive)
      4. Existing PipelineRuns cannot be deleted via normal TTL/cleanup
      5. Manual intervention required:
        # Current workaround (per PipelineRun)
        kubectl patch pipelinerun <name> -p '{"metadata":{"finalizers":null}}' --type=merge
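
      Setting `finalizers` to null removes every finalizer on the resource, including any that other controllers rely on. A narrower variant is sketched below: it builds a merge patch that drops only one named finalizer. The function name and the finalizer strings are hypothetical; the real finalizer name should be read from the affected PipelineRun.

```python
import json
from typing import List


def finalizer_removal_patch(current: List[str], to_remove: str) -> str:
    """Build a JSON merge patch that removes a single finalizer.

    A merge patch replaces the whole finalizers list, so the patch carries
    the remaining entries (or null when nothing is left).
    """
    remaining = [name for name in current if name != to_remove]
    patch = {"metadata": {"finalizers": remaining or None}}
    return json.dumps(patch)


# Placeholder finalizer names for illustration only.
print(finalizer_removal_patch(
    ["example.dev/results", "example.dev/other"], "example.dev/results"))
```

      The resulting string can be passed to `kubectl patch pipelinerun <name> --type=merge -p '<patch>'`. Because the list is replaced as a whole, a finalizer added by another controller between the read and the patch would be lost, so this remains a stopgap.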

      Data Persistence Issue: Even after finalizer removal, Tekton Results data persists in the following locations, each of which requires separate cleanup:

      • Results API database (if configured)
      • PVCs created for results storage

      Desired Behavior

      Upgrade Path with Opt-Out:

      1. Operator upgraded via OLM subscription
      2. Configuration flag prevents auto-enablement of Tekton Results
      3. Existing PipelineRuns unaffected, complete naturally
      4. Cluster administrator manually enables Tekton Results after upgrade validation
      5. Only new PipelineRuns (post-enablement) receive finalizers
      6. Clean separation between pre-upgrade and post-upgrade workloads
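
      The separation in steps 5 and 6 can be expressed as a single rule. This is a minimal sketch of the requested semantics, not the controller's actual logic; the function name and timestamps are illustrative.

```python
from datetime import datetime, timezone
from typing import Optional


def should_add_finalizer(created_at: datetime,
                         enabled_at: Optional[datetime]) -> bool:
    """Return True only for PipelineRuns created at or after manual enablement.

    `enabled_at` is None while the feature is still disabled after the upgrade.
    """
    if enabled_at is None:
        return False
    return created_at >= enabled_at


# Illustrative timestamps only.
enabled = datetime(2025, 1, 10, 12, 0, tzinfo=timezone.utc)
old_run = datetime(2025, 1, 9, 8, 0, tzinfo=timezone.utc)
new_run = datetime(2025, 1, 10, 12, 30, tzinfo=timezone.utc)

print(should_add_finalizer(old_run, enabled))
print(should_add_finalizer(new_run, enabled))
print(should_add_finalizer(new_run, None))
```

      Keying the decision on creation time rather than on reconcile time is what keeps pre-upgrade PipelineRuns untouched even if they are reconciled again after enablement.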

      Edge Cases to Consider

      1. Mid-Upgrade PipelineRuns
        • PipelineRuns created during upgrade window
        • Behavior should be deterministic based on configuration state
      2. Feature Re-Enablement
        • Clear documentation on enabling features post-upgrade
        • Validation that configuration changes don't affect existing PipelineRuns
      3. Rollback Scenarios
        • Operator rollback behavior with feature flags
        • State consistency after rollback
      4. Multi-Namespace Impact
        • Cluster-scoped operator affecting all namespaces
        • Consistent behavior across all PipelineRuns cluster-wide

      Success Criteria

      Functional Requirements

      • [ ] Configuration option available to disable automatic feature enablement during upgrades
      • [ ] Existing PipelineRuns unaffected by disabled features
      • [ ] Manual feature enablement works post-upgrade
      • [ ] Only new PipelineRuns receive finalizers after manual enablement
      • [ ] Backward compatible (default behavior unchanged)

      Operational Requirements

      • [ ] No cluster evacuation required before upgrades
      • [ ] Upgrade process completable within standard maintenance window
      • [ ] Clear upgrade documentation for both default and opt-out paths
      • [ ] Validation tooling to verify configuration before upgrade

      Performance Requirements

      • [ ] No performance degradation compared to current upgrade process
      • [ ] Minimal additional configuration complexity
      • [ ] Clear error messages if misconfigured
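
      As an illustration of the validation tooling and error-message criteria above, a pre-upgrade check might look like the following. It is a sketch that assumes the Option 1 field name; that field is proposed in this request and does not exist in TektonConfig today.

```python
from typing import Any, Dict, List

# Field name proposed in Option 1 of this request; hypothetical.
PROPOSED_FLAG = "enable-new-features-on-upgrade"


def validate_upgrade_flag(tekton_config: Dict[str, Any]) -> List[str]:
    """Return human-readable errors for a parsed TektonConfig; empty if valid."""
    errors: List[str] = []
    pipeline = (tekton_config.get("spec") or {}).get("pipeline") or {}
    if PROPOSED_FLAG in pipeline and not isinstance(pipeline[PROPOSED_FLAG], bool):
        errors.append(
            f"spec.pipeline.{PROPOSED_FLAG} must be true or false, "
            f"got {pipeline[PROPOSED_FLAG]!r}"
        )
    return errors


# A quoted "false" is a string in YAML, a common misconfiguration.
bad = {"spec": {"pipeline": {PROPOSED_FLAG: "false"}}}
for message in validate_upgrade_flag(bad):
    print(message)
```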

      Workarounds and Current State

      Current Workaround Process

      1. Pre-Upgrade Phase
        • Identify all lower environment clusters requiring upgrade
        • Schedule extended maintenance window (4-6 hours)
        • Notify developers of build platform outage
      2. Cluster Evacuation Phase
        • Disable new PipelineRun creation
        • Wait for in-flight PipelineRuns to complete
        • Delete all completed PipelineRuns
        • Verify cluster empty of PipelineRuns
        • Backup PVC data (optional)
      3. Upgrade Phase
        • Perform operator upgrade via OLM
        • Disable Tekton Results and Tekton Chains immediately post-upgrade
        • Validate operator health
      4. Post-Upgrade Phase
        • Re-enable PipelineRun creation
        • Monitor for issues
        • Manually enable features on case-by-case basis for new workloads
      5. Cleanup Phase
        • Identify and remove orphaned PVCs
        • Document finalizer issues for next upgrade cycle
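
      The evacuation checks in phase 2 can be scripted against a dump of `kubectl get pipelinerun -A -o json`. The sketch below relies on the Tekton convention that a run is finished once its `Succeeded` condition is `True` or `False`; the function names and sample data are illustrative.

```python
from typing import Any, Dict, List


def is_finished(run: Dict[str, Any]) -> bool:
    """A PipelineRun is finished when its Succeeded condition is True or False."""
    for cond in (run.get("status") or {}).get("conditions") or []:
        if cond.get("type") == "Succeeded":
            return cond.get("status") in ("True", "False")
    return False  # no condition yet: treat as still in flight


def evacuation_status(items: List[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Split remaining PipelineRuns into those to wait for and those to delete."""
    status: Dict[str, List[str]] = {"in_flight": [], "to_delete": []}
    for run in items:
        meta = run.get("metadata", {})
        key = f'{meta.get("namespace", "")}/{meta.get("name", "")}'
        status["to_delete" if is_finished(run) else "in_flight"].append(key)
    return status


sample_runs = [
    {"metadata": {"namespace": "ci", "name": "build-1"},
     "status": {"conditions": [{"type": "Succeeded", "status": "True"}]}},
    {"metadata": {"namespace": "ci", "name": "build-2"},
     "status": {"conditions": [{"type": "Succeeded", "status": "Unknown"}]}},
]
print(evacuation_status(sample_runs))
```

      The cluster counts as evacuated only when both lists are empty.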

      Problems with Current Workaround:

      • Labor-intensive (requires 2-3 engineers for each cluster)
      • High-risk manual process
      • Extended outage windows unacceptable for production
      • Not scalable across multiple clusters
      • Does not address root cause

      Alternative Workarounds Considered

      Option A: Stay on Older Versions

      • Status: Not viable long-term
      • Issues: Security vulnerabilities, missing features, limited support

      Option B: Automated Finalizer Cleanup

      • Status: Implemented as stopgap
      • Issues: Treats symptom not root cause, fragile scripts, PVC data still orphaned

      Option C: Separate Clusters for New Features

      • Status: Not practical
      • Issues: Resource multiplication, operational complexity, cost prohibitive

      References and Supporting Information

      Related Documentation

       

              rh-ee-ssadeghi Siamak Sadeghianfar
              rhn-support-gvaughn Grimm Greysson