
SRVKP-10858: TektonInstallerSet deadlock: resources with deletionTimestamp block entire reconciliation preventing ServiceAccount creation


    • Sprint: Pipelines Sprint CrookShank 49

      Summary

      The operator enters a deadlock state when any resource (e.g., CRD) has a deletionTimestamp during InstallerSet reconciliation. The operator aborts the entire reconciliation phase immediately, preventing critical namespace-scoped resources (ServiceAccounts, RBAC) from being created, which in turn prevents all component Deployments from starting.

      Root Cause

      Location: pkg/reconciler/kubernetes/tektoninstallerset/install.go lines 166-168

      if res.GetDeletionTimestamp() != nil {
          ressourceLogger.Debug("resource is being deleted, will reconcile again")
          return v1alpha1.RECONCILE_AGAIN_ERR  // ← BUG: Aborts entire phase
      }
      

      Problem: The ensureResources() function returns RECONCILE_AGAIN_ERR immediately when it encounters any resource with a deletionTimestamp, even if that resource is:

      • Not owned by this InstallerSet
      • Being deleted by another controller
      • A CRD stuck in TERMINATING state due to existing workloads

      This aborts the entire reconciliation phase, preventing subsequent resources in that phase from being processed.

      The Deadlock Cycle

      InstallerSet reconciliation happens in sequential phases:

      Phase 1: CRDs - If a CRD is TERMINATING, the operator returns RECONCILE_AGAIN_ERR

      Phase 2: Cluster-scoped - Never reached due to Phase 1 abort

      Phase 3: Namespace-scoped - Never reached, ServiceAccounts not created

      Phase 4: Deployments - Never reached

      Result:

      • No ServiceAccounts → Deployments can't start (serviceaccount not found errors)
      • No Deployments → Webhooks don't start
      • No Webhooks → TektonConfig can't reconcile
      • Operator keeps retrying Phase 1 infinitely, fetching the same TERMINATING CRD repeatedly

      Symptoms

      When the deadlock occurs:

      • ✅ Operator pod is running
      • ✅ TektonInstallerSets are created (pipeline-main-static, pipeline-main-deployment, etc.)
      • ❌ InstallerSets stuck with status: Install failed with message: reconcile again and proceed
      • ❌ No ServiceAccounts in openshift-pipelines namespace (pipeline, tekton-pipelines-controller, tekton-pipelines-webhook)
      • ❌ No RBAC resources (Roles, RoleBindings)
      • ❌ No Deployments created
      • ❌ No pods running in openshift-pipelines
      • ❌ TektonConfig stuck in non-Ready state
      • ❌ Operator logs show an infinite loop fetching CRDs: fetching resource CustomResourceDefinition
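
      The commands below can be used to confirm these symptoms on a live cluster. This is a sketch: the operator deployment name (openshift-pipelines-operator) and its openshift-operators namespace are assumed from a default OLM installation and may differ in your environment.

      # Confirm the stuck InstallerSets ("Install failed" / "reconcile again and proceed")
      oc get tektoninstallersets.operator.tekton.dev

      # Confirm that no ServiceAccounts, Deployments, or pods exist for the components
      oc get sa,deploy,pods -n openshift-pipelines

      # Confirm the repeated CRD fetches in the operator log
      # (deployment name assumed; adjust to your install)
      oc logs deploy/openshift-pipelines-operator -n openshift-operators --tail=200 \
        | grep 'fetching resource' | tail -n 5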

      Reproduction Scenario

      This commonly occurs when:

      1. Tekton CRDs exist with active workloads (PipelineRuns, TaskRuns)

      2. An admin attempts to delete the CRDs (intentionally or during troubleshooting)

      3. The CRDs enter the TERMINATING state (Kubernetes won't delete them until the workloads are cleaned up; a quick check is shown after this list)

      4. The operator is reinstalled or restarted (e.g., during an upgrade/downgrade)

      5. The operator tries to reconcile while the CRDs are TERMINATING

      6. Deadlock occurs: the operator gets stuck in an infinite CRD fetch loop
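
      A quick way to confirm step 3 is to look at the CRD itself: a CRD blocked by remaining custom resources keeps its deletionTimestamp and reports a Terminating condition in its status. The commands below are a sketch using tasks.tekton.dev as an example; any of the Tekton CRDs works.

      # Check whether the CRD is stuck in TERMINATING
      oc get crd tasks.tekton.dev -o jsonpath='{.metadata.deletionTimestamp}{"\n"}'
      oc get crd tasks.tekton.dev \
        -o jsonpath='{.status.conditions[?(@.type=="Terminating")].message}{"\n"}'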

      Impact

      Severity: Critical

      Affected Operations:

      • Fresh installations (if any CRDs are TERMINATING)
      • Operator upgrades/downgrades
      • Operator restarts during troubleshooting
      • Recovery from corrupted states

      Customer Impact:

      • Complete operator failure - no Tekton components running
      • Pipelines infrastructure non-functional
      • Requires manual intervention to recover (creating ServiceAccounts manually or removing finalizers)

      Steps to Reproduce

      A reproduction script is available: reproduce-deadlock.sh

      Manual steps:

      # 1. Install the operator and wait for it to be Ready
      oc apply -f subscription.yaml

      # 2. Create test workloads
      oc create namespace test-pipelines
      cat <<EOF | oc apply -f -
      apiVersion: tekton.dev/v1
      kind: Task
      metadata:
        name: test-task
        namespace: test-pipelines
      spec:
        steps:
          - name: echo
            image: registry.access.redhat.com/ubi9/ubi-minimal:latest
            script: echo "test"
      ---
      apiVersion: tekton.dev/v1
      kind: TaskRun
      metadata:
        name: test-taskrun
        namespace: test-pipelines
      spec:
        taskRef:
          name: test-task
      EOF

      # 3. Delete CRDs while workloads exist (creates TERMINATING state)
      oc delete crd pipelineruns.tekton.dev taskruns.tekton.dev tasks.tekton.dev &

      # 4. Wait 10 seconds for CRDs to enter TERMINATING state
      sleep 10

      # 5. Reinstall the operator while CRDs are TERMINATING
      CURRENT_CSV=$(oc get csv -n openshift-operators -l operators.coreos.com/openshift-pipelines-operator-rh.openshift-operators -o jsonpath='{.items[0].metadata.name}')
      oc delete subscription openshift-pipelines-operator-rh -n openshift-operators
      oc delete csv "$CURRENT_CSV" -n openshift-operators
      oc apply -f subscription.yaml

      # 6. Wait 60 seconds and verify the deadlock
      sleep 60
      oc get sa -n openshift-pipelines | grep tekton || echo "DEADLOCK: No ServiceAccounts!"

      Expected result: No ServiceAccounts created, no pods running, InstallerSets stuck

      Proposed Fix

      Option 1 (Recommended): Skip all terminating resources

      if res.GetDeletionTimestamp() != nil {
          ressourceLogger.Debug("resource is being deleted, skipping and continuing with other resources")
          continue  // ← Changed from: return v1alpha1.RECONCILE_AGAIN_ERR
      }
      

      Rationale:

      • Minimal code change (1 line)
      • Correct behavior: don't block reconciliation on resources being deleted by other controllers
      • If we try to recreate a resource too early, Kubernetes will reject it (409) and we'll retry naturally

      Alternative options are documented in FIX-OPTIONS-SUMMARY.md.

      Workaround

      Manual recovery requires:

      • Create the missing ServiceAccounts manually, or remove finalizers from the TERMINATING resources to unblock CRD deletion (a sketch of the finalizer path follows below)

      • Restart the operator pod

      See fix-installerset-deadlock.sh for an automated workaround script.
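
      A minimal sketch of the finalizer-removal path, assuming the stuck workloads live in a single known namespace (test-pipelines, as in the reproduction above) and the default operator deployment name; prefer fix-installerset-deadlock.sh where available.

      # 1. Clear finalizers on the remaining TaskRuns/PipelineRuns so the
      #    TERMINATING CRDs can finish deleting (namespace assumed)
      oc get taskruns.tekton.dev,pipelineruns.tekton.dev -n test-pipelines -o name \
        | xargs -r -I{} oc patch {} -n test-pipelines --type=merge \
            -p '{"metadata":{"finalizers":[]}}'

      # 2. Restart the operator pod so reconciliation starts from scratch
      #    (deployment name assumed from a default OLM install)
      oc rollout restart deployment/openshift-pipelines-operator -n openshift-operators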

      Related Issues

      This was discovered during investigation of OSP 1.15 to 1.14 downgrade failures (SRVKP-10509).

      Additional Context

      • Detailed root cause analysis: INSTALLERSET-DEADLOCK-ROOT-CAUSE.md
      • Fix options comparison: FIX-OPTIONS-SUMMARY.md
      • Reproduction script: reproduce-deadlock.sh
      • Code location: pkg/reconciler/kubernetes/tektoninstallerset/install.go:166-168
