
SRVKP-10858: TektonInstallerSet deadlock: resources with deletionTimestamp block entire reconciliation preventing ServiceAccount creation


    • Sprint: Pipelines Sprint CrookShank 49

      Summary

      The operator enters a deadlock state when any resource (e.g., CRD) has a deletionTimestamp during InstallerSet reconciliation. The operator aborts the entire reconciliation phase immediately, preventing critical namespace-scoped resources (ServiceAccounts, RBAC) from being created, which in turn prevents all component Deployments from starting.

      Root Cause

      Location: pkg/reconciler/kubernetes/tektoninstallerset/install.go lines 166-168

      if res.GetDeletionTimestamp() != nil {
          ressourceLogger.Debug("resource is being deleted, will reconcile again")
          return v1alpha1.RECONCILE_AGAIN_ERR  // ← BUG: Aborts entire phase
      }
      

      Problem: The ensureResources() function returns RECONCILE_AGAIN_ERR immediately when it encounters any resource with a deletionTimestamp, even if that resource is:

      • Not owned by this InstallerSet
      • Being deleted by another controller
      • A CRD stuck in TERMINATING state due to existing workloads

      This aborts the entire reconciliation phase, preventing subsequent resources in that phase from being processed.

      The Deadlock Cycle

      InstallerSet reconciliation happens in sequential phases:

      Phase 1: CRDs - If a CRD is TERMINATING, the operator returns RECONCILE_AGAIN_ERR

      Phase 2: Cluster-scoped - Never reached due to Phase 1 abort

      Phase 3: Namespace-scoped - Never reached, ServiceAccounts not created

      Phase 4: Deployments - Never reached

      Result:

      • No ServiceAccounts → Deployments can't start (serviceaccount not found errors)
      • No Deployments → Webhooks don't start
      • No Webhooks → TektonConfig can't reconcile
      • Operator keeps retrying Phase 1 infinitely, fetching the same TERMINATING CRD repeatedly

      Symptoms

      When the deadlock occurs:

      • ✅ Operator pod is running
      • ✅ TektonInstallerSets are created (pipeline-main-static, pipeline-main-deployment, etc.)
      • ❌ InstallerSets stuck with status: Install failed with message: reconcile again and proceed
      • ❌ No ServiceAccounts in openshift-pipelines namespace (pipeline, tekton-pipelines-controller, tekton-pipelines-webhook)
      • ❌ No RBAC resources (Roles, RoleBindings)
      • ❌ No Deployments created
      • ❌ No pods running in openshift-pipelines
      • ❌ TektonConfig stuck in non-Ready state
      • ❌ Operator logs show an infinite loop fetching CRDs: fetching resource CustomResourceDefinition
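
      The commands below can be used to confirm these symptoms on a live cluster. This is a sketch: the operator deployment name (openshift-pipelines-operator) and its openshift-operators namespace are assumed from a default OLM installation and may differ in your environment.

      # Confirm the stuck InstallerSets ("Install failed" / "reconcile again and proceed")
      oc get tektoninstallersets.operator.tekton.dev

      # Confirm that no ServiceAccounts, Deployments, or pods exist for the components
      oc get sa,deploy,pods -n openshift-pipelines

      # Confirm the repeated CRD fetches in the operator log
      # (deployment name assumed; adjust to your install)
      oc logs deploy/openshift-pipelines-operator -n openshift-operators --tail=200 \
        | grep 'fetching resource' | tail -n 5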

      Reproduction Scenario

      This commonly occurs when:

      1. Tekton CRDs exist with active workloads (PipelineRuns, TaskRuns)

      2. An admin attempts to delete the CRDs (intentionally or during troubleshooting)

      3. The CRDs enter the TERMINATING state (Kubernetes won't delete them until the workloads are cleaned up; a quick check is shown after this list)

      4. The operator is reinstalled or restarted (e.g., during an upgrade/downgrade)

      5. The operator tries to reconcile while the CRDs are TERMINATING

      6. Deadlock occurs: the operator gets stuck in an infinite CRD fetch loop
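
      A quick way to confirm step 3 is to look at the CRD itself: a CRD blocked by remaining custom resources keeps its deletionTimestamp and reports a Terminating condition in its status. The commands below are a sketch using tasks.tekton.dev as an example; any of the Tekton CRDs works.

      # Check whether the CRD is stuck in TERMINATING
      oc get crd tasks.tekton.dev -o jsonpath='{.metadata.deletionTimestamp}{"\n"}'
      oc get crd tasks.tekton.dev \
        -o jsonpath='{.status.conditions[?(@.type=="Terminating")].message}{"\n"}'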

      Impact

      Severity: Critical

      Affected Operations:

      • Fresh installations (if any CRDs are TERMINATING)
      • Operator upgrades/downgrades
      • Operator restarts during troubleshooting
      • Recovery from corrupted states

      Customer Impact:

      • Complete operator failure - no Tekton components running
      • Pipelines infrastructure non-functional
      • Requires manual intervention to recover (creating ServiceAccounts manually or removing finalizers)

      Steps to Reproduce

      A reproduction script is available: reproduce-deadlock.sh

      Manual steps:

      # 1. Install the operator and wait for it to be Ready
      oc apply -f subscription.yaml

      # 2. Create test workloads
      oc create namespace test-pipelines
      cat <<EOF | oc apply -f -
      apiVersion: tekton.dev/v1
      kind: Task
      metadata:
        name: test-task
        namespace: test-pipelines
      spec:
        steps:
          - name: echo
            image: registry.access.redhat.com/ubi9/ubi-minimal:latest
            script: echo "test"
      ---
      apiVersion: tekton.dev/v1
      kind: TaskRun
      metadata:
        name: test-taskrun
        namespace: test-pipelines
      spec:
        taskRef:
          name: test-task
      EOF

      # 3. Delete CRDs while workloads exist (creates TERMINATING state)
      oc delete crd pipelineruns.tekton.dev taskruns.tekton.dev tasks.tekton.dev &

      # 4. Wait 10 seconds for CRDs to enter TERMINATING state
      sleep 10

      # 5. Reinstall the operator while CRDs are TERMINATING
      CURRENT_CSV=$(oc get csv -n openshift-operators -l operators.coreos.com/openshift-pipelines-operator-rh.openshift-operators -o jsonpath='{.items[0].metadata.name}')
      oc delete subscription openshift-pipelines-operator-rh -n openshift-operators
      oc delete csv "$CURRENT_CSV" -n openshift-operators
      oc apply -f subscription.yaml

      # 6. Wait 60 seconds and verify the deadlock
      sleep 60
      oc get sa -n openshift-pipelines | grep tekton || echo "DEADLOCK: No ServiceAccounts!"

      Expected result: No ServiceAccounts created, no pods running, InstallerSets stuck

      Proposed Fix

      Option 1 (Recommended): Skip all terminating resources

      if res.GetDeletionTimestamp() != nil {
          ressourceLogger.Debug("resource is being deleted, skipping and continuing with other resources")
          continue  // ← Changed from: return v1alpha1.RECONCILE_AGAIN_ERR
      }
      

      Rationale:

      • Minimal code change (1 line)
      • Correct behavior: don't block reconciliation on resources being deleted by other controllers
      • If we try to recreate a resource too early, Kubernetes will reject it (409) and we'll retry naturally

      Alternative options are documented in FIX-OPTIONS-SUMMARY.md.

      Workaround

      Manual recovery requires:

      • Create the missing ServiceAccounts manually, or remove finalizers from the TERMINATING resources to unblock CRD deletion (a sketch of the finalizer path follows below)

      • Restart the operator pod

      See fix-installerset-deadlock.sh for an automated workaround script.
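
      A minimal sketch of the finalizer-removal path, assuming the stuck workloads live in a single known namespace (test-pipelines, as in the reproduction above) and the default operator deployment name; prefer fix-installerset-deadlock.sh where available.

      # 1. Clear finalizers on the remaining TaskRuns/PipelineRuns so the
      #    TERMINATING CRDs can finish deleting (namespace assumed)
      oc get taskruns.tekton.dev,pipelineruns.tekton.dev -n test-pipelines -o name \
        | xargs -r -I{} oc patch {} -n test-pipelines --type=merge \
            -p '{"metadata":{"finalizers":[]}}'

      # 2. Restart the operator pod so reconciliation starts from scratch
      #    (deployment name assumed from a default OLM install)
      oc rollout restart deployment/openshift-pipelines-operator -n openshift-operators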

      Related Issues

      This was discovered during investigation of OSP 1.15 to 1.14 downgrade failures (SRVKP-10509).

      Additional Context

      • Detailed root cause analysis: INSTALLERSET-DEADLOCK-ROOT-CAUSE.md
      • Fix options comparison: FIX-OPTIONS-SUMMARY.md
      • Reproduction script: reproduce-deadlock.sh
      • Code location: pkg/reconciler/kubernetes/tektoninstallerset/install.go:166-168
