Type: Bug
Resolution: Unresolved
Priority: Critical
Sprint: Pipelines Sprint CrookShank 49
Summary
The operator enters a deadlock state when any resource (e.g., CRD) has a deletionTimestamp during InstallerSet reconciliation. The operator aborts the entire reconciliation phase immediately, preventing critical namespace-scoped resources (ServiceAccounts, RBAC) from being created, which in turn prevents all component Deployments from starting.
Root Cause
Location: pkg/reconciler/kubernetes/tektoninstallerset/install.go lines 166-168
{code:go}
if res.GetDeletionTimestamp() != nil {
    ressourceLogger.Debug("resource is being deleted, will reconcile again")
    return v1alpha1.RECONCILE_AGAIN_ERR // ← BUG: aborts the entire phase
}
{code}
Problem: The ensureResources() function returns RECONCILE_AGAIN_ERR immediately when it encounters any resource with a deletionTimestamp, even if that resource is:
- Not owned by this InstallerSet
- Being deleted by another controller
- A CRD stuck in TERMINATING state due to existing workloads
This aborts the entire reconciliation phase, preventing subsequent resources in that phase from being processed.
The Deadlock Cycle
InstallerSet reconciliation happens in sequential phases:
- Phase 1: CRDs - if a CRD is TERMINATING, the operator returns RECONCILE_AGAIN_ERR
- Phase 2: Cluster-scoped resources - never reached due to the Phase 1 abort
- Phase 3: Namespace-scoped resources - never reached, so ServiceAccounts are not created
- Phase 4: Deployments - never reached
Result:
- No ServiceAccounts → Deployments can't start (serviceaccount not found errors)
- No Deployments → Webhooks don't start
- No Webhooks → TektonConfig can't reconcile
- Operator keeps retrying Phase 1 infinitely, fetching the same TERMINATING CRD repeatedly
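The retry loop is visible directly in the operator logs. The snippet below is a hedged sketch: the openshift-operators namespace and the openshift-pipelines-operator deployment name are assumptions for a default OLM install and may differ in your cluster.
{code:bash}
# Assumed names for a default OLM install; adjust namespace/deployment as needed.
OPERATOR_NS=openshift-operators
OPERATOR_DEPLOY=openshift-pipelines-operator

# The same TERMINATING CRD is fetched over and over, with no progress past Phase 1.
oc logs -n "$OPERATOR_NS" "deploy/$OPERATOR_DEPLOY" --tail=200 | grep "fetching resource"
{code}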
Symptoms
When the deadlock occurs:
- ✅ Operator pod is running
- ✅ TektonInstallerSets are created (pipeline-main-static, pipeline-main-deployment, etc.)
- ❌ InstallerSets stuck with status "Install failed" and message "reconcile again and proceed"
- ❌ No ServiceAccounts in openshift-pipelines namespace (pipeline, tekton-pipelines-controller, tekton-pipelines-webhook)
- ❌ No RBAC resources (Roles, RoleBindings)
- ❌ No Deployments created
- ❌ No pods running in openshift-pipelines
- ❌ TektonConfig stuck in non-Ready state
- ❌ Operator logs show an infinite loop fetching CRDs ("fetching resource CustomResourceDefinition" repeated); see the diagnostic commands below
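A quick way to confirm the missing-resource symptoms (a sketch, assuming the default openshift-pipelines target namespace used throughout this report):
{code:bash}
# ServiceAccounts and RBAC that should exist but do not
oc get serviceaccounts,roles,rolebindings -n openshift-pipelines

# No Deployments or pods for the components
oc get deployments,pods -n openshift-pipelines

# InstallerSet names and their condition messages ("reconcile again and proceed")
oc get tektoninstallerset -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.conditions[*].message}{"\n"}{end}'
{code}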
Reproduction Scenario
This commonly occurs when:
1. Tekton CRDs exist with active workloads (PipelineRuns, TaskRuns)
2. An admin attempts to delete the CRDs (intentionally or during troubleshooting)
3. The CRDs enter the TERMINATING state (Kubernetes will not delete them until the workloads are cleaned up)
4. The operator is reinstalled or restarted (e.g., during an upgrade/downgrade)
5. The operator tries to reconcile while the CRDs are TERMINATING
6. Deadlock occurs - the operator gets stuck in an infinite CRD fetch loop (a check for the TERMINATING precondition is sketched below)
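To confirm the precondition for this scenario, check whether the Tekton CRDs carry a deletionTimestamp, which Kubernetes sets while it waits for the remaining workloads to be cleaned up. A minimal check:
{code:bash}
# A non-empty deletionTimestamp means the CRD is stuck in TERMINATING.
for crd in pipelineruns.tekton.dev taskruns.tekton.dev tasks.tekton.dev; do
  printf '%s: ' "$crd"
  oc get crd "$crd" -o jsonpath='{.metadata.deletionTimestamp}{"\n"}'
done
{code}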
Impact
Severity: Critical
Affected Operations:
- Fresh installations (if any CRDs are TERMINATING)
- Operator upgrades/downgrades
- Operator restarts during troubleshooting
- Recovery from corrupted states
Customer Impact:
- Complete operator failure - no Tekton components running
- Pipelines infrastructure non-functional
- Requires manual intervention to recover (creating ServiceAccounts manually or removing finalizers)
Steps to Reproduce
A reproduction script is available: reproduce-deadlock.sh
Manual steps:
{code:bash}
# 1. Install operator and wait for it to be Ready
oc apply -f subscription.yaml

# 2. Create test workloads
oc create namespace test-pipelines
cat <<EOF | oc apply -f -
apiVersion: tekton.dev/v1
kind: Task
metadata:
  name: test-task
  namespace: test-pipelines
spec:
  steps:
    - name: echo
      image: registry.access.redhat.com/ubi9/ubi-minimal:latest
      script: echo "test"
---
apiVersion: tekton.dev/v1
kind: TaskRun
metadata:
  name: test-taskrun
  namespace: test-pipelines
spec:
  taskRef:
    name: test-task
EOF

# 3. Delete CRDs while workloads exist (creates TERMINATING state)
oc delete crd pipelineruns.tekton.dev taskruns.tekton.dev tasks.tekton.dev &

# 4. Wait 10 seconds for CRDs to enter TERMINATING state
sleep 10

# 5. Reinstall operator while CRDs are TERMINATING
CURRENT_CSV=$(oc get csv -n openshift-operators -l operators.coreos.com/openshift-pipelines-operator-rh.openshift-operators -o jsonpath='{.items[0].metadata.name}')
oc delete subscription openshift-pipelines-operator-rh -n openshift-operators
oc delete csv "$CURRENT_CSV" -n openshift-operators
oc apply -f subscription.yaml

# 6. Wait 60 seconds and verify deadlock
sleep 60
oc get sa -n openshift-pipelines | grep tekton || echo "DEADLOCK: No ServiceAccounts!"
{code}
Expected result: No ServiceAccounts created, no pods running, InstallerSets stuck
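A hedged verification snippet for the expected end state; it mirrors the last step of the script and additionally dumps the TektonConfig conditions (the singleton TektonConfig is assumed to have the default name, config):
{code:bash}
# No tekton ServiceAccounts means the namespace-scoped phase never ran.
oc get sa -n openshift-pipelines | grep tekton || echo "DEADLOCK: No ServiceAccounts!"

# TektonConfig should be stuck in a non-Ready state.
oc get tektonconfig config -o jsonpath='{range .status.conditions[*]}{.type}{"="}{.status}{" "}{end}{"\n"}'
{code}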
Proposed Fix
Option 1 (Recommended): Skip all terminating resources
{code:go}
if res.GetDeletionTimestamp() != nil {
    ressourceLogger.Debug("resource is being deleted, skipping and continuing with other resources")
    continue // ← changed from: return v1alpha1.RECONCILE_AGAIN_ERR
}
{code}
Rationale:
- Minimal code change (1 line)
- Correct behavior: don't block reconciliation on resources being deleted by other controllers
- If the operator tries to recreate a resource too early, the API server rejects it (409 Conflict) and the next reconcile retries naturally
Alternative options documented in: FIX-OPTIONS-SUMMARY.md
Workaround
Manual recovery requires:
- Create the missing ServiceAccounts manually, or remove the finalizers from the TERMINATING resources to unblock CRD deletion (see the command sketch below)
- Restart the operator pod
See: fix-installerset-deadlock.sh for automated workaround script
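For reference, a manual version of this workaround might look like the sketch below. It is not the fix-installerset-deadlock.sh script: the ServiceAccount names come from the Symptoms section, the example TaskRun matches the reproduction steps, and the operator deployment name is an assumption for a default OLM install.
{code:bash}
# Option A: create the missing ServiceAccounts by hand (names from the Symptoms section)
for sa in pipeline tekton-pipelines-controller tekton-pipelines-webhook; do
  oc create serviceaccount "$sa" -n openshift-pipelines
done

# Option B: clear finalizers on the stuck workloads so the TERMINATING CRDs can finish deleting
# (example resource from the reproduction steps; adjust to whatever is stuck in your cluster)
oc patch taskrun test-taskrun -n test-pipelines --type=merge -p '{"metadata":{"finalizers":[]}}'

# Then restart the operator so reconciliation starts over
# (deployment name assumed; verify with: oc get deploy -n openshift-operators)
oc rollout restart deployment/openshift-pipelines-operator -n openshift-operators
{code}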
Related Issues
This was discovered during investigation of OSP 1.15 to 1.14 downgrade failures (SRVKP-10509).
Additional Context
- Detailed root cause analysis: INSTALLERSET-DEADLOCK-ROOT-CAUSE.md
- Fix options comparison: FIX-OPTIONS-SUMMARY.md
- Reproduction script: reproduce-deadlock.sh
- Code location: pkg/reconciler/kubernetes/tektoninstallerset/install.go:166-168
Issue Links
- relates to: SRVKP-10509 - ServiceMonitor has hardcoded openshift-operators in namespaceSelector, causing Prometheus failures when operator is installed in different namespace (Verified)