-
Bug
-
Resolution: Done
-
Major
-
None
-
Pipelines 1.19.0
-
False
-
-
False
-
Release Note Not Required
-
-
Description of problem:
When Tekton Results becomes overloaded, for example when a surge of 5-6x the normal volume of PLRs completes, it can get into a state where the workqueue is so large that new object creation events cannot be processed before the objects are deleted. As a result, it is unable to add finalizers to TaskRuns and PipelineRuns before they have completed and been pruned.
Because almost every queued event can no longer be processed, Results appears to get into a state where it tries to reconcile every object, fails permanently, but still retries the reconciliation after some time. As a result, the workqueue appears "low" and reconciliation latency appears "low", but the reconciliation success rate is extremely poor.
All of these thousands of stale reconciliations are now invisibly stored in the retry queue, even though their k8s objects have long since been deleted.
Recovery is straightforward but manual: restart the pod. However, Results needs to be able to recover from this properly on its own. If an object no longer exists in the cluster, Results should not keep retrying to reconcile it.
Prerequisites (if any, like setup, operators/versions):
Steps to Reproduce
# <steps>
Actual results:
Expected results:
Reproducibility (Always/Intermittent/Only Once):
Acceptance criteria:
Definition of Done:
Build Details:
Additional info (Such as Logs, Screenshots, etc):