-
Bug
-
Resolution: Done
-
Blocker
-
Pipelines 1.17.0
-
24
-
False
-
None
-
False
-
-
-
Story:
As a developer,
I want the pipeline to handle concurrency failures gracefully when the cluster becomes busy,
so that pipeline runs continue to progress without encountering deadlocks.
Description of problem:
When pipeline runs encounter concurrency failures under high load, the process fails to release locks properly. The semaphore does not remove the lock held by the failed run, so the runs queued behind it deadlock. The issue arises specifically when the underlying cluster is slow to update the pipeline run's "in progress" status.
To reproduce this issue, modify the updatePipelineRunToInProgress function to simulate concurrency failures with the following test setup:
- Set up a repository with three pipeline runs: test-1, test-2, and test-3, all matching a pull request.
- Configure a concurrency limit of 1 in the repository specification.
- When the pull request is executed:
  - test-1 should run successfully.
  - test-2 should encounter an error.
  - test-3 should be triggered and start running.
In a high-load scenario, test-2 fails but test-3 should still start; however, due to a deadlock caused by lock retention, test-3 remains stalled.
Prerequisites (if any, like setup, operators/versions):
- The pipeline should gracefully handle concurrency failures without causing a deadlock.
- Pipeline runs should continue to the next available run when one run encounters a failure.
- The semaphore should release locks appropriately to prevent deadlocks.
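The lock-retention behavior described above can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not the project's actual semaphore: the `queue` type, `startRun`, and `process` are hypothetical names introduced for illustration, and `startRun` only stands in for updatePipelineRunToInProgress. With a concurrency limit of 1, keeping the lock after test-2 fails leaves test-3 stalled, while releasing it lets test-3 start.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// queue is a toy stand-in for the concurrency semaphore (hypothetical type):
// at most limit runs hold a lock at once; the rest wait in pending.
type queue struct {
	limit   int
	running map[string]bool
	pending []string
}

func newQueue(limit int, runs ...string) *queue {
	return &queue{limit: limit, running: map[string]bool{}, pending: runs}
}

// acquireNext hands the lock to the next pending run if a slot is free.
func (q *queue) acquireNext() (string, bool) {
	if len(q.running) >= q.limit || len(q.pending) == 0 {
		return "", false
	}
	name := q.pending[0]
	q.pending = q.pending[1:]
	q.running[name] = true
	return name, true
}

// release frees the slot held by name.
func (q *queue) release(name string) { delete(q.running, name) }

// startRun is a hypothetical stand-in for updatePipelineRunToInProgress;
// it fails for test-2, mirroring the reproduction patch.
func startRun(name string) error {
	if strings.HasPrefix(name, "test-2") {
		return errors.New("simulated failure for " + name)
	}
	return nil
}

// process drains the queue and returns the runs that started. When
// releaseOnFailure is false, a failed start keeps its lock (the buggy case).
func process(q *queue, releaseOnFailure bool) []string {
	var started []string
	for {
		name, ok := q.acquireNext()
		if !ok {
			return started
		}
		if err := startRun(name); err != nil {
			if releaseOnFailure {
				q.release(name)
			}
			continue
		}
		started = append(started, name)
		q.release(name) // the run completes and frees its slot
	}
}

func main() {
	fmt.Println(process(newQueue(1, "test-1", "test-2", "test-3"), false)) // [test-1]
	fmt.Println(process(newQueue(1, "test-1", "test-2", "test-3"), true))  // [test-1 test-3]
}
```

In this model the only difference between the two calls is whether the failure path releases the lock, which is the behavior the criteria above ask for.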
Steps to Reproduce
- Implement the following patch in reconciler/reconciler.go:
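// randomError simulates a failure for test-2 only; needs the "fmt" and "strings" imports.
func randomError(prn string) error {
	if strings.HasPrefix(prn, "test-2") {
		return fmt.Errorf("simulated error for %s", prn)
	}
	return nil
}
Add this at the beginning of the updatePipelineRunToInProgress function:
if err := randomError(pr.GetName()); err != nil {
	return err
}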
- Trigger a pull request to initiate test-1, test-2, and test-3 with the concurrency limit set to 1.
- Note: the issue otherwise occurs only under high load, which is why updatePipelineRunToInProgress is patched to simulate it.
- Despite its name, randomError is deterministic: it forces an error for every run whose name starts with test-2, which exercises the concurrency failure path without a loaded cluster.
Expected results:
Reproducibility (Always/Intermittent/Only Once):
Acceptance criteria:
Definition of Done:
Build Details:
Additional info (Such as Logs, Screenshots, etc):