-
Bug
-
Resolution: Done
-
Major
-
Pipelines 1.19.0
-
None
-
9
-
False
-
-
False
-
-
Bug Fix
-
Done
-
-
Story (Required)
As a Pipelines-as-Code system trying to manage concurrent pipeline execution, I want the priority semaphore to acquire locks without causing deadlocks, race conditions, or panics, so that the system remains stable and reliable under high load.
This story addresses critical concurrency bugs in the prioritySemaphore implementation. The existing code can lead to deadlocks, panics on empty queues, and data races, causing production instability. Fixing these issues is essential for ensuring the system is robust and production-ready for managing concurrent tasks.
Background (Required)
The `prioritySemaphore` implementation in `pkg/sync/semaphore.go` suffers from several critical concurrency issues that impact system stability:
- Deadlock on Lock Acquisition: The `tryAcquire()` method holds a lock and then calls the `acquire()` method, which attempts to acquire the same non-reentrant `sync.Mutex` again. This causes the goroutine to deadlock and wait forever.
- Panic on Empty Queue: The `tryAcquire()` method contains a race condition where it unconditionally calls `s.pending.pop()`. If the method is invoked when the pending queue is empty, this results in an "index out of range" panic, crashing the process.
- Data Races in Getter Methods: Getter methods such as `getLimit()`, `getCurrentPending()`, and `getCurrentRunning()` read shared data structures (maps, slices) without acquiring a lock. This creates data races when other goroutines modify these structures concurrently, leading to unpredictable behavior.
Out of scope
- Major refactoring of the semaphore's core logic beyond fixing the identified bugs.
- Replacing the underlying semaphore or priority queue implementation with a different library.
- Adding new features to the semaphore mechanism.
Approach (Required)
The solution involves addressing each bug with a targeted fix and adding comprehensive tests.
- 1. Fix Deadlock in `tryAcquire`:
- The recursive call to `s.acquire()` from within `tryAcquire()` will be removed.
- The semaphore acquisition logic (`s.semaphore.TryAcquire(1)`) and state update (`s.running[key] = true`) will be implemented directly inside the `tryAcquire()` method, respecting the existing lock.
- 2. Prevent Panic on Empty Queue:
- A `shouldPopFromQueue` boolean flag will be introduced within `tryAcquire()`.
- This flag is set to `true` only when an item is being processed from a non-empty queue.
- The call to `s.pending.pop()` will be made conditional on this flag, ensuring it is never called on an empty queue.
Introduced a `shouldPopFromQueue` flag to track whether we're processing a queued item vs. a direct semaphore acquisition:
```go
{ // ... shouldPopFromQueue = true // Only set when queue has items }
var nextKey string
shouldPopFromQueue := false
if s.pending.Len() > 0
if s.semaphore.TryAcquire(1) {
s.running[key] = true
if shouldPopFromQueue
return true, ""
}