Uploaded image for project: 'OpenShift Pipelines'
  1. OpenShift Pipelines
  2. SRVKP-8198

Fix Deadlock, Race Conditions, and Panic in prioritySemaphore

XMLWordPrintable

    • 9
    • False
    • Hide

      None

      Show
      None
    • False
    • Hide
      Previously, the `prioritySemaphore` implementation could cause deadlocks, race conditions, and panics when handling concurrent
      operations. This was due to improper recursive lock acquisition and a lack of thread safety in data access methods. With this
      update, the locking logic has been corrected and all shared data access is now properly synchronized, ensuring system stability
      under high load
      Show
      Previously, the `prioritySemaphore` implementation could cause deadlocks, race conditions, and panics when handling concurrent operations. This was due to improper recursive lock acquisition and a lack of thread safety in data access methods. With this update, the locking logic has been corrected and all shared data access is now properly synchronized, ensuring system stability under high load
    • Bug Fix
    • Done

      Story (Required)

      As a Pipelines-as-Code system trying to manage concurrent pipeline execution, I want the priority semaphore to acquire locks without causing deadlocks, race conditions, or panics, so that the system remains stable and reliable under high load.

      This story addresses critical concurrency bugs in the prioritySemaphore implementation. The existing code can lead to deadlocks, panics on empty queues, and data races, causing production instability. Fixing these issues is essential for ensuring the system is robust and production-ready for managing concurrent tasks.

      Background (Required)

      The `prioritySemaphore` implementation in `pkg/sync/semaphore.go` suffers from several critical concurrency issues that impact system stability:

      • Deadlock on Lock Acquisition: The `tryAcquire()` method holds a lock and then calls the `acquire()` method, which attempts to acquire the same non-reentrant `sync.Mutex` again. This causes the goroutine to deadlock and wait forever.
      • Panic on Empty Queue: The `tryAcquire()` method contains a race condition where it unconditionally calls `s.pending.pop()`. If the method is invoked when the pending queue is empty, this results in an "index out of range" panic, crashing the process.
      • Data Races in Getter Methods: Getter methods such as `getLimit()`, `getCurrentPending()`, and `getCurrentRunning()` read shared data structures (maps, slices) without acquiring a lock. This creates data races when other goroutines modify these structures concurrently, leading to unpredictable behavior.

        Out of scope

      • Major refactoring of the semaphore's core logic beyond fixing the identified bugs.
      • Replacing the underlying semaphore or priority queue implementation with a different library.
      • Adding new features to the semaphore mechanism.

        Approach (Required)

      The solution involves addressing each bug with a targeted fix and adding comprehensive tests.

      • 1. Fix Deadlock in `tryAcquire`:
        • The recursive call to `s.acquire()` from within `tryAcquire()` will be removed.
        • The semaphore acquisition logic (`s.semaphore.TryAcquire(1)`) and state update (`s.running[key] = true`) will be implemented directly inside the `tryAcquire()` method, respecting the existing lock.
      • 2. Prevent Panic on Empty Queue:
        • A `shouldPopFromQueue` boolean flag will be introduced within `tryAcquire()`.
        • This flag is set to `true` only when an item is being processed from a non-empty queue.
        • The call to `s.pending.pop()` will be made conditional on this flag, ensuring it is never called on an empty queue.

          Introduced a `shouldPopFromQueue` flag to track whether we're processing a queued item vs. a direct semaphore acquisition:

          ```go
          var nextKey string
          shouldPopFromQueue := false
          if s.pending.Len() > 0

          { // ... shouldPopFromQueue = true // Only set when queue has items }

      if s.semaphore.TryAcquire(1) {
      s.running[key] = true
      if shouldPopFromQueue

      { // Only pop when we should s.pending.pop() }

      return true, ""
      }

              cboudjna@redhat.com Chmouel Boudjnah
              cboudjna@redhat.com Chmouel Boudjnah
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

                Created:
                Updated:
                Resolved: