RHEL-111803

stalld: Incorrect starvation detection after task CPU migration


    • Bug
    • Resolution: Unresolved
    • rhel-9.8
    • stalld
    • Moderate
    • rhel-kernel-rts-time

      Problem Summary

      stalld does not properly track task CPU migrations. When a monitored task moves from one CPU to another, the daemon's internal state becomes outdated. This leads to incorrect starvation detection, causing stalld to erroneously boost tasks that are no longer starving or fail to un-boost tasks that have moved.

      Root Cause Analysis

      The root cause is that stalld does not appear to handle the sched_migrate_task event. When this event occurs, the daemon should update its internal queues by removing the task from the old CPU's queue and adding it to the new target CPU's queue. This logic is currently missing.
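
      The queue update described above can be modeled in a few lines of shell. This is only a sketch of the bookkeeping, not stalld's code: the function names (track_task, handle_migrate, tasks_on_cpu) and the flat "pid:cpu" list are assumptions made for illustration, and the sample line follows the text format of the kernel's sched_migrate_task tracepoint. Rewriting a task's entry is equivalent to removing it from the old CPU's queue and adding it to the destination CPU's queue.

```shell
#!/bin/sh
# Illustrative model only; none of these names come from the stalld sources.
# "tracked" holds one "pid:cpu" entry per monitored task.
tracked=""

track_task() {            # track_task PID CPU
    tracked="$tracked $1:$2"
}

# Apply one sched_migrate_task trace line, e.g.
#   sched_migrate_task: comm=bash pid=1000 prio=120 orig_cpu=1 dest_cpu=2
handle_migrate() {
    m_pid=$(printf '%s\n' "$1" | sed -n 's/.* pid=\([0-9]*\) .*/\1/p')
    m_dest=$(printf '%s\n' "$1" | sed -n 's/.*dest_cpu=\([0-9]*\).*/\1/p')
    [ -n "$m_pid" ] && [ -n "$m_dest" ] || return 1   # not a migration event
    m_out=""
    for m_entry in $tracked; do
        case $m_entry in
            "$m_pid":*) m_entry="$m_pid:$m_dest" ;;   # move task to the destination CPU
        esac
        m_out="$m_out $m_entry"
    done
    tracked=$m_out
}

tasks_on_cpu() {          # tasks_on_cpu CPU -> prints the PIDs queued on that CPU
    t_out=""
    for t_entry in $tracked; do
        case $t_entry in
            *:"$1") t_out="$t_out ${t_entry%%:*}" ;;
        esac
    done
    echo $t_out
}

track_task 1000 1         # SCHED_NORMAL task, initially on CPU 1
track_task 1001 1         # SCHED_FIFO task, initially on CPU 1
handle_migrate 'sched_migrate_task: comm=bash pid=1000 prio=120 orig_cpu=1 dest_cpu=2'
echo "cpu 1: $(tasks_on_cpu 1)"   # prints "cpu 1: 1001"
echo "cpu 2: $(tasks_on_cpu 2)"   # prints "cpu 2: 1000"
```

      After the event is applied, PID 1000 no longer appears on CPU 1, so a starvation check for CPU 1 would no longer consider it.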

       


      Reproduction Scenarios

      Scenario 1: Migrating the Starving (SCHED_NORMAL) Task

      Steps to Reproduce:

      1. Run stalld on CPU 0, monitoring CPUs 1 and 2:
        stalld -v -b queue_track -c 1,2 -a 0
      2. Create two CPU-bound tasks pinned to CPU 1 (the PIDs shown are examples):
        taskset -c 1 bash -c 'while :; do :; done' & # SCHED_NORMAL task (e.g., PID=1000)
        taskset -c 1 chrt -f 40 bash -c 'while :; do :; done' & # SCHED_FIFO task (e.g., PID=1001)
      3. Migrate the SCHED_NORMAL task (PID 1000) to CPU 2 (affinity mask 0x4):
        taskset -p 04 1000
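
      taskset -p takes a hexadecimal bitmask in which bit N selects CPU N, so 04 selects CPU 2. The helpers below are a small sketch for deriving the mask and for confirming where a task actually runs after the migration. The function names cpu_to_mask and current_cpu are hypothetical, and current_cpu assumes the procps ps, whose PSR column reports the processor a task last ran on.

```shell
#!/bin/sh
# cpu_to_mask CPU -> hexadecimal affinity mask with only that CPU's bit set
cpu_to_mask() {
    printf '%x\n' $((1 << $1))
}

# current_cpu PID -> CPU the task last ran on (PSR column of procps ps)
current_cpu() {
    ps -o psr= -p "$1" | tr -d ' '
}

cpu_to_mask 2             # prints 4, the same mask as the 04 used above
```

      With these, the migration step could be written as taskset -p "$(cpu_to_mask 2)" 1000, and current_cpu 1000 should then report 2 once the task has run on its new CPU.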

      Expected Result:

      Once the SCHED_NORMAL task is moved to CPU 2, it is no longer competing with the SCHED_FIFO task on CPU 1. Therefore, stalld should recognize the migration and stop boosting it.

      Actual Result:

      stalld fails to detect the migration and continues to incorrectly report the SCHED_NORMAL task as starving on its original CPU (CPU 1), boosting it erroneously.

      stalld: found task: bash:1000 starving in CPU 1
      single_threaded_main: checking cpu 1 - rt: 1 - starving: 1
      stalld: cpu 1: pid: 1000 starving for 98
      stalld: boosted pid 1000 (bash) (cpu 1) using SCHED_DEADLINE
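
      One way to confirm the mismatch is to compare the CPU named in a "boosted pid" line with the CPU the task is really on. The sketch below assumes the log line format shown above; boosted_pid_cpu and stale_boost are hypothetical helper names, and the actual CPU is passed in by the caller (for example from ps -o psr= -p PID).

```shell
#!/bin/sh
# boosted_pid_cpu LINE -> prints "PID CPU" taken from a stalld "boosted pid" line
boosted_pid_cpu() {
    printf '%s\n' "$1" |
        sed -n 's/.*boosted pid \([0-9]*\) .*(cpu \([0-9]*\)).*/\1 \2/p'
}

# stale_boost LINE ACTUAL_CPU -> succeeds when the logged CPU differs from ACTUAL_CPU
stale_boost() {
    s_logged=$(boosted_pid_cpu "$1")
    [ -n "$s_logged" ] && [ "${s_logged#* }" != "$2" ]
}

line='stalld: boosted pid 1000 (bash) (cpu 1) using SCHED_DEADLINE'
if stale_boost "$line" 2; then
    echo "stale: log says cpu 1, task is on cpu 2"
fi
```

      In Scenario 1 the task is on CPU 2 after the migration, so the check above flags the boost as stale.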

      Scenario 2: Migrating the High-Priority (SCHED_FIFO) Task

      Steps to Reproduce:

      1. Follow the initial setup (steps 1-2) from Scenario 1.
      2. Migrate the SCHED_FIFO task (PID 1001) to CPU 2 (affinity mask 0x4):
        taskset -p 04 1001

      Expected Result:

      With the SCHED_FIFO task no longer on CPU 1, the SCHED_NORMAL task is no longer starving. stalld should detect this change and stop boosting the SCHED_NORMAL task.

      Actual Result:

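      stalld fails to detect that the SCHED_FIFO task has moved. It continues to treat the SCHED_NORMAL task as starving on CPU 1 and boosts it indefinitely. (The log below comes from a run with different PIDs than the examples above; PID 2714 is the task reported as starving.)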

      stalld: cpu: 1 pid: 2714 ctx: 10
      stalld: cpu: 1 pid: 2713 ctx: 182 R
      stalld: found task: bash:2714 starving in CPU 1
      single_threaded_main: checking cpu 1 - rt: 1 - starving: 1
      stalld: cpu 1: pid: 2714 starving for 361
      stalld: boosted pid 2714 (bash) (cpu 1) using SCHED_DEADLINE.

              wandercosta Wander Costa
              wandercosta Wander Costa
              Clark Williams Clark Williams
              Chang Yin Chang Yin