Bug
Resolution: Unresolved
rhel-10.1
Important
rhel-kernel-rts-time
Problem Summary
When the deadline (DL) server, exposed per CPU as fair_server in debugfs, is disabled and then re-enabled on a monitored CPU, stalld fails to update its internal state. It keeps reporting tasks as starving even though the DL server is active again and is giving them CPU time, which produces persistent false-positive starvation reports.
Steps to Reproduce
- Start stalld on CPU 0 to monitor CPU 1:
stalld -v -b queue_track -c 1 -a 0
- Start two CPU-bound tasks on CPU 1, one SCHED_NORMAL (e.g., PID 9368) and one SCHED_FIFO (e.g., PID 9369):
taskset -c 1 bash -c 'while :; do :; done' &
taskset -c 1 chrt -f 40 bash -c 'while :; do :; done' &
- Observe Initial State: stalld finds the SCHED_NORMAL task (9368) ready to run on CPU 1.
stalld: found task: bash:9368 ready to run in CPU 1
single_threaded_main: checking cpu 1 - rt: 1 - starving: 2
- Disable the DL server on CPU 1:
echo 0 > /sys/kernel/debug/sched/fair_server/cpu1/runtime
- Observe State with DL Disabled: As expected, stalld starts to report task 9368 as starving.
stalld: cpu: 1 pid: 9368 ctx: 5459 R
stalld: cpu: 1 pid: 9369 ctx: 5578
stalld: found task: bash:9368 starving in CPU 1
single_threaded_main: checking cpu 1 - rt: 1 - starving: 1
stalld: cpu 1: pid: 9369 starving for 10
- Re-enable the DL server on CPU 1:
echo 50000000 > /sys/kernel/debug/sched/fair_server/cpu1/runtime
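The steps above could be scripted roughly as follows. This is only a sketch: the commands, CPU numbers, and the 50000000 runtime value are taken from the steps, while the function names, the `run` argument, and the 30-second pauses are assumptions made for illustration. It also assumes stalld stays in the foreground when started with -v, and that the script runs as root on a disposable test machine.

```shell
#!/usr/bin/env bash
# Sketch of the reproduction steps. Helper names and sleep durations are
# hypothetical; the commands themselves come from the steps above.

# Build the debugfs path of the fair server runtime file for a CPU.
fair_server_runtime_path() {
    echo "/sys/kernel/debug/sched/fair_server/cpu$1/runtime"
}

# Write a runtime value for a CPU; 0 disables the DL server.
set_fair_server_runtime() {
    echo "$2" > "$(fair_server_runtime_path "$1")"
}

reproduce() {
    local cpu=1 wait_s=${1:-30}

    # stalld runs on CPU 0 and monitors CPU 1.
    stalld -v -b queue_track -c "$cpu" -a 0 &
    local stalld_pid=$!

    # One SCHED_NORMAL and one SCHED_FIFO busy loop, both pinned to CPU 1.
    taskset -c "$cpu" bash -c 'while :; do :; done' &
    local normal_pid=$!
    taskset -c "$cpu" chrt -f 40 bash -c 'while :; do :; done' &
    local fifo_pid=$!

    sleep "$wait_s"                          # observe the initial state
    set_fair_server_runtime "$cpu" 0         # disable the DL server
    sleep "$wait_s"                          # observe starvation reports
    set_fair_server_runtime "$cpu" 50000000  # re-enable the DL server
    sleep "$wait_s"                          # reports should stop here

    kill "$normal_pid" "$fifo_pid" "$stalld_pid"
}

# Only run the scenario when explicitly asked to.
if [ "${1:-}" = "run" ]; then
    reproduce "${2:-30}"
fi
```

Saved under a hypothetical name such as repro.sh, it could be started with `taskset -c 0 ./repro.sh run 30`, which keeps the script itself off the CPU under test.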
Expected Result
Once the DL server is re-enabled, it again grants CPU time to the starved SCHED_NORMAL task. stalld should detect this change and stop reporting task 9368 as starving.
Actual Result
stalld fails to recognize that the DL server is active again. It gets stuck in its previous state and continues to incorrectly report the SCHED_NORMAL task (9368) as starving indefinitely.
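One way to confirm independently of stalld that the report is a false positive is to check whether the task's on-CPU time is still growing. The sketch below samples the first field of /proc/<pid>/schedstat (time spent on the CPU, in nanoseconds) twice and compares the values. It assumes the kernel exposes that file; the function names and the default 5-second interval are illustrative choices, not part of stalld.

```shell
#!/usr/bin/env bash
# Sketch: decide whether a task is actually receiving CPU time.
# Assumes /proc/<pid>/schedstat exists on the running kernel.

# Extract the on-CPU time (first field, nanoseconds) from a schedstat line.
runtime_ns() {
    local line=$1
    echo "${line%% *}"
}

# Print "running" if on-CPU time grew between two samples, else "starved".
classify() {
    local before=$1 after=$2
    if [ "$after" -gt "$before" ]; then
        echo running
    else
        echo starved
    fi
}

# Sample a PID twice, interval seconds apart, and classify it.
check_pid() {
    local pid=$1 interval=${2:-5} before after
    before=$(runtime_ns "$(cat "/proc/$pid/schedstat")")
    sleep "$interval"
    after=$(runtime_ns "$(cat "/proc/$pid/schedstat")")
    classify "$before" "$after"
}

# Only sample when a PID is passed on the command line.
if [ "$#" -ge 1 ]; then
    check_pid "$@"
fi
```

For the example above, running it with arguments `9368 5` after re-enabling the DL server would be expected to print `running` while stalld still reports the task as starving, if the report is indeed a false positive.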