RHEL-153100

[RFE] Periodic thread pool health summary with saturation detection


    • Story
    • Resolution: Unresolved
    • 389-ds-base
    • rhel-idm-ds

      Goal

      As a support engineer reviewing a sosreport from an environment without external monitoring (no PCP, no Grafana), I need periodic thread pool health snapshots in the error log so there is always a historical baseline of worker utilization – and I need the server to escalate severity automatically when sustained saturation is detected.

      Add two configuration attributes: nsslapd-thread-pool-log-interval (seconds, 0 = disabled) and nsslapd-thread-stall-threshold (number of consecutive saturated checks before escalating, 0 = no escalation). Schedule a callback via slapi_eq_repeat_rel that logs a structured NOTICE containing busy workers, queue depth, connection count, and operations initiated and completed.
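
      The snapshot step of such a callback could look roughly like the sketch below. It is not taken from 389-ds-base: the pool_counters and pool_snapshot types, both function names, and the log line layout are hypothetical, and the actual scheduling through slapi_eq_repeat_rel and the call into the error log are left out so that the example compiles on its own.

```c
#include <inttypes.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical counters, updated elsewhere by workers and the listener. */
typedef struct {
    atomic_uint_fast32_t busy_workers;
    atomic_uint_fast32_t queue_depth;
    atomic_uint_fast32_t connections;
    atomic_uint_fast64_t ops_initiated;
    atomic_uint_fast64_t ops_completed;
} pool_counters;

/* Plain copy of the counters as seen by one periodic check. */
typedef struct {
    uint32_t busy, total, queue, conns;
    uint64_t ops_initiated, ops_completed;
} pool_snapshot;

/* Atomic loads only: no locks, no allocation, no per-thread iteration. */
pool_snapshot
pool_take_snapshot(pool_counters *c, uint32_t total_workers)
{
    pool_snapshot s;
    s.busy = (uint32_t)atomic_load_explicit(&c->busy_workers, memory_order_relaxed);
    s.total = total_workers;
    s.queue = (uint32_t)atomic_load_explicit(&c->queue_depth, memory_order_relaxed);
    s.conns = (uint32_t)atomic_load_explicit(&c->connections, memory_order_relaxed);
    s.ops_initiated = (uint64_t)atomic_load_explicit(&c->ops_initiated, memory_order_relaxed);
    s.ops_completed = (uint64_t)atomic_load_explicit(&c->ops_completed, memory_order_relaxed);
    return s;
}

/* Formats one key=value summary line; the layout is an assumption.
 * Returns what snprintf returns (length needed, excluding the NUL). */
int
pool_format_summary(const pool_snapshot *s, char *buf, size_t len)
{
    return snprintf(buf, len,
                    "thread pool: busy=%" PRIu32 "/%" PRIu32 " queue=%" PRIu32
                    " conns=%" PRIu32 " ops_initiated=%" PRIu64
                    " ops_completed=%" PRIu64,
                    s->busy, s->total, s->queue, s->conns,
                    s->ops_initiated, s->ops_completed);
}
```

      Each counter is loaded independently, so a snapshot is not guaranteed to be mutually consistent across fields. For a coarse periodic summary that trade-off seems acceptable, and it keeps the callback lock-free.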

      When the check detects sustained saturation – all workers busy with a growing queue for N consecutive intervals – escalate severity following the disk monitor pattern in daemon.c (NOTICE -> WARNING -> ALERT). Log an INFO entry when saturation resolves, including the duration of the episode. The callback performs only atomic loads and completes in microseconds, so it is safe to run in the event queue. If heavier diagnostics are needed later (for example, iterating the per-thread activity array), that work should be offloaded to a short-lived thread to avoid blocking other event queue callbacks (replication retry, task cleanup, DB compaction).
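
      The escalation logic can be sketched as a small state machine. This is an illustration under stated assumptions, not the disk monitor code from daemon.c: the stall_state type and stall_check function are hypothetical, "growing queue" is read as strictly larger than at the previous check, ALERT is assumed to start at twice the threshold, and the INFO entry is assumed to be emitted only for episodes that reached WARNING.

```c
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

typedef enum { SEV_INFO, SEV_NOTICE, SEV_WARNING, SEV_ALERT } severity;

typedef struct {
    uint32_t threshold;   /* nsslapd-thread-stall-threshold; 0 = no escalation */
    uint32_t consecutive; /* saturated checks seen in a row */
    uint32_t prev_queue;  /* queue depth at the previous check */
    time_t started;       /* time of the first saturated check in the episode */
} stall_state;

/* Evaluates one periodic check and returns the severity to log at.
 * *resolved_secs is set to the episode duration when an escalated
 * episode ends, and to -1 otherwise. */
severity
stall_check(stall_state *st, uint32_t busy, uint32_t total, uint32_t queue,
            time_t now, long *resolved_secs)
{
    /* Assumption: saturated = every worker busy and the queue grew. */
    bool saturated = total > 0 && busy >= total && queue > st->prev_queue;
    st->prev_queue = queue;
    *resolved_secs = -1;

    if (!saturated) {
        bool escalated = st->threshold > 0 && st->consecutive >= st->threshold;
        long duration = (long)difftime(now, st->started);
        st->consecutive = 0;
        st->started = 0;
        if (escalated) {
            *resolved_secs = duration;
            return SEV_INFO; /* saturation resolved */
        }
        return SEV_NOTICE;
    }

    if (st->consecutive == 0) {
        st->started = now;
    }
    if (st->consecutive < UINT32_MAX) {
        st->consecutive++;
    }

    if (st->threshold == 0 || st->consecutive < st->threshold) {
        return SEV_NOTICE;
    }
    /* Assumption: ALERT after twice the threshold; the RFE does not say. */
    if ((uint64_t)st->consecutive < 2 * (uint64_t)st->threshold) {
        return SEV_WARNING;
    }
    return SEV_ALERT;
}
```

      With a threshold of 2, the first saturated check stays at NOTICE, the second and third log WARNING, and the fourth logs ALERT. A queue that sits at a high but constant depth does not count as growing under this reading, which may or may not match the intended definition.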

      When both this health summary and the wtime threshold warning from RHEL-153090 are enabled during sustained saturation, the error log receives two independent streams. They are complementary – the wtime warning focuses on per-operation impact, the health summary on aggregate pool state – and they use different rate limits, so the volume stays manageable. Documentation should clarify the purpose of each.

      Acceptance criteria

      • Verify that setting nsslapd-thread-pool-log-interval: 10 produces a structured NOTICE in the error log every 10 seconds with busy workers, queue depth, connections, and ops counters
      • Verify that setting the interval to 0 disables the summary entirely
      • Verify severity escalation: sustained saturation for N consecutive checks produces WARNING, then ALERT
      • Verify an INFO entry is logged when saturation resolves, including the duration of the saturation episode
      • Verify normal-load summaries remain at NOTICE level and do not escalate
      • Verify that with a 1-second interval under sustained load, the health summary NOTICEs appear at consistent ~1s intervals (no visible drift from the callback blocking the event queue)
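
      For the last criterion, one way to quantify drift is to parse the timestamps of the health summary entries and compute the largest deviation between consecutive gaps and the configured interval. The helper below is a hypothetical sketch: it assumes the timestamps are already available as seconds in a double array, and it does not define what counts as acceptable drift.

```c
#include <stddef.h>

/* Returns the largest absolute difference between the gap separating two
 * consecutive timestamps and the expected interval, in seconds.
 * Returns 0.0 when fewer than two timestamps are given. */
double
max_interval_deviation(const double *ts, size_t n, double interval)
{
    double worst = 0.0;
    for (size_t i = 1; i < n; i++) {
        double dev = (ts[i] - ts[i - 1]) - interval;
        if (dev < 0.0) {
            dev = -dev;
        }
        if (dev > worst) {
            worst = dev;
        }
    }
    return worst;
}
```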

              idm-ds-dev-bugs (IdM DS Dev)
              Simon Pichugin (spichugi@redhat.com)
              IdM DS Dev
              IdM DS QE
              Evgenia Martyniuk