RHEL-116328
System Runner: Observability Stack

    • Bug
    • Resolution: Unresolved
    • rhel-10.0
    • rteval
    • rhel-kernel-rts-time

      • Observability stack — RT/HPC-friendly design (Jira summary)

        Objectives

        • Human observability: Grafana dashboards/search for day-to-day ops.
        • Accurate time correlation: µs-grade joins across userspace logs, kernel printk, traces, and metrics.
        • Low overhead: Safe for PREEMPT_RT and HPC nodes.
        • ML dataset: Produce a unified, time-aligned corpus (logs, metrics, traces, profiler outputs) for training a domain model and later LoRA fine-tuning.

          Components

        • Ingest & storage: Loki (central log store).
        • Visualization: Grafana.
        • Log sources: Node system logs (journald) and, when applicable, CRI container logs (/var/log/containers/*.log). Only one driver per workload to avoid duplicates.
        • Tracing: Perfetto (ftrace/perf based) as primary; optional custom ftrace scraper for narrow event sets.
        • Metrics: Prometheus scraping a lightweight exporter (RT/HPC-specific stats only).

          Logging design (labels vs packed payload)

        • Labels (for fast lookup): minimal, stable set — host, boot_id, transport, severity (+ optional unit/app/container when useful).
        • Packed JSON (per entry): tiny object embedded alongside the line with only the high-value fields:
          • boot_epoch_ns — epoch when CLOCK_MONOTONIC started this boot.
          • kernel_offset_ns — offset between kernel source monotonic and journald monotonic.
          • src_mono_us — present on kernel lines for direct printk correlation.
          • Optional context: unit, app, container, environment hints.

      This keeps ingest light (no full journald JSON export) while preserving rich, on-demand query context.
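
      As an illustration, one shipped entry might look like the sketch below; the label values, example numbers, and envelope shape are assumptions, only the field names come from the design above.

          # Hypothetical shape of a single shipped entry: a minimal label set plus a
          # small packed JSON object carrying the timing fields described above.
          import json

          labels = {
              "host": "node01",              # example values, not real hosts
              "boot_id": "example-boot-id",
              "transport": "kernel",
              "severity": "warning",
          }

          packed = {
              "boot_epoch_ns": 1719223841000000000,  # epoch when CLOCK_MONOTONIC started this boot
              "kernel_offset_ns": 1250000,           # kernel source monotonic minus journald monotonic
              "src_mono_us": 123456789,              # present on kernel lines only
              "unit": "example.service",             # optional context
          }

          # The packed object travels inside the log line itself, not as labels.
          line = json.dumps({"msg": "example kernel message", **packed})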


      Offset generation & correlation

      Two lightweight scripts provide timing primitives for µs-accurate joins (sketched after this list):

        1. Boot epoch (startup, once):
          Compute BOOT_EPOCH_NS from recent journald "receipt" pairs:
          (__REALTIME_TIMESTAMP_us - __MONOTONIC_TIMESTAMP_us) * 1000 → ns
          Bounded window + (trimmed) median → export BOOT_EPOCH_NS/BOOT_EPOCH_US.
        2. Kernel offset (periodic, e.g., ~30 min):
          From last-N kernel entries this boot:
          (_SOURCE_MONOTONIC_TIMESTAMP_us - __MONOTONIC_TIMESTAMP_us) * 1000 → ns
          Trimmed median; if too few samples, reuse previous (no synthetic printk bursts). Export KERNEL_OFFSET_NS.
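
      A minimal sketch of both scripts, assuming journalctl with JSON output is available; the window size, trim fraction, and minimum sample count are illustrative, while the field names are standard journald fields.

          # Boot-epoch and kernel-offset computation from recent journal entries.
          import json
          import statistics
          import subprocess

          def _journal_json(args):
              out = subprocess.run(["journalctl", "-o", "json", *args],
                                   capture_output=True, text=True, check=True).stdout
              return [json.loads(line) for line in out.splitlines() if line.strip()]

          def _trimmed_median(samples, trim=0.1):
              s = sorted(samples)
              k = int(len(s) * trim)
              return statistics.median(s[k:len(s) - k] or s)

          def boot_epoch_ns(window=200):
              """(__REALTIME_TIMESTAMP - __MONOTONIC_TIMESTAMP) in us, * 1000 -> ns."""
              entries = _journal_json(["-b", "-n", str(window)])
              deltas = [(int(e["__REALTIME_TIMESTAMP"]) - int(e["__MONOTONIC_TIMESTAMP"])) * 1000
                        for e in entries]
              return int(_trimmed_median(deltas))

          def kernel_offset_ns(window=200, previous=None):
              """(_SOURCE_MONOTONIC_TIMESTAMP - __MONOTONIC_TIMESTAMP) in us, * 1000 -> ns."""
              entries = _journal_json(["-b", "-k", "-n", str(window)])
              deltas = [(int(e["_SOURCE_MONOTONIC_TIMESTAMP"]) - int(e["__MONOTONIC_TIMESTAMP"])) * 1000
                        for e in entries if "_SOURCE_MONOTONIC_TIMESTAMP" in e]
              if len(deltas) < 10:   # too few kernel lines: reuse the previous offset, no synthetic printk
                  return previous
              return int(_trimmed_median(deltas))

      The resulting values would then be exported as BOOT_EPOCH_NS / KERNEL_OFFSET_NS and attached to each shipped entry as the packed fields described earlier.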

      Enables

        • Userspace ↔ printk: src_mono_us + kernel_offset_ns.
        • Userspace/kernel ↔ trace timestamps: boot_epoch_ns + monotonic.
        • Cross-boot ordering: sort by packed boot_epoch_ns across different boot_ids.
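
      The joins then reduce to simple arithmetic over the packed fields; a sketch of the conversions follows (function and variable names are illustrative):

          # journald entry -> wall clock: boot_epoch_ns anchors the monotonic axis.
          def wall_ns_from_journal(boot_epoch_ns, journal_mono_us):
              return boot_epoch_ns + journal_mono_us * 1000

          # printk line -> wall clock: kernel_offset_ns was defined as
          # (src_mono_us - journal_mono_us) * 1000, so subtract it back out.
          def wall_ns_from_printk(boot_epoch_ns, kernel_offset_ns, src_mono_us):
              return boot_epoch_ns + src_mono_us * 1000 - kernel_offset_ns

          # Cross-boot ordering: sort entries by absolute wall-clock time derived
          # from each entry's own boot_epoch_ns.
          def sort_across_boots(entries):
              return sorted(entries,
                            key=lambda e: wall_ns_from_journal(e["boot_epoch_ns"], e["mono_us"]))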

          Tracing layer (Perfetto / ftrace)

        • Backend: upstream ftrace (no LTTng drivers to port/maintain).
        • Collector: Perfetto (traced/traced_probes) using ftrace/perf; optional custom scraper for targeted events (e.g., sched_switch, irq_*, softirq_*, timerlat/osnoise).
        • Artifacts: store .perfetto-trace files; emit a small Loki line with pointers (host, boot_id, time range).
        • UIs: Perfetto UI (primary), Trace Compass optional.
        • Overhead controls: narrow event sets, bounded buffers, short capture windows; pin readers to the monitoring core. 
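
      For the artifact pointer, a sketch that registers a finished capture in Loki via its push API; the label names, line contents, and endpoint URL are assumptions.

          # Announce a finished Perfetto capture so it can be located from Grafana
          # by host, boot_id, and time range.
          import json
          import time
          import urllib.request

          def announce_trace(loki_url, host, boot_id, trace_path, start_ns, end_ns):
              line = json.dumps({
                  "artifact": "perfetto-trace",
                  "path": trace_path,
                  "start_ns": start_ns,
                  "end_ns": end_ns,
              })
              payload = {
                  "streams": [{
                      "stream": {"host": host, "boot_id": boot_id, "transport": "trace"},
                      "values": [[str(time.time_ns()), line]],
                  }]
              }
              req = urllib.request.Request(
                  f"{loki_url}/loki/api/v1/push",
                  data=json.dumps(payload).encode(),
                  headers={"Content-Type": "application/json"},
                  method="POST",
              )
              urllib.request.urlopen(req).close()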

          Metrics layer (Prometheus)

        • Exporter content (RT/HPC-focused):
          • Scheduler/pressure: /proc/pressure/*, runqueue depth, context switches.
          • IRQ/softirq distribution: /proc/interrupts, softnet_stat.
          • CPU behavior: cpufreq/cpuidle residency, turbo/thermal throttling.
          • Memory/IO/network: NUMA stats, NVMe queue errors/latency hints, NIC queues (where accessible).
          • RT tools summary: rtla osnoise/timerlat rolled up to gauges/counters (no heavy streams).
        • Scrape: modest interval (e.g., 2-5s), pinned to a monitoring core; keep series/cardinality in check.
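
      A minimal exporter sketch using the prometheus_client library, exposing one PSI gauge as an example; the metric name, port, and update interval are assumptions.

          # Tiny RT/HPC exporter: parse /proc/pressure/cpu and expose the "some avg10" value.
          import time
          from prometheus_client import Gauge, start_http_server

          CPU_PSI_SOME_AVG10 = Gauge("node_psi_cpu_some_avg10",
                                     "CPU pressure, some, 10s average")

          def read_cpu_psi_some_avg10():
              # /proc/pressure/cpu: "some avg10=0.12 avg60=0.08 avg300=0.05 total=12345"
              with open("/proc/pressure/cpu") as f:
                  for line in f:
                      if line.startswith("some"):
                          fields = dict(kv.split("=") for kv in line.split()[1:])
                          return float(fields["avg10"])
              return 0.0

          if __name__ == "__main__":
              start_http_server(9101)        # scrape target for Prometheus
              while True:
                  CPU_PSI_SOME_AVG10.set(read_cpu_psi_some_avg10())
                  time.sleep(2)              # matches the modest 2-5s scrape cadence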

          Low-latency deployment plan (PREEMPT_RT & HPC)

        • Collector CPU placement
          • Most nodes: collectors can run without a dedicated monitoring core if latency budgets allow (ingest path is intentionally light).
          • Ultra-low-latency nodes: reserve a monitoring core and pin all scrapers/collectors to it to avoid RT thread interference.
            Examples: CPUAffinity/AllowedCPUs, Nice, IOSchedulingClass=idle, cpuset cgroup; optionally isolcpus, rcu_nocbs, IRQ affinity for RT cores.
        • Backend placement
          • Preferred: run Loki/Grafana off-box to keep storage/compaction/network IO away from RT workloads.
          • Edge co-location: if needed on the same host, pin to a separate monitoring core and cap background work (CPU/IO quotas, WAL/compactor throttles).

      Rationale: bounded, infrequent journal scans; minimal per-line work; small label set → low CPU/mem at ingest and predictable jitter.
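
      Where pinning has to happen in-process rather than through the unit file, the same idea can be expressed directly; the core number below is an assumption chosen per node layout.

          # Restrict the collector to the monitoring core and deprioritize it,
          # mirroring CPUAffinity/AllowedCPUs and Nice from the unit-file approach.
          import os

          MONITORING_CORE = 3                          # example monitoring core

          os.sched_setaffinity(0, {MONITORING_CORE})   # pid 0 = this process
          os.nice(10)                                  # lower priority vs. RT workloads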


      Benchmarking (to validate)

        • Scheduling/latency: cyclictest p99/p999 on RT cores; PSI (CPU some/avg10); context switches.
        • Collector overhead: per-source CPU%, mem, backpressure (WAL queues, drops).
        • IO impact: disk IO latency during compaction/ingest; NIC interrupts/softirq distribution.
        • Query cost: dashboards that parse packed JSON vs label-only filters.
          Target: no measurable degradation of RT p99/p999; collectors stay low single-digit % on the monitoring core.
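
      For the acceptance check itself, a small sketch that turns collected latency samples into p99/p999 values against thresholds; how the samples are gathered (e.g., from cyclictest output) is left out.

          # Percentile-based pass/fail against the latency budget.
          from statistics import quantiles

          def p99_p999(samples_us):
              q = quantiles(samples_us, n=1000)   # 999 cut points
              return q[989], q[998]               # ~p99 and ~p99.9

          def passes(samples_us, p99_limit_us, p999_limit_us):
              p99, p999 = p99_p999(samples_us)
              return p99 <= p99_limit_us and p999 <= p999_limit_us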

          Querying pattern

        1. Filter by labels (e.g., {host="...", boot_id="...", transport="kernel"}).
        2. Parse packed JSON for boot_epoch_ns, kernel_offset_ns, src_mono_us as needed.
        3. Sort across reboots by packed boot_epoch_ns.
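
      A sketch of that pattern against Loki's query_range API; it assumes each line is the packed JSON object shown earlier, and the URL, limit, and selector values are illustrative.

          # Fetch kernel lines for one host/boot, parse the packed JSON, and order
          # entries by boot_epoch_ns so different boots interleave correctly.
          import json
          import urllib.parse
          import urllib.request

          def query_packed(loki_url, host, boot_id, start_ns, end_ns, limit=5000):
              selector = f'{{host="{host}", boot_id="{boot_id}", transport="kernel"}}'
              params = urllib.parse.urlencode({
                  "query": selector,
                  "start": start_ns,
                  "end": end_ns,
                  "limit": limit,
              })
              url = f"{loki_url}/loki/api/v1/query_range?{params}"
              with urllib.request.urlopen(url) as resp:
                  result = json.load(resp)["data"]["result"]
              entries = []
              for stream in result:
                  for ts_ns, line in stream["values"]:
                      packed = json.loads(line)
                      entries.append((packed["boot_epoch_ns"], int(ts_ns), packed))
              return sorted(entries, key=lambda t: (t[0], t[1]))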

          ML dataset plan

        • Sources: Loki (logs with packed timing), Prometheus (RT/HPC metrics), Perfetto traces, rtla outputs, and run/profiling results (e.g., rteval) captured into PostgreSQL metadata.
        • Unification: normalize all timestamps to a shared monotonic + boot_epoch axis; build time-windowed feature matrices around events of interest.
        • Storage: metadata in PostgreSQL; bulk features in Parquet/columnar; artifacts referenced by URI.
        • Modeling: train a base model on the unified dataset; apply LoRA fine-tuning for customer-specific workloads.
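
      A sketch of the unification step, assuming log and metric records have already been reduced to rows carrying an absolute nanosecond timestamp ("wall_ns", i.e. boot_epoch_ns + monotonic); the column names, join tolerance, and Parquet engine (pyarrow) are assumptions.

          # Align metrics to log events on a shared absolute-time axis and persist
          # the time-windowed features as Parquet for later training.
          import pandas as pd

          def build_features(log_rows, metric_rows, out_path, tolerance_ms=50):
              logs = pd.DataFrame(log_rows).sort_values("wall_ns")
              metrics = pd.DataFrame(metric_rows).sort_values("wall_ns")
              joined = pd.merge_asof(
                  logs, metrics,
                  on="wall_ns",
                  direction="nearest",
                  tolerance=int(tolerance_ms * 1e6),   # nanoseconds
              )
              joined.to_parquet(out_path, index=False)
              return joined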

          Deliverables

        • Startup script exporting BOOT_EPOCH_NS (+ one audit log line).
        • Periodic offset script exporting KERNEL_OFFSET_NS.
        • Log pipeline with minimal labels and per-entry packed JSON timing fields.
        • Tracing collector (Perfetto config) and artifact handling.
        • Lightweight metrics exporter for RT/HPC signals + Prometheus scrape config.
        • Deployment guidance for with/without a monitoring core and off-box vs edge backends.
        • Benchmark plan and acceptance thresholds for RT/HPC nodes.
