Bug | Resolution: Unresolved | rhel-10.0 | rhel-kernel-rts-time
Goals
- Human observability: Grafana dashboards/search for day-to-day ops.
- Accurate time correlation: µs-grade joins across userspace logs, kernel printk, traces, and metrics.
- Low overhead: safe for PREEMPT_RT and HPC nodes.
Stack
- Ingest & storage: Loki (central log store).
- Visualization: Grafana.
- Log sources: node system logs (journald) and, when applicable, CRI container logs (/var/log/containers/*.log). Only one driver per workload, to avoid duplicates.
- Tracing: Perfetto (ftrace/perf based) as primary; optional custom ftrace scraper for narrow event sets.
- Labels (for fast lookup): minimal, stable set — host, boot_id, transport, severity (+ optional unit/app/container when useful).
- Packed JSON (per entry): tiny object embedded alongside the line with only the high-value fields (example below):
  - boot_epoch_ns — epoch when CLOCK_MONOTONIC started this boot.
  - kernel_offset_ns — offset between kernel source monotonic and journald monotonic.
  - src_mono_us — present on kernel lines for direct printk correlation.
  - Optional context: unit, app, container, environment hints.
This keeps ingest light (no full journald JSON export) while preserving rich, on-demand query context.
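For illustration, a packed object on a kernel line might look like this (values are invented for the example):

```json
{"boot_epoch_ns": 1736418000123456789, "kernel_offset_ns": -41000, "src_mono_us": 8123456789}
```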
Offset generation & correlation
Two lightweight scripts provide timing primitives for µs-accurate joins (a sketch follows the list):
- Boot epoch (startup, once):
  Compute BOOT_EPOCH_NS from recent journald "receipt" pairs:
  (__REALTIME_TIMESTAMP_us - __MONOTONIC_TIMESTAMP_us) * 1000 → ns.
  Bounded window + trimmed median → export BOOT_EPOCH_NS/BOOT_EPOCH_US.
- Kernel offset (periodic, e.g., ~30 min):
  From the last N kernel entries this boot:
  (_SOURCE_MONOTONIC_TIMESTAMP_us - __MONOTONIC_TIMESTAMP_us) * 1000 → ns.
  Trimmed median; if too few samples, reuse the previous value (no synthetic printk bursts). Export KERNEL_OFFSET_NS.
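A minimal Python sketch of both computations, reading journalctl's JSON output. The journald field names are real; the sample count, trim fraction, and minimum-sample threshold are illustrative assumptions:

```python
#!/usr/bin/env python3
import json
import statistics
import subprocess

def journal_entries(extra_args, n=200):
    """Yield the last n journal entries of this boot as dicts (journalctl -o json)."""
    out = subprocess.run(
        ["journalctl", "-b", "-o", "json", "-n", str(n)] + extra_args,
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.splitlines():
        yield json.loads(line)

def trimmed_median(samples, trim=0.1):
    """Drop the top/bottom `trim` fraction of samples, then take the median."""
    s = sorted(samples)
    k = int(len(s) * trim)
    core = s[k: len(s) - k] or s
    return statistics.median(core)

def boot_epoch_ns():
    """Epoch (ns) at which CLOCK_MONOTONIC started this boot."""
    deltas = [
        (int(e["__REALTIME_TIMESTAMP"]) - int(e["__MONOTONIC_TIMESTAMP"])) * 1000
        for e in journal_entries([])
        if "__MONOTONIC_TIMESTAMP" in e
    ]
    return int(trimmed_median(deltas))

def kernel_offset_ns():
    """Offset (ns) between kernel source monotonic and journald monotonic,
    from kernel-transport entries; None if too few samples (caller reuses
    the previous value rather than forcing synthetic printk bursts)."""
    deltas = [
        (int(e["_SOURCE_MONOTONIC_TIMESTAMP"]) - int(e["__MONOTONIC_TIMESTAMP"])) * 1000
        for e in journal_entries(["-k"])
        if "_SOURCE_MONOTONIC_TIMESTAMP" in e
    ]
    if len(deltas) < 10:  # assumed minimum-sample threshold
        return None
    return int(trimmed_median(deltas))

if __name__ == "__main__":
    print(f"BOOT_EPOCH_NS={boot_epoch_ns()}")
    off = kernel_offset_ns()
    if off is not None:
        print(f"KERNEL_OFFSET_NS={off}")
```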
Enables
- Userspace ↔ printk: src_mono_us + kernel_offset_ns.
- Userspace/kernel ↔ trace timestamps: boot_epoch_ns + monotonic (a worked sketch follows).
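A worked sketch of the two joins; function and variable names are illustrative, the arithmetic follows from the offset definitions above:

```python
def printk_epoch_ns(src_mono_us: int, kernel_offset_ns: int, boot_epoch_ns: int) -> int:
    """Map a kernel line's _SOURCE_MONOTONIC_TIMESTAMP (µs) onto the epoch axis.
    kernel_offset_ns = (src_mono - journald_mono) * 1000, so subtract it first."""
    journald_mono_ns = src_mono_us * 1000 - kernel_offset_ns
    return boot_epoch_ns + journald_mono_ns

def trace_epoch_ns(trace_mono_ns: int, boot_epoch_ns: int) -> int:
    """Map a CLOCK_MONOTONIC trace timestamp (ns) onto the epoch axis."""
    return boot_epoch_ns + trace_mono_ns
```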
Tracing
- Backend: upstream ftrace (no LTTng drivers to port/maintain).
- Collector: Perfetto (traced/traced_probes) using ftrace/perf; optional custom scraper for targeted events (e.g., sched_switch, irq_*, softirq_*, timerlat/osnoise).
- Artifacts: store .perfetto-trace files; emit a small Loki line with pointers (host, boot_id, time range); see the sketch after this list.
- UIs: Perfetto UI (primary), Trace Compass optional.
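A sketch of the artifact-pointer emission, using Loki's standard /loki/api/v1/push endpoint. The label set follows the minimal labels above; the transport="trace" value and the payload field names are assumptions of this sketch:

```python
import json
import time
import urllib.request

def push_trace_pointer(loki_url, host, boot_id, artifact_uri, start_ns, end_ns):
    """Emit one small Loki line pointing at a stored .perfetto-trace file."""
    line = json.dumps({
        "event": "perfetto_trace",
        "artifact": artifact_uri,        # e.g., file or object-store URI
        "range_ns": [start_ns, end_ns],  # time window covered by the trace
    })
    payload = {
        "streams": [{
            "stream": {"host": host, "boot_id": boot_id, "transport": "trace"},
            "values": [[str(time.time_ns()), line]],
        }]
    }
    req = urllib.request.Request(
        f"{loki_url}/loki/api/v1/push",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```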
Metrics
- Exporter content (RT/HPC-focused); a minimal exporter sketch follows this list:
  - Scheduler/pressure: /proc/pressure/*, runqueue depth, context switches.
  - IRQ/softirq distribution: /proc/interrupts, softnet_stat.
  - CPU behavior: cpufreq/cpuidle residency, turbo/thermal throttling.
  - Memory/IO/network: NUMA stats, NVMe queue errors/latency hints, NIC queues (where accessible).
  - RT tools summary: rtla osnoise/timerlat rolled up to gauges/counters (no heavy streams).
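A minimal sketch of one such signal (CPU pressure from /proc/pressure/cpu), using the prometheus_client library; the metric name, port, and poll interval are illustrative choices:

```python
import time
from prometheus_client import Gauge, start_http_server

PSI_CPU_SOME_AVG10 = Gauge(
    "node_psi_cpu_some_avg10",
    "CPU pressure (some, 10s average) from /proc/pressure/cpu",
)

def read_psi_cpu_some_avg10() -> float:
    # /proc/pressure/cpu lines look like:
    #   some avg10=0.00 avg60=0.00 avg300=0.00 total=12345
    with open("/proc/pressure/cpu") as f:
        for line in f:
            fields = line.split()
            if fields[0] == "some":
                return float(fields[1].split("=")[1])
    return 0.0

if __name__ == "__main__":
    start_http_server(9101)  # Prometheus scrape target; port is an assumption
    while True:
        PSI_CPU_SOME_AVG10.set(read_psi_cpu_some_avg10())
        time.sleep(15)
```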
Collector CPU placement
- Most nodes: collectors can run without a dedicated monitoring core if latency budgets allow (the ingest path is intentionally light).
- Ultra-low-latency nodes: reserve a monitoring core and pin all scrapers/collectors to it to avoid RT thread interference.
  Examples: CPUAffinity/AllowedCPUs, Nice, IOSchedulingClass=idle, cpuset cgroup; optionally isolcpus, rcu_nocbs, IRQ affinity for RT cores. A sample drop-in follows.
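One possible systemd drop-in for pinning a collector to the monitoring core. The unit name, path, core number, and quota are examples only; the directives themselves are standard systemd settings:

```ini
# /etc/systemd/system/alloy.service.d/monitoring-core.conf  (unit/path are examples)
[Service]
CPUAffinity=0
AllowedCPUs=0
Nice=10
IOSchedulingClass=idle
CPUQuota=20%
```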
Backend placement
- Preferred: run Loki/Grafana off-box to keep storage/compaction/network IO away from RT workloads.
- Edge co-location: if needed on the same host, pin to a separate monitoring core and cap background work (CPU/IO quotas, WAL/compactor throttles).
Rationale: bounded, infrequent journal scans; minimal per-line work; small label set → low CPU/mem at ingest and predictable jitter.
Benchmarking (to validate)
- Scheduling/latency: cyclictest p99/p999 on RT cores; PSI (CPU some/avg10); context switches.
- Collector overhead: per-source CPU%, mem, backpressure (WAL queues, drops).
- IO impact: disk IO latency during compaction/ingest; NIC interrupts/softirq distribution.
Querying
- Filter by labels (e.g., {host="...", boot_id="...", transport="kernel"}).
- Parse packed JSON for boot_epoch_ns, kernel_offset_ns, src_mono_us as needed; an example query follows.
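For example, a LogQL lookup over the packed fields might look like this (label values are placeholders):

```logql
{host="node01", boot_id="...", transport="kernel"} | json | src_mono_us != ""
```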
Data pipeline
- Sources: Loki (logs with packed timing), Prometheus (RT/HPC metrics), Perfetto traces, rtla outputs, and run/profiling results (e.g., rteval) captured into PostgreSQL metadata.
- Unification: normalize all timestamps to a shared monotonic + boot_epoch axis; build time-windowed feature matrices around events of interest (a sketch follows this list).
- Storage: metadata in PostgreSQL; bulk features in Parquet/columnar; artifacts referenced by URI.
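A sketch of the unification step, assuming pandas; the column names (mono_ns, epoch_ns) and function names are illustrative:

```python
import pandas as pd

def to_epoch_ns(df: pd.DataFrame, boot_epoch_ns: int) -> pd.DataFrame:
    """Add an epoch_ns column to rows carrying CLOCK_MONOTONIC timestamps (ns)."""
    df = df.copy()
    df["epoch_ns"] = boot_epoch_ns + df["mono_ns"]
    return df

def window_features(df: pd.DataFrame, event_ns: int, half_width_ns: int) -> pd.DataFrame:
    """Slice a time window around an event of interest for feature building."""
    lo, hi = event_ns - half_width_ns, event_ns + half_width_ns
    return df[(df["epoch_ns"] >= lo) & (df["epoch_ns"] <= hi)]
```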
Deliverables
- Startup script exporting BOOT_EPOCH_NS (+ one audit log line).
- Periodic offset script exporting KERNEL_OFFSET_NS.
- Log pipeline with minimal labels and per-entry packed JSON timing fields.
- Tracing collector (Perfetto config) and artifact handling.
- Lightweight metrics exporter for RT/HPC signals + Prometheus scrape config.
- Deployment guidance for with/without a monitoring core and off-box vs. edge backends.
- Benchmark plan and acceptance thresholds for RT/HPC nodes.
Links
- split from: RHEL-115498 System Runner: Observability Stack - Deploy Perfetto (New)
- RHEL-116336 System Runner Observability Stack: Alloy to Loki (Closed)