RHEL-116328
System Runner: Observability Stack

    • Bug
    • Resolution: Unresolved
    • rhel-10.0
    • rteval
    • rhel-kernel-rts-time

      • Observability stack — RT/HPC-friendly design (Jira summary)

        Objectives

        • Human observability: Grafana dashboards/search for day-to-day ops.
        • Accurate time correlation: µs-grade joins across userspace logs, kernel printk, traces, and metrics.
        • Low overhead: Safe for PREEMPT_RT and HPC nodes.
        • ML dataset: Produce a unified, time-aligned corpus (logs, metrics, traces, profiler outputs) for training a domain model and later LoRA fine-tuning.

          Components

        • Ingest & storage: Loki (central log store).
        • Visualization: Grafana.
        • Log sources: Node system logs (journald) and, when applicable, CRI container logs (/var/log/containers/*.log). Only one driver per workload to avoid duplicates.
        • Tracing: Perfetto (ftrace/perf based) as primary; optional custom ftrace scraper for narrow event sets.
        • Metrics: Prometheus scraping a lightweight exporter (RT/HPC-specific stats only).

          Logging design (labels vs packed payload)

        • Labels (for fast lookup): minimal, stable set — host, boot_id, transport, severity (+ optional unit/app/container when useful).
        • Packed JSON (per entry): tiny object embedded alongside the line with only the high-value fields:
          • boot_epoch_ns — epoch when CLOCK_MONOTONIC started this boot.
          • kernel_offset_ns — offset between kernel source monotonic and journald monotonic.
          • src_mono_us — present on kernel lines for direct printk correlation.
          • Optional context: unit, app, container, environment hints.

      This keeps ingest light (no full journald JSON export) while preserving rich, on-demand query context.
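
      As an illustration, one shipped entry might look like the sketch below; the label values, example numbers, and envelope shape are assumptions, only the field names come from the design above.

          # Hypothetical shape of a single shipped entry: a minimal label set plus a
          # small packed JSON object carrying the timing fields described above.
          import json

          labels = {
              "host": "node01",              # example values, not real hosts
              "boot_id": "example-boot-id",
              "transport": "kernel",
              "severity": "warning",
          }

          packed = {
              "boot_epoch_ns": 1719223841000000000,  # epoch when CLOCK_MONOTONIC started this boot
              "kernel_offset_ns": 1250000,           # kernel source monotonic minus journald monotonic
              "src_mono_us": 123456789,              # present on kernel lines only
              "unit": "example.service",             # optional context
          }

          # The packed object travels inside the log line itself, not as labels.
          line = json.dumps({"msg": "example kernel message", **packed})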


      Offset generation & correlation

      Two lightweight scripts provide timing primitives for µs-accurate joins (sketched after this list):

        1. Boot epoch (startup, once):
          Compute BOOT_EPOCH_NS from recent journald "receipt" pairs:
          (__REALTIME_TIMESTAMP_us - __MONOTONIC_TIMESTAMP_us) * 1000 → ns
          Bounded window + (trimmed) median → export BOOT_EPOCH_NS/BOOT_EPOCH_US.
        2. Kernel offset (periodic, e.g., ~30 min):
          From last-N kernel entries this boot:
          (_SOURCE_MONOTONIC_TIMESTAMP_us - __MONOTONIC_TIMESTAMP_us) * 1000 → ns
          Trimmed median; if too few samples, reuse previous (no synthetic printk bursts). Export KERNEL_OFFSET_NS.
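
      A minimal sketch of both scripts, assuming journalctl with JSON output is available; the window size, trim fraction, and minimum sample count are illustrative, while the field names are standard journald fields.

          # Boot-epoch and kernel-offset computation from recent journal entries.
          import json
          import statistics
          import subprocess

          def _journal_json(args):
              out = subprocess.run(["journalctl", "-o", "json", *args],
                                   capture_output=True, text=True, check=True).stdout
              return [json.loads(line) for line in out.splitlines() if line.strip()]

          def _trimmed_median(samples, trim=0.1):
              s = sorted(samples)
              k = int(len(s) * trim)
              return statistics.median(s[k:len(s) - k] or s)

          def boot_epoch_ns(window=200):
              """(__REALTIME_TIMESTAMP - __MONOTONIC_TIMESTAMP) in us, * 1000 -> ns."""
              entries = _journal_json(["-b", "-n", str(window)])
              deltas = [(int(e["__REALTIME_TIMESTAMP"]) - int(e["__MONOTONIC_TIMESTAMP"])) * 1000
                        for e in entries]
              return int(_trimmed_median(deltas))

          def kernel_offset_ns(window=200, previous=None):
              """(_SOURCE_MONOTONIC_TIMESTAMP - __MONOTONIC_TIMESTAMP) in us, * 1000 -> ns."""
              entries = _journal_json(["-b", "-k", "-n", str(window)])
              deltas = [(int(e["_SOURCE_MONOTONIC_TIMESTAMP"]) - int(e["__MONOTONIC_TIMESTAMP"])) * 1000
                        for e in entries if "_SOURCE_MONOTONIC_TIMESTAMP" in e]
              if len(deltas) < 10:   # too few kernel lines: reuse the previous offset, no synthetic printk
                  return previous
              return int(_trimmed_median(deltas))

      The resulting values would then be exported as BOOT_EPOCH_NS / KERNEL_OFFSET_NS and attached to each shipped entry as the packed fields described earlier.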

      Enables

        • Userspace ↔ printk: src_mono_us + kernel_offset_ns.
        • Userspace/kernel ↔ trace timestamps: boot_epoch_ns + monotonic.
        • Cross-boot ordering: sort by packed boot_epoch_ns across different boot_ids.
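
      The joins then reduce to simple arithmetic over the packed fields; a sketch of the conversions follows (function and variable names are illustrative):

          # journald entry -> wall clock: boot_epoch_ns anchors the monotonic axis.
          def wall_ns_from_journal(boot_epoch_ns, journal_mono_us):
              return boot_epoch_ns + journal_mono_us * 1000

          # printk line -> wall clock: kernel_offset_ns was defined as
          # (src_mono_us - journal_mono_us) * 1000, so subtract it back out.
          def wall_ns_from_printk(boot_epoch_ns, kernel_offset_ns, src_mono_us):
              return boot_epoch_ns + src_mono_us * 1000 - kernel_offset_ns

          # Cross-boot ordering: sort entries by absolute wall-clock time derived
          # from each entry's own boot_epoch_ns.
          def sort_across_boots(entries):
              return sorted(entries,
                            key=lambda e: wall_ns_from_journal(e["boot_epoch_ns"], e["mono_us"]))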

          Tracing layer (Perfetto / ftrace)

        • Backend: upstream ftrace (no LTTng drivers to port/maintain).
        • Collector: Perfetto (traced/traced_probes) using ftrace/perf; optional custom scraper for targeted events (e.g., sched_switch, irq_*, softirq_*, timerlat/osnoise).
        • Artifacts: store .perfetto-trace files; emit a small Loki line with pointers (host, boot_id, time range).
        • UIs: Perfetto UI (primary), Trace Compass optional.
        • Overhead controls: narrow event sets, bounded buffers, short capture windows; pin readers to the monitoring core. 
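
      For the artifact pointer, a sketch that registers a finished capture in Loki via its push API; the label names, line contents, and endpoint URL are assumptions.

          # Announce a finished Perfetto capture so it can be located from Grafana
          # by host, boot_id, and time range.
          import json
          import time
          import urllib.request

          def announce_trace(loki_url, host, boot_id, trace_path, start_ns, end_ns):
              line = json.dumps({
                  "artifact": "perfetto-trace",
                  "path": trace_path,
                  "start_ns": start_ns,
                  "end_ns": end_ns,
              })
              payload = {
                  "streams": [{
                      "stream": {"host": host, "boot_id": boot_id, "transport": "trace"},
                      "values": [[str(time.time_ns()), line]],
                  }]
              }
              req = urllib.request.Request(
                  f"{loki_url}/loki/api/v1/push",
                  data=json.dumps(payload).encode(),
                  headers={"Content-Type": "application/json"},
                  method="POST",
              )
              urllib.request.urlopen(req).close()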

          Metrics layer (Prometheus)

        • Exporter content (RT/HPC-focused):
          • Scheduler/pressure: /proc/pressure/*, runqueue depth, context switches.
          • IRQ/softirq distribution: /proc/interrupts, softnet_stat.
          • CPU behavior: cpufreq/cpuidle residency, turbo/thermal throttling.
          • Memory/IO/network: NUMA stats, NVMe queue errors/latency hints, NIC queues (where accessible).
          • RT tools summary: rtla osnoise/timerlat rolled up to gauges/counters (no heavy streams).
        • Scrape: modest interval (e.g., 2-5s), pinned to a monitoring core; keep series/cardinality in check.
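
      A minimal exporter sketch using the prometheus_client library, exposing one PSI gauge as an example; the metric name, port, and update interval are assumptions.

          # Tiny RT/HPC exporter: parse /proc/pressure/cpu and expose the "some avg10" value.
          import time
          from prometheus_client import Gauge, start_http_server

          CPU_PSI_SOME_AVG10 = Gauge("node_psi_cpu_some_avg10",
                                     "CPU pressure, some, 10s average")

          def read_cpu_psi_some_avg10():
              # /proc/pressure/cpu: "some avg10=0.12 avg60=0.08 avg300=0.05 total=12345"
              with open("/proc/pressure/cpu") as f:
                  for line in f:
                      if line.startswith("some"):
                          fields = dict(kv.split("=") for kv in line.split()[1:])
                          return float(fields["avg10"])
              return 0.0

          if __name__ == "__main__":
              start_http_server(9101)        # scrape target for Prometheus
              while True:
                  CPU_PSI_SOME_AVG10.set(read_cpu_psi_some_avg10())
                  time.sleep(2)              # matches the modest 2-5s scrape cadence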

          Low-latency deployment plan (PREEMPT_RT & HPC)

        • Collector CPU placement
          • Most nodes: collectors can run without a dedicated monitoring core if latency budgets allow (ingest path is intentionally light).
          • Ultra-low-latency nodes: reserve a monitoring core and pin all scrapers/collectors to it to avoid RT thread interference.
            Examples: CPUAffinity/AllowedCPUs, Nice, IOSchedulingClass=idle, cpuset cgroup; optionally isolcpus, rcu_nocbs, IRQ affinity for RT cores.
        • Backend placement
          • Preferred: run Loki/Grafana off-box to keep storage/compaction/network IO away from RT workloads.
          • Edge co-location: if needed on the same host, pin to a separate monitoring core and cap background work (CPU/IO quotas, WAL/compactor throttles).

      Rationale: bounded, infrequent journal scans; minimal per-line work; small label set → low CPU/mem at ingest and predictable jitter.
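
      Where pinning has to happen in-process rather than through the unit file, the same idea can be expressed directly; the core number below is an assumption chosen per node layout.

          # Restrict the collector to the monitoring core and deprioritize it,
          # mirroring CPUAffinity/AllowedCPUs and Nice from the unit-file approach.
          import os

          MONITORING_CORE = 3                          # example monitoring core

          os.sched_setaffinity(0, {MONITORING_CORE})   # pid 0 = this process
          os.nice(10)                                  # lower priority vs. RT workloads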


      Benchmarking (to validate)

        • Scheduling/latency: cyclictest p99/p999 on RT cores; PSI (CPU some/avg10); context switches.
        • Collector overhead: per-source CPU%, mem, backpressure (WAL queues, drops).
        • IO impact: disk IO latency during compaction/ingest; NIC interrupts/softirq distribution.
        • Query cost: dashboards that parse packed JSON vs label-only filters.
          Target: no measurable degradation of RT p99/p999; collectors stay low single-digit % on the monitoring core.
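
      For the acceptance check itself, a small sketch that turns collected latency samples into p99/p999 values against thresholds; how the samples are gathered (e.g., from cyclictest output) is left out.

          # Percentile-based pass/fail against the latency budget.
          from statistics import quantiles

          def p99_p999(samples_us):
              q = quantiles(samples_us, n=1000)   # 999 cut points
              return q[989], q[998]               # ~p99 and ~p99.9

          def passes(samples_us, p99_limit_us, p999_limit_us):
              p99, p999 = p99_p999(samples_us)
              return p99 <= p99_limit_us and p999 <= p999_limit_us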

          Querying pattern

        1. Filter by labels (e.g., {host="...", boot_id="...", transport="kernel"}).
        2. Parse packed JSON for boot_epoch_ns, kernel_offset_ns, src_mono_us as needed.
        3. Sort across reboots by packed boot_epoch_ns.
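
      A sketch of that pattern against Loki's query_range API; it assumes each line is the packed JSON object shown earlier, and the URL, limit, and selector values are illustrative.

          # Fetch kernel lines for one host/boot, parse the packed JSON, and order
          # entries by boot_epoch_ns so different boots interleave correctly.
          import json
          import urllib.parse
          import urllib.request

          def query_packed(loki_url, host, boot_id, start_ns, end_ns, limit=5000):
              selector = f'{{host="{host}", boot_id="{boot_id}", transport="kernel"}}'
              params = urllib.parse.urlencode({
                  "query": selector,
                  "start": start_ns,
                  "end": end_ns,
                  "limit": limit,
              })
              url = f"{loki_url}/loki/api/v1/query_range?{params}"
              with urllib.request.urlopen(url) as resp:
                  result = json.load(resp)["data"]["result"]
              entries = []
              for stream in result:
                  for ts_ns, line in stream["values"]:
                      packed = json.loads(line)
                      entries.append((packed["boot_epoch_ns"], int(ts_ns), packed))
              return sorted(entries, key=lambda t: (t[0], t[1]))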

          ML dataset plan

        • Sources: Loki (logs with packed timing), Prometheus (RT/HPC metrics), Perfetto traces, rtla outputs, and run/profiling results (e.g., rteval) captured into PostgreSQL metadata.
        • Unification: normalize all timestamps to a shared monotonic + boot_epoch axis; build time-windowed feature matrices around events of interest.
        • Storage: metadata in PostgreSQL; bulk features in Parquet/columnar; artifacts referenced by URI.
        • Modeling: train a base model on the unified dataset; apply LoRA fine-tuning for customer-specific workloads.
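
      A sketch of the unification step, assuming log and metric records have already been reduced to rows carrying an absolute nanosecond timestamp ("wall_ns", i.e. boot_epoch_ns + monotonic); the column names, join tolerance, and Parquet engine (pyarrow) are assumptions.

          # Align metrics to log events on a shared absolute-time axis and persist
          # the time-windowed features as Parquet for later training.
          import pandas as pd

          def build_features(log_rows, metric_rows, out_path, tolerance_ms=50):
              logs = pd.DataFrame(log_rows).sort_values("wall_ns")
              metrics = pd.DataFrame(metric_rows).sort_values("wall_ns")
              joined = pd.merge_asof(
                  logs, metrics,
                  on="wall_ns",
                  direction="nearest",
                  tolerance=int(tolerance_ms * 1e6),   # nanoseconds
              )
              joined.to_parquet(out_path, index=False)
              return joined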

          Deliverables

        • Startup script exporting BOOT_EPOCH_NS (+ one audit log line).
        • Periodic offset script exporting KERNEL_OFFSET_NS.
        • Log pipeline with minimal labels and per-entry packed JSON timing fields.
        • Tracing collector (Perfetto config) and artifact handling.
        • Lightweight metrics exporter for RT/HPC signals + Prometheus scrape config.
        • Deployment guidance for with/without a monitoring core and off-box vs edge backends.
        • Benchmark plan and acceptance thresholds for RT/HPC nodes.
