Type: Task
Resolution: Unresolved
Priority: Minor
This ticket is extracted from https://issues.redhat.com/browse/ROX-32316 so that ROX-32316 can be closed for the 4.10 release.
Here we continue the work on ACK/NACK-driven fairness.
VM Index Report Rate Limiting with ACK/NACK
Problem Statement
The Scanner V4 matcher can only process ~3 VM index reports per second. When this limit is exceeded, reports queue up indefinitely in Central's unbounded queue, leading to OOM kills (observed: a 15 GB heap holding 15k queued messages).
Solution Overview
Implement token-bucket rate limiting for VM index reports in Central with per-sensor fairness, ACK/NACK feedback to Sensor, and exponential backoff retry in Compliance container.
Key Requirements
- Equal-share fairness across sensors (each gets 1/N of global capacity)
- 5-second burst tolerance using token bucket algorithm
- NACK feedback when rate limit exceeded
- Configurable global rate limit (target: 3 req/sec; shipped disabled by default)
- Backward compatible with capability negotiation
Architecture Changes
Rate Limiting Flow:

Compliance → Sensor → Central (Rate Limiter) → Scanner V4
     ↑          ↑          │
     └── NACK ──┴── NACK ──┘  (rate limit exceeded)
     │
     └→ Retry (exponential backoff)
New Protobuf Message:
- Introduce SensorACK - generic acknowledgement message from Central to Sensor
- Deprecate NodeInventoryACK (kept for backward compatibility)
- Capability-based routing: centralsensor.SensorACKSupport
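A possible shape for the new message is sketched below; the field names and numbers are assumptions for illustration, not the final proto:

```protobuf
// Illustrative sketch of the proposed generic SensorACK message.
syntax = "proto3";

message SensorACK {
  enum Action {
    ACK = 0;   // message accepted
    NACK = 1;  // rejected (e.g. rate limit exceeded) -- retry with backoff
  }
  Action action = 1;
  string message_type = 2;  // generic across payloads: node inventory, node index, VM index
  string message_id = 3;    // correlates the ACK/NACK with the original report
}
```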
Implementation Phases (4 PRs)
PR #1: Foundation (~200 lines)
- Add SensorACK protobuf message
- Add capability constant SensorACKSupport
- Deprecate NodeInventoryACK with option deprecated = true
- Add environment variables: ROX_VM_INDEX_REPORT_RATE_LIMIT, ROX_VM_INDEX_REPORT_BURST_DURATION
- No behavior changes
PR #2: Rate Limiter + Central Integration (~500 lines)
- Implement pkg/virtualmachine/ratelimit/ratelimit.go using golang.org/x/time/rate
- Per-sensor token buckets with dynamic rebalancing
- Integrate into central/sensor/service/pipeline/virtualmachineindex/pipeline.go
- Capability-based message routing (SensorACK if capability present, NodeInventoryACK otherwise)
- Add Central metrics: vm_index_report_accepted_total, vm_index_report_rejected_total
- Rate limiting disabled by default (ROX_VM_INDEX_REPORT_RATE_LIMIT=0)
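The equal-share rebalancing in PR #2 can be sketched as a small manager that recomputes each sensor's share whenever a sensor connects or disconnects. With golang.org/x/time/rate the update would go through `Limiter.SetLimit`; the types and names below are illustrative:

```go
// Sketch of equal-share rebalancing: each connected sensor gets 1/N of the
// global rate, recomputed on connect/disconnect.
package main

import "fmt"

type sensorLimiter struct {
	ratePerSec float64 // current per-sensor share of the global rate
	burstSecs  float64 // burst window; burst capacity = ratePerSec * burstSecs
}

type fairnessManager struct {
	globalRate float64
	burstSecs  float64
	sensors    map[string]*sensorLimiter // keyed by cluster ID
}

func newFairnessManager(globalRate, burstSecs float64) *fairnessManager {
	return &fairnessManager{
		globalRate: globalRate,
		burstSecs:  burstSecs,
		sensors:    map[string]*sensorLimiter{},
	}
}

// rebalance gives each connected sensor an equal 1/N share of the global rate.
func (m *fairnessManager) rebalance() {
	n := float64(len(m.sensors))
	if n == 0 {
		return
	}
	share := m.globalRate / n
	for _, l := range m.sensors {
		l.ratePerSec = share // with x/time/rate: limiter.SetLimit(rate.Limit(share))
	}
}

func (m *fairnessManager) connect(clusterID string) {
	m.sensors[clusterID] = &sensorLimiter{burstSecs: m.burstSecs}
	m.rebalance()
}

func (m *fairnessManager) disconnect(clusterID string) {
	delete(m.sensors, clusterID)
	m.rebalance()
}

func main() {
	m := newFairnessManager(3, 5)
	m.connect("cluster-a")
	m.connect("cluster-b")
	m.connect("cluster-c")
	fmt.Println(m.sensors["cluster-a"].ratePerSec) // 3/3 = 1 req/s per sensor
	m.disconnect("cluster-c")
	fmt.Println(m.sensors["cluster-a"].ratePerSec) // 3/2 = 1.5 req/s per sensor
}
```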
PR #3: Sensor ACK/NACK Forwarding (~250 lines)
- Advertise SensorACKSupport capability in Sensor Hello
- Handle both SensorACK (new) and NodeInventoryACK (legacy) for backward compat
- Forward ACK/NACK from Central to Compliance container
- Add Sensor metrics: vm_index_report_ack_received_total
PR #4: Compliance Retry Logic (~250 lines)
- Implement exponential backoff retry in compliance/virtualmachines/roxagent/sender/sender.go
- Backoff parameters: 1s → 2s → 4s → 8s → ... (max 60s)
- Add Compliance metrics: vm_index_report_send_attempts, vm_index_report_send_duration_seconds
Backward Compatibility (Capability-Based)
| Central | Sensor | Capability | Message Sent | Result |
|---|---|---|---|---|
| 4.10 | 4.9 | Not advertised | NodeInventoryACK | Works (legacy path) |
| 4.10 | 4.10 | Advertised | SensorACK | Works (new path) |
| 4.9 | 4.10 | Advertised (ignored) | NodeInventoryACK | Works (legacy path) |
- Central checks conn.HasCapability(SensorACKSupport) before sending
- No crashes, no coordination needed, multi-tenant safe
- Can remove NodeInventoryACK after 2-3 releases
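The capability check that drives the table above can be sketched as follows. `HasCapability` and `SensorACKSupport` follow the ticket; the surrounding types are illustrative:

```go
// Sketch of Central's capability-based choice between the new generic
// SensorACK and the legacy NodeInventoryACK.
package main

import "fmt"

const SensorACKSupport = "SensorACKSupport"

type sensorConn struct {
	capabilities map[string]bool // capabilities advertised in Sensor Hello
}

func (c *sensorConn) HasCapability(name string) bool {
	return c.capabilities[name]
}

// ackMessageFor picks the wire message based on what the sensor advertised.
func ackMessageFor(conn *sensorConn) string {
	if conn.HasCapability(SensorACKSupport) {
		return "SensorACK" // new generic message (Sensor 4.10+)
	}
	return "NodeInventoryACK" // legacy fallback (Sensor 4.9 and older)
}

func main() {
	legacy := &sensorConn{capabilities: map[string]bool{}}
	current := &sensorConn{capabilities: map[string]bool{SensorACKSupport: true}}
	fmt.Println(ackMessageFor(legacy))  // NodeInventoryACK
	fmt.Println(ackMessageFor(current)) // SensorACK
}
```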
Technical Decisions
Rate Limiter:
- Reuse golang.org/x/time/rate.Limiter (proven implementation)
- Equal-share fairness: perSensorRate = globalRate / numSensors
- Dynamic rebalancing when sensors connect/disconnect
- Burst capacity: perSensorRate * burstDuration
Message Naming:
- SensorACK instead of ComplianceACK - more accurate since message goes to Sensor
- Generic design supports future message types (node inventory, node index, VM index)
Configuration:
- ROX_VM_INDEX_REPORT_RATE_LIMIT - requests per second (0 = unlimited, default: 0 for safe rollout)
- ROX_VM_INDEX_REPORT_BURST_DURATION - burst window in seconds (default: 5s)
Success Criteria
- Central no longer experiences OOM under burst VM index report traffic
- Each sensor receives fair share of capacity (observable via metrics)
- Rejected reports retry successfully with exponential backoff
- Scanner V4 processing stays under configured limit
- End-to-end latency acceptable (ACK within 10s for accepted reports)
Metrics Added
Central: accept/reject counters per cluster
Sensor: ACK/NACK forwarding counters
Compliance: retry attempts and duration histograms
Together these give full visibility into the entire flow, from Compliance → Sensor → Central → Scanner V4 and back.