Red Hat Advanced Cluster Security / ROX-32848

ACK-based fairness in VM Index Report Rate Limiting

    • Type: Task
    • Priority: Minor
    • Resolution: Unresolved
    • Component: ACS Virt Support
    • Work type: Product / Portfolio Work

      This ticket was extracted from https://issues.redhat.com/browse/ROX-32316 so that ROX-32316 could be closed for the 4.10 release.

      Here, we continue the work on ACK/NACK-driven fairness.

      VM Index Report Rate Limiting with ACK/NACK

      Problem Statement

      The Scanner V4 matcher can process only ~3 VM index reports per second. When this limit is exceeded, reports queue up indefinitely in Central's unbounded queue, leading to OOM kills (observed: a 15 GB heap holding 15k queued messages).

      Solution Overview

      Implement token-bucket rate limiting for VM index reports in Central, with per-sensor fairness, ACK/NACK feedback to Sensor, and exponential-backoff retries in the Compliance container.

      Key Requirements

      • Equal-share fairness across sensors (each gets 1/N of global capacity)
      • 5-second burst tolerance using token bucket algorithm
      • NACK feedback when rate limit exceeded
      • Configurable global rate limit (default: 3 req/sec, disabled initially)
      • Backward compatible with capability negotiation

      Architecture Changes

      Rate Limiting Flow:

      Compliance → Sensor → Central (Rate Limiter) → Scanner V4
                      ↓              ↓
                  NACK ← ← ← ← ← ← NACK (rate exceeded)
                      ↓
                Retry (exponential backoff)
      

      New Protobuf Message:

      • Introduce SensorACK - generic acknowledgement message from Central to Sensor
      • Deprecate NodeInventoryACK (kept for backward compatibility)
      • Capability-based routing: centralsensor.SensorACKSupport

      Implementation Phases (4 PRs)

      PR #1: Foundation (~200 lines)

      • Add SensorACK protobuf message
      • Add capability constant SensorACKSupport
      • Deprecate NodeInventoryACK with option deprecated = true
      • Add environment variables: ROX_VM_INDEX_REPORT_RATE_LIMIT, ROX_VM_INDEX_REPORT_BURST_DURATION
      • No behavior changes

      PR #2: Rate Limiter + Central Integration (~500 lines)

      • Implement pkg/virtualmachine/ratelimit/ratelimit.go using golang.org/x/time/rate
      • Per-sensor token buckets with dynamic rebalancing
      • Integrate into central/sensor/service/pipeline/virtualmachineindex/pipeline.go
      • Capability-based message routing (SensorACK if capability present, NodeInventoryACK otherwise)
      • Add Central metrics: vm_index_report_accepted_total, vm_index_report_rejected_total
      • Rate limiting disabled by default (ROX_VM_INDEX_REPORT_RATE_LIMIT=0)

      PR #3: Sensor ACK/NACK Forwarding (~250 lines)

      • Advertise SensorACKSupport capability in Sensor Hello
      • Handle both SensorACK (new) and NodeInventoryACK (legacy) for backward compat
      • Forward ACK/NACK from Central to Compliance container
      • Add Sensor metrics: vm_index_report_ack_received_total

      PR #4: Compliance Retry Logic (~250 lines)

      • Implement exponential backoff retry in compliance/virtualmachines/roxagent/sender/sender.go
      • Backoff parameters: 1s → 2s → 4s → 8s → ... (max 60s)
      • Add Compliance metrics: vm_index_report_send_attempts, vm_index_report_send_duration_seconds

      Backward Compatibility (Capability-Based)

      Central  Sensor  Capability            Message sent      Result
      4.10     4.9     Not advertised        NodeInventoryACK  Works (legacy)
      4.10     4.10    Advertised            SensorACK         Works (new)
      4.9      4.10    Advertised (ignored)  NodeInventoryACK  Works
      • Central checks conn.HasCapability(SensorACKSupport) before sending
      • No crashes, no coordination needed, multi-tenant safe
      • Can remove NodeInventoryACK after 2-3 releases

      Technical Decisions

      Rate Limiter:

      • Reuse golang.org/x/time/rate.Limiter (proven implementation)
      • Equal-share fairness: perSensorRate = globalRate / numSensors
      • Dynamic rebalancing when sensors connect/disconnect
      • Burst capacity: perSensorRate * burstDuration

      Message Naming:

      • SensorACK rather than ComplianceACK: more accurate, since the message is delivered to Sensor
      • Generic design supports future message types (node inventory, node index, VM index)

      Configuration:

      • ROX_VM_INDEX_REPORT_RATE_LIMIT - requests per second (0 = unlimited, default: 0 for safe rollout)
      • ROX_VM_INDEX_REPORT_BURST_DURATION - burst window in seconds (default: 5s)

      Success Criteria

      • Central no longer experiences OOM under burst VM index report traffic
      • Each sensor receives fair share of capacity (observable via metrics)
      • Rejected reports retry successfully with exponential backoff
      • Scanner V4 processing stays under configured limit
      • End-to-end latency acceptable (ACK within 10s for accepted reports)

      Metrics Added

      Central: accept/reject counters per cluster
      Sensor: ACK/NACK forwarding counters
      Compliance: retry attempts and duration histograms

      Full visibility into entire flow from Compliance → Sensor → Central → Scanner V4 and back.

        prygiels@redhat.com Piotr Rygielski
        ACS Sensor & Ecosystem