Connectivity Link / CONNLINK-483

Output Token Rate Limiting at Gateway and HTTPRoute Level

    • Story
    • Resolution: Done
    • 1.2.0

      As a platform engineer, I want to enforce output token rate limits at the Gateway and HTTPRoute level so that I can control LLM usage costs by rejecting requests from users who exceed their token quota for a given window (daily, hourly, and so on).

      This story involves implementing output token-based rate limiting in a v1alpha TokenRateLimitPolicy API. Enforcement is delayed: token usage is only known after the model response is parsed, so a limit is applied on the next request once an overage is detected.
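
The delayed enforcement model described above can be illustrated with a minimal in-memory sketch. It is not the wasm-shim or Limitador implementation: the class name `DelayedTokenLimiter`, its methods, the fixed-window strategy, and the choice to reject only when usage is strictly above the limit are all assumptions made for illustration.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


@dataclass
class DelayedTokenLimiter:
    """Hypothetical fixed-window output-token quota with delayed enforcement."""
    limit: int                       # max output tokens per window
    window_seconds: int              # e.g. 3600 for hourly, 86400 for daily
    clock: Callable[[], float] = time.time
    # user -> (window index, tokens used in that window)
    _usage: Dict[str, Tuple[int, int]] = field(default_factory=dict)

    def _current(self, user: str) -> Tuple[int, int]:
        window = int(self.clock() // self.window_seconds)
        stored_window, used = self._usage.get(user, (window, 0))
        if stored_window != window:
            used = 0                 # a new window started, so usage resets
        return window, used

    def allow(self, user: str) -> bool:
        """Checked before forwarding: reject only if usage already exceeds the limit."""
        _, used = self._current(user)
        return used <= self.limit

    def record(self, user: str, output_tokens: int) -> None:
        """Called after the response is parsed; this is where overage is detected."""
        window, used = self._current(user)
        self._usage[user] = (window, used + output_tokens)
```

In this sketch a request that pushes a user over quota still completes, because `record` runs only after the response is available; the following `allow` call for that user returns `False` until the window rolls over.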

      Technical points:

      • The policy must support both Gateway and HTTPRoute targetRef types, with scoping via defaults and overrides.
      • Output tokens must be parsed from the response body, which could be complete or streamed (chunked).
      • Parsed values include completion_tokens and/or output_tokens depending on the model's format (OpenAI-style, llama-stack, etc.).
      • Once parsed, the wasm-shim must increment a counter via Limitador to track token usage per user or group.
      • Policy authors should be able to configure what happens when a limit is breached (e.g. HTTP status, custom body).
      • CEL expressions will be used to extract the user identity (e.g. auth.identity.userid) so that quotas can be enforced per user.
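
As a rough illustration of the parsing step for a complete (non-streamed) response, the sketch below reads the output token count from a JSON body. The function name `extract_output_tokens` is hypothetical, and the sketch assumes both completion_tokens and output_tokens appear inside a top-level usage object; the ticket only confirms the usage field for OpenAI-style responses, so other formats may be laid out differently.

```python
import json
from typing import Optional

# Field names mentioned in the ticket, tried in this order.
USAGE_KEYS = ("completion_tokens", "output_tokens")


def extract_output_tokens(body: bytes) -> Optional[int]:
    """Return the output token count from a JSON response body, or None."""
    try:
        payload = json.loads(body)
    except ValueError:
        # Covers invalid JSON and undecodable bytes; the caller decides how to log it.
        return None
    if not isinstance(payload, dict):
        return None
    usage = payload.get("usage")
    if not isinstance(usage, dict):
        return None
    for key in USAGE_KEYS:
        value = usage.get(key)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, int) and not isinstance(value, bool) and value >= 0:
            return value
    return None
```

Returning `None` rather than raising keeps a malformed response from failing the request; the caller can log the parsing problem and skip the counter increment.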

      Limitations for the first iteration:

      • Limits are enforced only on the next request after usage exceeds the quota.
      • Streaming support may be partial, depending on when usage metrics arrive in the stream.
      • Metric parsing is not yet resilient to different response formats.
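
To show why streaming is hard to support fully, here is a small sketch that scans a streamed response for usage data. It assumes OpenAI-style server-sent events, where each event is a `data:` line carrying JSON, the stream ends with `data: [DONE]`, and usage (when the client asks for it) arrives only in a late chunk. The function name `usage_from_sse` is hypothetical and this is not how the wasm-shim handles streams.

```python
import json
from typing import Iterable, Optional


def usage_from_sse(lines: Iterable[str]) -> Optional[int]:
    """Return completion_tokens from the last usage-bearing chunk, or None."""
    tokens: Optional[int] = None
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue                      # skip blank lines and comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break                         # end-of-stream sentinel
        try:
            event = json.loads(data)
        except ValueError:
            continue                      # tolerate a malformed chunk
        usage = event.get("usage") if isinstance(event, dict) else None
        if isinstance(usage, dict) and isinstance(usage.get("completion_tokens"), int):
            tokens = usage["completion_tokens"]
    return tokens
```

If the stream is cut off before the usage chunk arrives, or the backend never sends one, the function returns `None` and no tokens can be counted for that request.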

       

      Acceptance Criteria

      • TokenRateLimitPolicy supports both Gateway and HTTPRoute as targetRef
        • Policy attachment works at both levels
        • Scoping via defaults and overrides is respected per GEP-2649
        • Policies can be defined with when clauses to scope by user, tenant, or other conditions
      • Response token usage is parsed from the response body
        • Parsing works for OpenAI-style responses with a usage field
      • Response token usage is counted via Limitador
        • Counter increments respect the correct rate limit window (e.g., per day/hour)
      • Enforcement occurs on the next request after quota is exceeded
        • If a user exceeds their quota in a prior request, the next request will be rejected
        • No inference call is made if the user has already exceeded the limit
      • Custom rejection responses are configurable
        • Policy authors can define the status code, response body, and headers to be returned on limit breach
        • The configured response is returned on enforcement (e.g., HTTP 429 with custom JSON)
      • Usage parsing for streaming responses is not supported at this time.
      • Policy configuration supports CEL predicates
        • Users can define predicates such as 'auth.identity.userid == "123"' to scope rate limits
        • Invalid predicates result in validation failure or safe no-op behavior
      • Logs or events are available for debugging
        • When a response is parsed, log entries indicate number of tokens and counter increment
        • If enforcement is triggered, the user and reason are visible in logs
      • v1alpha schema is documented
        • TokenRateLimitPolicy supports limit, when, counter, and response fields
        • Example YAML manifests are provided for both Gateway and HTTPRoute targets
      • Known limitations are documented
        • Documentation clearly states that enforcement is delayed
        • Limitations around streaming and response format resilience are called out
        • Expected behavior on malformed responses is described (e.g., parsing errors are logged but request completes)
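
The configurable rejection response and the logging criteria above could look roughly like the following sketch. `RejectionConfig`, `reject`, the default body, and the log format are hypothetical names and values chosen for illustration; they are not the v1alpha schema.

```python
import logging
from dataclasses import dataclass, field
from typing import Dict, Tuple

logger = logging.getLogger("token_rate_limit_sketch")


@dataclass
class RejectionConfig:
    """Hypothetical shape for what a policy author might configure."""
    status: int = 429
    body: str = '{"error": "token quota exceeded"}'
    headers: Dict[str, str] = field(
        default_factory=lambda: {"Content-Type": "application/json"}
    )


def reject(user: str, used: int, limit: int,
           config: RejectionConfig) -> Tuple[int, Dict[str, str], str]:
    """Build the response returned instead of calling the model."""
    # Make the user and the reason visible in the logs for debugging.
    logger.info(
        "rejecting request: user=%s reason=token_quota_exceeded used=%d limit=%d",
        user, used, limit,
    )
    # Copy the headers so callers cannot mutate the shared configuration.
    return config.status, dict(config.headers), config.body
```

A caller would invoke `reject` before any inference call is made, for example `reject("alice", 150, 100, RejectionConfig())`, which with the defaults above yields a 429 status and the JSON error body.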

              davmarti@redhat.com David Martin