Type: Story
Resolution: Done
Priority: Undefined
Fix Version/s: 1.2.0
Affects Version/s: None
Component/s: None
Labels:
- token-rate-limiting

Blocked: False
Blocked Reason: None
Ready: False

As a platform engineer, I want to enforce output token rate limits at the Gateway and HTTPRoute level so that I can control LLM model usage costs by rejecting requests from users who exceed their token quota for a given window (daily, hourly, and so on).

This story involves implementing output token-based rate limiting in a v1alpha TokenRateLimitPolicy API. The enforcement model is delayed: token usage is known only after the model response is parsed, so a limit is applied on the next request after an overage is detected.
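The delayed enforcement model can be illustrated with a minimal in-memory sketch. This is not the wasm-shim or Limitador implementation; the class name `DelayedTokenLimiter`, its methods, and the limit of 100 tokens are hypothetical and exist only to show the ordering of the check and the increment.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class DelayedTokenLimiter:
    """Hypothetical in-memory stand-in for the counter Limitador would hold."""
    limit: int  # maximum output tokens allowed (illustrative value only)
    used: Dict[str, int] = field(default_factory=dict)

    def allow(self, user: str) -> bool:
        # Checked before the request reaches the model: reject only if
        # earlier responses already pushed the user past the limit.
        return self.used.get(user, 0) <= self.limit

    def record(self, user: str, output_tokens: int) -> None:
        # Called after the response is parsed, which is the first point
        # at which the token count is known.
        self.used[user] = self.used.get(user, 0) + output_tokens


limiter = DelayedTokenLimiter(limit=100)
outcomes = []
for tokens in (60, 70, 10):
    if limiter.allow("alice"):
        limiter.record("alice", tokens)
        outcomes.append("served")
    else:
        outcomes.append("rejected")
print(outcomes)
```

The second request is served even though it takes the total to 130 tokens, because the overage is only detected once its response has been parsed. The third request is the first one to be rejected.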

Technical points:

The policy must support both Gateway and HTTPRoute targetRef types, with scoping via defaults and overrides.
Output tokens must be parsed from the response body, which could be complete or streamed (chunked).
Parsed values include completion_tokens and/or output_tokens depending on the model's format (OpenAI-style, llama-stack, etc.).
Once parsed, the wasm-shim must increment a counter via Limitador to track token usage per user or group.
Policy authors should be able to configure what happens when a limit is breached (e.g. HTTP status, custom body).
CEL predicates will be used to extract user identity (e.g. auth.identity.userid) to enforce per-user quotas.
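As a rough sketch of the parsing step for a complete (non-streamed) body, the function below looks for completion_tokens and then output_tokens. It assumes both fields sit under a top-level `usage` object, which matches OpenAI-style responses but is an assumption for other formats. The function name `extract_output_tokens` is hypothetical.

```python
import json
from typing import Optional


def extract_output_tokens(body: bytes) -> Optional[int]:
    """Return the output token count from a complete JSON body, or None."""
    try:
        payload = json.loads(body)
    except ValueError:
        # Covers both invalid JSON and undecodable bytes.
        return None
    usage = payload.get("usage") if isinstance(payload, dict) else None
    if not isinstance(usage, dict):
        return None
    # Assumption: either field name may appear, depending on the model format.
    for key in ("completion_tokens", "output_tokens"):
        value = usage.get(key)
        if isinstance(value, int) and not isinstance(value, bool) and value >= 0:
            return value
    return None


openai_style = b'{"usage": {"prompt_tokens": 5, "completion_tokens": 42, "total_tokens": 47}}'
other_style = b'{"usage": {"output_tokens": 7}}'
print(extract_output_tokens(openai_style), extract_output_tokens(other_style))
```

Returning None rather than raising keeps a malformed or unexpected body from failing the request; the caller can log the miss and skip the counter increment.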

Limitations for the first iteration:

Limits are enforced only on the next request after usage exceeds the quota.
Streaming responses are not fully supported; any support may be partial, depending on when usage metrics arrive in the stream.
Metric parsing is not yet resilient to different response formats.
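The streaming limitation can be seen in a small sketch. It assumes an OpenAI-style server-sent event stream in which each event is a `data:` line and the usage object, if present at all, arrives in one of the final chunks before `[DONE]`. The function name `usage_from_sse` and the sample events are hypothetical.

```python
import json
from typing import Iterable, Optional


def usage_from_sse(lines: Iterable[str]) -> Optional[int]:
    """Scan SSE lines and return the last completion_tokens value seen."""
    found = None
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        try:
            event = json.loads(data)
        except ValueError:
            continue  # skip chunks that are not valid JSON
        usage = event.get("usage") if isinstance(event, dict) else None
        if isinstance(usage, dict) and isinstance(usage.get("completion_tokens"), int):
            found = usage["completion_tokens"]
    return found


complete_stream = [
    'data: {"choices": [{"delta": {"content": "Hel"}}], "usage": null}',
    'data: {"choices": [{"delta": {"content": "lo"}}], "usage": null}',
    'data: {"choices": [], "usage": {"completion_tokens": 12}}',
    "data: [DONE]",
]
truncated_stream = complete_stream[:2]  # client disconnects before usage arrives
print(usage_from_sse(complete_stream), usage_from_sse(truncated_stream))
```

If the stream ends before the usage chunk arrives, nothing can be counted for that request, which is one way partial support can show up.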

Acceptance Criteria

TokenRateLimitPolicy supports both Gateway and HTTPRoute as targetRef
- Policy attachment works at both levels
- Scoping via defaults and overrides is respected per GEP-2649
- Policies can be defined with when clauses to scope by user, tenant, or other conditions
Response token usage is parsed from the response body
- Parsing works for OpenAI-style responses with a usage field
Response token usage is counted via Limitador
- Counter increments respect the correct rate limit window (e.g., per day/hour)
Enforcement occurs on the next request after quota is exceeded
- If a user exceeds their quota in a prior request, the next request will be rejected
- No inference call is made if the user has already exceeded the limit
Custom rejection responses are configurable
- Policy authors can define the status code, response body, and headers to be returned on limit breach
- The configured response is returned on enforcement (e.g., HTTP 429 with custom JSON)
Usage parsing for streaming responses is not supported at this time.
Policy configuration supports CEL predicates
- Users can define predicates such as 'auth.identity.userid == "123"' to scope rate limits
- Invalid predicates result in validation failure or safe no-op behavior
Logs or events are available for debugging
- When a response is parsed, log entries indicate number of tokens and counter increment
- If enforcement is triggered, the user and reason are visible in logs
v1alpha schema is documented
- TokenRateLimitPolicy supports limit, when, counter, and response fields
- Example YAML manifests are provided for both Gateway and HTTPRoute targets
Known limitations are documented
- Documentation clearly states that enforcement is delayed
- Limitations around streaming and response format resilience are called out
- Expected behavior on malformed responses is described (e.g., parsing errors are logged but request completes)
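The window and rejection criteria above can be sketched together. This sketch assumes fixed windows keyed by user and window index, and a configurable rejection with a status, headers, and body. The names `Rejection` and `WindowedTokenCounter`, and all default values, are hypothetical and do not reflect the TokenRateLimitPolicy schema or the Limitador API.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class Rejection:
    """Hypothetical shape for a policy-configured rejection response."""
    status: int = 429
    headers: Dict[str, str] = field(
        default_factory=lambda: {"content-type": "application/json"})
    body: str = '{"error": "token quota exceeded"}'


class WindowedTokenCounter:
    """Counts output tokens per user within fixed windows (an assumption)."""

    def __init__(self, limit: int, window_seconds: int,
                 rejection: Optional[Rejection] = None) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self.rejection = rejection or Rejection()
        self._used: Dict[Tuple[str, int], int] = {}

    def _key(self, user: str, now: float) -> Tuple[str, int]:
        # A new window index starts a fresh counter for the same user.
        return (user, int(now // self.window_seconds))

    def check(self, user: str, now: float) -> Optional[Rejection]:
        # Runs before the inference call; returns the configured rejection
        # only if usage in the current window already exceeds the limit.
        used = self._used.get(self._key(user, now), 0)
        return self.rejection if used > self.limit else None

    def add(self, user: str, tokens: int, now: float) -> None:
        # Runs after the response has been parsed.
        key = self._key(user, now)
        self._used[key] = self._used.get(key, 0) + tokens


counter = WindowedTokenCounter(limit=100, window_seconds=3600)
counter.add("alice", 150, now=10.0)
blocked = counter.check("alice", now=20.0)    # same hourly window
fresh = counter.check("alice", now=3700.0)    # next hourly window
print(blocked.status if blocked else None, fresh)
```

Because the check runs before the inference call, a rejected request never reaches the model, and the counter resets implicitly when the window index changes.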

links to

https://github.com/Kuadrant/kuadrant-operator/issues/1330

Assignee: David Martin
Reporter: David Martin
Votes: 0
Watchers: 1
Created: 2025/07/24 8:47 AM
Updated: 2025/11/26 12:04 PM
Resolved: 2025/11/26 12:04 PM
