Bug
Resolution: Unresolved
Blocker
The Resource Optimization plugin's proxy to the Cost Management API returns intermittent 503 (Service Unavailable) and 504 (Gateway Timeout) errors on initial requests to pagination and detail endpoints. The same requests eventually succeed, apparently once the OAuth token is cached, which points to a proxy timeout configuration issue.
Description
The Resource Optimization backend plugin proxies requests to https://console.redhat.com/api/cost-management/v1 through Backstage's proxy middleware. While the main list endpoint (/recommendations/openshift?offset=0&limit=10) works consistently once initially loaded, the following scenarios experience intermittent timeouts:
- Individual row detail requests (/recommendations/openshift/{uuid}) - First request often times out with 504, subsequent requests succeed
- Pagination requests (/recommendations/openshift?offset=10&limit=10) - First request to new offset often times out with 504
- Retry pattern - Failed requests eventually succeed when retried after a few seconds/minutes
Environment
- RHDH Version: 1.8 STABLE-RC
- Resource Optimization Plugin Version: 1.2.1
- Orchestrator Plugin Version: 1.8.0-rc.3
- Cluster: OpenShift (tested on ocp-edge73-0)
- Namespace: rhdh-operator
Steps to Reproduce
- Deploy RHDH 1.8 STABLE-RC with Resource Optimization plugin using --use-latest-stage-plugins
- Deploy resource optimization plugin with valid ROS_CLIENT_ID and ROS_CLIENT_SECRET
- Navigate to Optimizations tab in Backstage UI
- Observe that main table loads successfully (first page, offset=0)
- Click "Next Page" or paginate to offset=10
- Result: Request times out with 503/504 error
- Click on a row to view details
- Result: Detail request times out with 503/504 error
- Wait 1-2 minutes and retry the same actions
- Result: Requests now succeed (200 OK)
Expected Behavior
All proxy requests to Cost Management API should succeed consistently without timeout errors, regardless of:
- Endpoint path (list, pagination, detail)
- Request timing (initial vs subsequent)
- Token acquisition status
Actual Behavior
- Initial requests to pagination and detail endpoints time out (504 Gateway Timeout)
- Error occurs within ~0.6-1 second (indicating proxy timeout, not network timeout)
- After waiting a few minutes, the same requests succeed
- The pattern suggests that the first request triggers OAuth token acquisition, which takes longer than the proxy timeout allows
Error Logs
proxy error [HPM] Error occurred while proxying request .../recommendations/openshift/488e8d31-f93f-4fcb-9cce-f1c72dbdbcb6 to https://console.redhat.com/api/cost-management/v1 [ETIMEDOUT]
rootHttpRouter info "GET /api/proxy/cost-management/v1/recommendations/openshift?offset=10&limit=10&order_by=last_reported&order_how=desc HTTP/1.1" 504
Root Cause Analysis
- Proxy Configuration: Proxy endpoint uses default timeout (likely 5-10 seconds)
- Token Acquisition: Backend plugin acquires OAuth tokens from Red Hat SSO on-demand
- First Request Delay: Initial requests to new endpoints require two sequential steps, and their combined time exceeds the proxy timeout:
  - Token fetch from https://sso.redhat.com/auth/realms/redhat-external/protocol/openid-connect/token
  - API call to Cost Management endpoint
- Subsequent Requests: Once token is cached, requests are faster and succeed
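The reasoning above can be sketched as a small latency model. This is only an illustration of the hypothesis, not the plugin's code: the names `LatencyModel` and `requestOutcome` are hypothetical, and the millisecond values are made-up placeholders rather than measurements from this environment.

```typescript
// Hypothetical latency model of the suspected root cause.
// All numbers below are illustrative assumptions, not measured values.
interface LatencyModel {
  tokenFetchMs: number;   // time to obtain an OAuth token from SSO
  apiCallMs: number;      // time for the Cost Management API call itself
  proxyTimeoutMs: number; // time budget the proxy allows per request
}

interface Outcome {
  totalMs: number;
  status: 200 | 504;
}

// A request pays for the token fetch only when no token is cached.
// If the total exceeds the proxy budget, the model reports 504.
function requestOutcome(model: LatencyModel, tokenCached: boolean): Outcome {
  const totalMs = (tokenCached ? 0 : model.tokenFetchMs) + model.apiCallMs;
  return { totalMs, status: totalMs > model.proxyTimeoutMs ? 504 : 200 };
}

const illustrativeModel: LatencyModel = {
  tokenFetchMs: 800,
  apiCallMs: 400,
  proxyTimeoutMs: 1000,
};

console.log(requestOutcome(illustrativeModel, false)); // first request, no cached token
console.log(requestOutcome(illustrativeModel, true));  // later request, token cached
```

Under these assumed numbers the uncached request overruns the budget while the cached one fits, which matches the observed pattern of a failing first request followed by successful retries.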
Proxy Configuration
Current configuration in dynamic-plugins ConfigMap:
proxy:
  endpoints:
    '/cost-management/v1':
      target: https://console.redhat.com/api/cost-management/v1
      allowedHeaders: ['Authorization']
      credentials: dangerously-allow-unauthenticated
Issue: No timeout configuration option is available or exposed for this endpoint
Credentials Verification
- ROS_CLIENT_ID and ROS_CLIENT_SECRET are correctly stored in backstage-backend-auth-secret
- Environment variables are correctly set in Backstage pod
- Direct API calls with credentials work successfully (200 OK responses)
- Token acquisition succeeds when tested directly
Workaround
- Wait 2-3 minutes after pod restart for token cache to populate
- Retry failed requests after initial timeout
- Use pagination slowly to allow time for token caching
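The manual "wait and retry" workaround could in principle be automated on the client side. The following is a minimal sketch under assumptions: `retryOnGatewayError`, `ProxyResponse`, and the delay values are hypothetical names and numbers chosen for illustration, and the plugin's frontend is not known to do anything like this today.

```typescript
// Hypothetical client-side helper: retry a request while the proxy
// answers 503 or 504, waiting longer before each new attempt.
interface ProxyResponse {
  status: number;
}

type Sleep = (ms: number) => Promise<void>;

const realSleep: Sleep = ms =>
  new Promise<void>(resolve => setTimeout(resolve, ms));

async function retryOnGatewayError<T extends ProxyResponse>(
  attempt: () => Promise<T>, // performs one request through the proxy
  maxRetries: number,        // retries allowed after the first attempt
  baseDelayMs: number,       // delay before the first retry
  sleep: Sleep = realSleep,  // injectable so tests need not wait
): Promise<T> {
  let response = await attempt();
  let retry = 0;
  while (
    retry < maxRetries &&
    (response.status === 503 || response.status === 504)
  ) {
    // Exponential backoff: baseDelayMs, 2x, 4x, ...
    await sleep(baseDelayMs * Math.pow(2, retry));
    response = await attempt();
    retry++;
  }
  // Either a non-gateway-error response or the last failed one.
  return response;
}
```

A caller would pass a function that issues the proxied request, for example the detail request for one recommendation UUID, and inspect the returned status. This only hides the symptom; it does not address the timeout itself.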
Recommended Fix
- Add timeout configuration to proxy endpoint configuration (e.g., timeout: 30000 for 30 seconds)
- Implement token pre-fetching - Acquire OAuth token during plugin initialization rather than on first request
- Increase default proxy timeout - If configurable timeout isn't possible, increase default timeout for cost-management endpoints
- Better error handling - Return more descriptive errors when timeout occurs (e.g., "Token acquisition in progress, please retry")
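The token pre-fetching recommendation could look roughly like the sketch below. It is an assumption-laden illustration, not the plugin's actual implementation: `TokenCache`, `OAuthToken`, the injected `fetchToken` function, and the 30-second refresh margin are all hypothetical. The real token request against Red Hat SSO is deliberately left to the injected function.

```typescript
// Hypothetical token cache that can be warmed during plugin startup.
interface OAuthToken {
  value: string;
  expiresAtMs: number; // absolute expiry time in epoch milliseconds
}

class TokenCache {
  private readonly fetchToken: () => Promise<OAuthToken>;
  private readonly now: () => number;
  private readonly refreshMarginMs: number;
  private token: OAuthToken | undefined;
  private inFlight: Promise<OAuthToken> | undefined;

  constructor(
    fetchToken: () => Promise<OAuthToken>, // performs the real SSO call
    now: () => number = Date.now,          // injectable clock for tests
    refreshMarginMs: number = 30_000,      // refresh this long before expiry
  ) {
    this.fetchToken = fetchToken;
    this.now = now;
    this.refreshMarginMs = refreshMarginMs;
  }

  // Call once during initialization so the first user request
  // does not pay for token acquisition.
  prefetch(): Promise<OAuthToken> {
    return this.get();
  }

  async get(): Promise<OAuthToken> {
    const cached = this.token;
    if (cached && cached.expiresAtMs - this.refreshMarginMs > this.now()) {
      return cached;
    }
    // Single-flight: concurrent callers share one token request.
    let pending = this.inFlight;
    if (!pending) {
      pending = this.fetchToken()
        .then(fresh => {
          this.token = fresh;
          return fresh;
        })
        .finally(() => {
          this.inFlight = undefined;
        });
      this.inFlight = pending;
    }
    return pending;
  }
}
```

Sharing one in-flight request means that a burst of pagination and detail requests arriving right after startup would trigger a single token fetch rather than one per request. Refreshing slightly before expiry keeps later requests from hitting the same cold path.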
Impact
- Severity: Medium
- User Experience: Degraded - Users experience timeouts when navigating optimization data
- Frequency: Intermittent - Affects first requests to pagination/detail endpoints
- Workaround Available: Yes (wait and retry)
Additional Notes
- Main list endpoint (offset=0) consistently works
- Detail endpoints that initially time out eventually succeed (200 OK) after token caching
- Pattern is consistent across different recommendation UUIDs and pagination offsets
- Issue appears to be plugin/backend configuration limitation, not API availability
Technical Details
- Proxy Middleware: http-proxy-middleware (HPM) via Backstage proxy plugin
- Error Pattern: ETIMEDOUT errors within 0.6-1 second
- Token Source: Red Hat SSO (https://sso.redhat.com/auth/realms/redhat-external)
- API Target: https://console.redhat.com/api/cost-management/v1
- Plugin: @red-hat-developer-hub/plugin-redhat-resource-optimization-backend-dynamic@1.2.1