
Type: Bug
Resolution: Unresolved
Priority: Blocker
Fix Version/s: 1.9
Affects Version/s: None
Component/s: optimization-plugin, optimization-plugin-qe
Labels:
- optimization
- qe
- triaged

Blocked: False
Blocked Reason: None
Ready: False

The Resource Optimization plugin's proxy to the Cost Management API intermittently returns 503 (Service Unavailable) and 504 (Gateway Timeout) errors on initial requests to pagination and detail endpoints. The same requests eventually succeed once the OAuth token is cached, which points to a proxy timeout configuration issue.

Description

The Resource Optimization backend plugin proxies requests to https://console.redhat.com/api/cost-management/v1 through Backstage's proxy middleware. While the main list endpoint (/recommendations/openshift?offset=0&limit=10) works consistently once initially loaded, the following scenarios experience intermittent timeouts:

Individual row detail requests (/recommendations/openshift/{uuid}) - First request often times out with 504, subsequent requests succeed
Pagination requests (/recommendations/openshift?offset=10&limit=10) - First request to new offset often times out with 504
Retry pattern - Failed requests eventually succeed when retried after a few seconds/minutes

Environment

RHDH Version: 1.8 STABLE-RC
Resource Optimization Plugin Version: 1.2.1
Orchestrator Plugin Version: 1.8.0-rc.3
Cluster: OpenShift (tested on ocp-edge73-0)
Namespace: rhdh-operator

Steps to Reproduce

1. Deploy RHDH 1.8 STABLE-RC with the Resource Optimization plugin using --use-latest-stage-plugins
2. Deploy the Resource Optimization plugin with valid ROS_CLIENT_ID and ROS_CLIENT_SECRET
3. Navigate to the Optimizations tab in the Backstage UI
4. Observe that the main table loads successfully (first page, offset=0)
5. Click "Next Page" or paginate to offset=10
   - Result: Request times out with 503/504 error
6. Click on a row to view details
   - Result: Detail request times out with 503/504 error
7. Wait 1-2 minutes and retry the same actions
   - Result: Requests now succeed (200 OK)
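The manual steps above could also be timed with a small script. This is a minimal sketch, not a verified reproduction tool: probe and probePagination are hypothetical helpers, baseUrl is a placeholder for the RHDH route, and the path is taken from the log line in Error Logs. It assumes a runtime with a global fetch (for example Node 18 or later).

```typescript
interface ProbeResult {
  label: string;
  status: number;
  elapsedMs: number;
}

// Times a single request and records its HTTP status.
// `request` is injected so the helper can wrap fetch or any other client.
async function probe(
  label: string,
  request: () => Promise<{ status: number }>,
  now: () => number = Date.now,
): Promise<ProbeResult> {
  const start = now();
  const response = await request();
  return { label, status: response.status, elapsedMs: now() - start };
}

// Hypothetical driver: requests the first and second page through the proxy.
// baseUrl is an assumption (the externally reachable RHDH URL).
async function probePagination(baseUrl: string): Promise<ProbeResult[]> {
  const path = "/api/proxy/cost-management/v1/recommendations/openshift";
  const results: ProbeResult[] = [];
  for (const offset of [0, 10]) {
    const url = `${baseUrl}${path}?offset=${offset}&limit=10`;
    results.push(await probe(`offset=${offset}`, () => fetch(url)));
  }
  return results;
}
```

Running probePagination right after a pod restart and again a few minutes later would show whether status and elapsed time differ between the first and later requests.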

Expected Behavior

All proxy requests to Cost Management API should succeed consistently without timeout errors, regardless of:

Endpoint path (list, pagination, detail)
Request timing (initial vs subsequent)
Token acquisition status

Actual Behavior

Initial requests to pagination and detail endpoints time out (504 Gateway Timeout)
The error occurs within ~0.6-1 second (indicating a proxy timeout, not a network timeout)
After waiting a few minutes, the same requests succeed
The pattern suggests that the first request triggers OAuth token acquisition, which takes longer than the proxy timeout

Error Logs

proxy error [HPM] Error occurred while proxying request .../recommendations/openshift/488e8d31-f93f-4fcb-9cce-f1c72dbdbcb6 to https://console.redhat.com/api/cost-management/v1 [ETIMEDOUT]
rootHttpRouter info "GET /api/proxy/cost-management/v1/recommendations/openshift?offset=10&limit=10&order_by=last_reported&order_how=desc HTTP/1.1" 504

Root Cause Analysis

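Proxy Configuration: Proxy endpoint uses the default timeout (exact value not verified; assumed to be 5-10 seconds, although the errors observed under Actual Behavior occur within ~0.6-1 second)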
Token Acquisition: Backend plugin acquires OAuth tokens from Red Hat SSO on-demand
First Request Delay: Initial requests to new endpoints require:
- Token fetch from https://sso.redhat.com/auth/realms/redhat-external/protocol/openid-connect/token
- API call to Cost Management endpoint
- Combined time exceeds proxy timeout
Subsequent Requests: Once token is cached, requests are faster and succeed
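The suspected mechanism can be expressed as a simple latency budget. The sketch below only illustrates the reasoning; every number in it is a hypothetical placeholder, not a measurement from this environment.

```typescript
// All values are hypothetical placeholders, not measurements from this issue.
interface RequestTiming {
  tokenFetchMs: number; // time to obtain an OAuth token (0 when cached)
  apiCallMs: number;    // time for the Cost Management API call itself
}

// Returns true when the combined latency exceeds the proxy timeout,
// i.e. when the proxy would give up before the upstream call finishes.
function exceedsProxyTimeout(timing: RequestTiming, proxyTimeoutMs: number): boolean {
  return timing.tokenFetchMs + timing.apiCallMs > proxyTimeoutMs;
}

const proxyTimeoutMs = 1000; // assumed value
const firstRequest: RequestTiming = { tokenFetchMs: 800, apiCallMs: 400 };
const cachedRequest: RequestTiming = { tokenFetchMs: 0, apiCallMs: 400 };

console.log(exceedsProxyTimeout(firstRequest, proxyTimeoutMs));  // true
console.log(exceedsProxyTimeout(cachedRequest, proxyTimeoutMs)); // false
```

With a cached token, the token term drops to zero and the same request fits within the budget, which matches the observed retry pattern.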

Proxy Configuration

Current configuration in dynamic-plugins ConfigMap:

proxy: 
  endpoints: 
    '/cost-management/v1':
      target: https://console.redhat.com/api/cost-management/v1
      allowedHeaders: ['Authorization']
      credentials: dangerously-allow-unauthenticated

Issue: No timeout configuration option available/exposed

Credentials Verification

ROS_CLIENT_ID and ROS_CLIENT_SECRET are correctly stored in backstage-backend-auth-secret
Environment variables are correctly set in Backstage pod
Direct API calls with credentials work successfully (200 OK responses)
Token acquisition succeeds when tested directly

Workaround

Wait 2-3 minutes after pod restart for token cache to populate
Retry failed requests after initial timeout
Use pagination slowly to allow time for token caching
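The "retry after initial timeout" workaround could be automated on the caller side. This is a minimal sketch under assumptions: withRetry is a hypothetical helper, not part of the plugin, and the attempt count and delays are arbitrary starting values.

```typescript
interface HttpResult {
  status: number;
}

function sleep(ms: number): Promise<void> {
  return new Promise<void>(resolve => setTimeout(resolve, ms));
}

// Retries a request while it returns 503 or 504, doubling the delay each time.
// `wait` is injectable so the delay can be replaced in tests.
async function withRetry<T extends HttpResult>(
  request: () => Promise<T>,
  maxAttempts: number = 4,
  baseDelayMs: number = 1000,
  wait: (ms: number) => Promise<void> = sleep,
): Promise<T> {
  let result = await request();
  let delayMs = baseDelayMs;
  for (let attempt = 1; attempt < maxAttempts; attempt++) {
    const retryable = result.status === 503 || result.status === 504;
    if (!retryable) {
      break;
    }
    await wait(delayMs); // 1s, 2s, 4s, ... with the defaults
    delayMs *= 2;
    result = await request();
  }
  return result; // last response, successful or not
}
```

This only hides the symptom from the user; the first request still fails at the proxy.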

Recommended Fix

Add timeout configuration to proxy endpoint configuration (e.g., timeout: 30000 for 30 seconds)
Implement token pre-fetching - Acquire OAuth token during plugin initialization rather than on first request
Increase default proxy timeout - If configurable timeout isn't possible, increase default timeout for cost-management endpoints
Better error handling - Return more descriptive errors when timeout occurs (e.g., "Token acquisition in progress, please retry")
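Token pre-fetching (second item above) could look roughly like the sketch below. It is not the plugin's actual implementation: TokenCache, TokenFetcher, and the 60-second refresh margin are assumptions introduced for illustration, and the fetcher is injected rather than calling Red Hat SSO directly.

```typescript
// Hypothetical token shape; expiresInSec corresponds to OAuth 2.0 "expires_in".
interface FetchedToken {
  accessToken: string;
  expiresInSec: number;
}

type TokenFetcher = () => Promise<FetchedToken>;

class TokenCache {
  private fetchToken: TokenFetcher;
  private refreshMarginMs: number;
  private now: () => number;
  private accessToken?: string;
  private expiresAtMs = 0;
  private inFlight?: Promise<string>;

  constructor(fetchToken: TokenFetcher, refreshMarginMs: number = 60000, now: () => number = Date.now) {
    this.fetchToken = fetchToken;
    this.refreshMarginMs = refreshMarginMs;
    this.now = now;
  }

  // Call once during plugin initialization so the first user request
  // does not pay for the token round trip.
  async prefetch(): Promise<void> {
    await this.getToken();
  }

  async getToken(): Promise<string> {
    if (this.accessToken !== undefined && this.now() < this.expiresAtMs - this.refreshMarginMs) {
      return this.accessToken;
    }
    // Share one in-flight fetch between concurrent callers.
    const pending = this.inFlight ?? this.refresh();
    this.inFlight = pending;
    try {
      return await pending;
    } finally {
      if (this.inFlight === pending) {
        this.inFlight = undefined;
      }
    }
  }

  private async refresh(): Promise<string> {
    const fetched = await this.fetchToken();
    this.accessToken = fetched.accessToken;
    this.expiresAtMs = this.now() + fetched.expiresInSec * 1000;
    return fetched.accessToken;
  }
}
```

Calling prefetch() at startup moves the token round trip out of the first proxied request, and sharing the in-flight promise avoids several parallel token fetches when the UI issues multiple requests at once.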

Impact

Severity: Medium
User Experience: Degraded - Users experience timeouts when navigating optimization data
Frequency: Intermittent - Affects first requests to pagination/detail endpoints
Workaround Available: Yes (wait and retry)

Additional Notes

Main list endpoint (offset=0) consistently works
Detail endpoints that initially timeout eventually succeed (200 OK) after token caching
Pattern is consistent across different recommendation UUIDs and pagination offsets
Issue appears to be plugin/backend configuration limitation, not API availability

Technical Details

Proxy Middleware: http-proxy-middleware (HPM) via Backstage proxy plugin
Error Pattern: ETIMEDOUT errors within 0.6-1 second
Token Source: Red Hat SSO (https://sso.redhat.com/auth/realms/redhat-external)
API Target: https://console.redhat.com/api/cost-management/v1
Plugin: @red-hat-developer-hub/plugin-redhat-resource-optimization-backend-dynamic@1.2.1
