FlightPath / FLPATH-2832

Resource Optimization plugin proxy timeout issues on pagination and detail requests



      Resource Optimization plugin proxy to Cost Management API experiences intermittent 504/503 Gateway Timeout errors on initial requests to pagination and detail endpoints. Requests eventually succeed after token caching, indicating a proxy timeout configuration issue.

      Description

      The Resource Optimization backend plugin proxies requests to https://console.redhat.com/api/cost-management/v1 through Backstage's proxy middleware. While the main list endpoint (/recommendations/openshift?offset=0&limit=10) works consistently after its initial load, the following scenarios experience intermittent timeouts:

      • Individual row detail requests (/recommendations/openshift/{uuid}) - First request often times out with 504, subsequent requests succeed
      • Pagination requests (/recommendations/openshift?offset=10&limit=10) - First request to new offset often times out with 504
      • Retry pattern - Failed requests eventually succeed when retried after a few seconds to minutes (see the probe sketch after this list)
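
      The intermittent pattern can be exercised outside the UI with a small probe against the RHDH proxy route. This is a minimal sketch, assuming Node 18+ (global fetch) and a hypothetical BACKSTAGE_URL environment variable pointing at the RHDH route; the request path mirrors the 504 seen in the error logs below:

      const BACKSTAGE_URL = process.env.BACKSTAGE_URL ?? 'http://localhost:7007';
      // Same path that returned 504 in the proxy logs (pagination to offset=10).
      const path =
        '/api/proxy/cost-management/v1/recommendations/openshift?offset=10&limit=10&order_by=last_reported&order_how=desc';

      async function probe(attempts = 3, delayMs = 30_000): Promise<void> {
        for (let i = 1; i <= attempts; i++) {
          const started = Date.now();
          const res = await fetch(`${BACKSTAGE_URL}${path}`);
          console.log(`attempt ${i}: HTTP ${res.status} after ${Date.now() - started} ms`);
          if (res.ok) return; // expected to succeed once the upstream token is cached
          await new Promise(r => setTimeout(r, delayMs)); // wait before retrying, mirroring the observed behaviour
        }
      }

      probe().catch(err => console.error(err));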

      Environment

      • RHDH Version: 1.8 STABLE-RC
      • Resource Optimization Plugin Version: 1.2.1
      • Orchestrator Plugin Version: 1.8.0-rc.3
      • Cluster: OpenShift (tested on ocp-edge73-0)
      • Namespace: rhdh-operator

      Steps to Reproduce

      1. Deploy RHDH 1.8 STABLE-RC with the Resource Optimization plugin using --use-latest-stage-plugins
      2. Configure the Resource Optimization plugin with valid ROS_CLIENT_ID and ROS_CLIENT_SECRET
      3. Navigate to Optimizations tab in Backstage UI
      4. Observe that main table loads successfully (first page, offset=0)
      5. Click "Next Page" or paginate to offset=10
        • Result: Request times out with 503/504 error
      6. Click on a row to view details
        • Result: Detail request times out with 503/504 error
      7. Wait 1-2 minutes and retry the same actions
        • Result: Requests now succeed (200 OK)

      Expected Behavior

      All proxy requests to Cost Management API should succeed consistently without timeout errors, regardless of:

      • Endpoint path (list, pagination, detail)
      • Request timing (initial vs subsequent)
      • Token acquisition status

      Actual Behavior

      • Initial requests to pagination and detail endpoints time out (504 Gateway Timeout)
      • Error occurs within ~0.6-1 second (indicating proxy timeout, not network timeout)
      • After waiting a few minutes, the same requests succeed
      • Pattern suggests the first request triggers OAuth token acquisition, which takes longer than the proxy timeout

      Error Logs

      proxy error [HPM] Error occurred while proxying request .../recommendations/openshift/488e8d31-f93f-4fcb-9cce-f1c72dbdbcb6 to https://console.redhat.com/api/cost-management/v1 [ETIMEDOUT]
      rootHttpRouter info "GET /api/proxy/cost-management/v1/recommendations/openshift?offset=10&limit=10&order_by=last_reported&order_how=desc HTTP/1.1" 504
      

      Root Cause Analysis

      1. Proxy Configuration: The proxy endpoint uses the default timeout (likely 5-10 seconds)
      2. Token Acquisition: The backend plugin acquires OAuth tokens from Red Hat SSO on demand (a sketch of this pattern follows the list)
      3. First Request Delay: The first request to an endpoint must wait for the full SSO round trip, which can exceed the proxy timeout
      4. Subsequent Requests: Once the token is cached, requests are faster and succeed
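
      The timing pattern is consistent with an on-demand token cache. The following is a hypothetical illustration only (the plugin's actual internals are not shown in this report): the first caller pays the full SSO round trip, which can exceed the proxy's upstream timeout, while later callers reuse the cached token.

      let cachedToken: { value: string; expiresAt: number } | undefined;

      async function getToken(
        // fetchTokenFromSso stands in for whatever SSO call the plugin makes; it is not a real API.
        fetchTokenFromSso: () => Promise<{ access_token: string; expires_in: number }>,
      ): Promise<string> {
        if (cachedToken && cachedToken.expiresAt > Date.now()) {
          return cachedToken.value; // fast path: token already cached, request proceeds immediately
        }
        const t = await fetchTokenFromSso(); // slow path: full OAuth round trip on the first request
        cachedToken = {
          value: t.access_token,
          expiresAt: Date.now() + (t.expires_in - 60) * 1000, // refresh a minute early
        };
        return cachedToken.value;
      }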

      Proxy Configuration

      Current configuration in dynamic-plugins ConfigMap:

      proxy: 
        endpoints: 
          '/cost-management/v1':
            target: https://console.redhat.com/api/cost-management/v1
            allowedHeaders: ['Authorization']
            credentials: dangerously-allow-unauthenticated
      

      Issue: No timeout configuration option available/exposed
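
      For reference, Backstage's proxy backend is built on http-proxy-middleware, which does accept timeout options at the library level. A minimal standalone sketch (assuming http-proxy-middleware v2 and Express; this is not something the current dynamic-plugin configuration exposes) of where a longer upstream timeout would apply:

      import express from 'express';
      import { createProxyMiddleware } from 'http-proxy-middleware';

      const app = express();

      app.use(
        '/api/proxy/cost-management/v1',
        createProxyMiddleware({
          target: 'https://console.redhat.com',
          changeOrigin: true,
          // Forward /api/proxy/cost-management/v1/* to /api/cost-management/v1/* upstream.
          pathRewrite: { '^/api/proxy/cost-management/v1': '/api/cost-management/v1' },
          proxyTimeout: 30_000, // wait up to 30 s for the upstream response
          timeout: 30_000, // keep the incoming client socket open for the same window
        }),
      );

      app.listen(7007);

      If the proxy plugin exposed these two options through app-config, recommended fix #1 below would amount to setting them per endpoint.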

      Credentials Verification

      • ROS_CLIENT_ID and ROS_CLIENT_SECRET are correctly stored in backstage-backend-auth-secret
      • Environment variables are correctly set in Backstage pod
      • Direct API calls with credentials work successfully (200 OK responses)
      • Token acquisition succeeds when tested directly (see the verification sketch below)
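
      A hedged sketch of that direct verification, assuming the standard Red Hat SSO client-credentials flow (realm redhat-external) is what the plugin uses; adjust the token endpoint and scope if the deployment differs:

      // ROS_CLIENT_ID / ROS_CLIENT_SECRET are the values stored in backstage-backend-auth-secret.
      const tokenUrl =
        'https://sso.redhat.com/auth/realms/redhat-external/protocol/openid-connect/token';

      async function verify(): Promise<void> {
        const tokenStart = Date.now();
        const tokenRes = await fetch(tokenUrl, {
          method: 'POST',
          headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
          body: new URLSearchParams({
            grant_type: 'client_credentials',
            client_id: process.env.ROS_CLIENT_ID!,
            client_secret: process.env.ROS_CLIENT_SECRET!,
            // Some service accounts also require a scope such as 'api.console'; add it here if needed.
          }),
        });
        const { access_token } = (await tokenRes.json()) as { access_token: string };
        console.log(`token: HTTP ${tokenRes.status} in ${Date.now() - tokenStart} ms`);

        // Call the Cost Management API directly with the freshly acquired token.
        const apiRes = await fetch(
          'https://console.redhat.com/api/cost-management/v1/recommendations/openshift?offset=0&limit=10',
          { headers: { Authorization: `Bearer ${access_token}` } },
        );
        console.log(`api: HTTP ${apiRes.status}`);
      }

      verify().catch(err => console.error(err));

      Logging the token round-trip time here also shows whether acquisition alone exceeds the proxy timeout observed above.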

      Workaround

      • Wait 2-3 minutes after pod restart for token cache to populate
      • Retry failed requests after the initial timeout (see the retry sketch after this list)
      • Use pagination slowly to allow time for token caching
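
      A minimal retry-with-backoff sketch that callers of the proxy route could use until the timeout is fixed; the function name and delays are illustrative, not part of the plugin:

      async function fetchWithRetry(url: string, attempts = 3, baseDelayMs = 5_000): Promise<Response> {
        let lastError: unknown;
        for (let i = 0; i < attempts; i++) {
          try {
            const res = await fetch(url);
            // Only retry the gateway-timeout statuses seen in this issue.
            if (res.status !== 503 && res.status !== 504) return res;
          } catch (err) {
            lastError = err;
          }
          await new Promise(r => setTimeout(r, baseDelayMs * (i + 1))); // linear backoff: 5 s, 10 s, ...
        }
        throw lastError ?? new Error(`Gave up after ${attempts} attempts: ${url}`);
      }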

      Recommended Fix

      1. Add timeout configuration to proxy endpoint configuration (e.g., timeout: 30000 for 30 seconds)
      2. Implement token pre-fetching - Acquire the OAuth token during plugin initialization rather than on the first request (see the sketch after this list)
      3. Increase default proxy timeout - If configurable timeout isn't possible, increase default timeout for cost-management endpoints
      4. Better error handling - Return more descriptive errors when timeout occurs (e.g., "Token acquisition in progress, please retry")
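
      A sketch of what fix #2 could look like, assuming the plugin already has some SSO call to wrap (acquireToken below is a placeholder, not a real API): warm the cache at startup and coalesce refreshes so no user-facing request waits for SSO.

      type Token = { value: string; expiresAt: number };

      class TokenCache {
        private current?: Token;
        private refreshing?: Promise<Token>;

        constructor(private readonly acquireToken: () => Promise<Token>) {}

        // Call once during plugin startup so the cache is warm before the first proxy request.
        async prefetch(): Promise<void> {
          this.current = await this.acquireToken();
        }

        async get(): Promise<string> {
          if (this.current && this.current.expiresAt > Date.now() + 60_000) {
            return this.current.value; // still valid for at least a minute
          }
          // Coalesce concurrent refreshes so only one SSO round trip is ever in flight.
          this.refreshing ??= this.acquireToken().finally(() => (this.refreshing = undefined));
          this.current = await this.refreshing;
          return this.current.value;
        }
      }

      Combined with a larger proxy timeout, this should remove the first-request penalty that pagination and detail requests currently hit.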

      Impact

      • Severity: Medium
      • User Experience: Degraded - Users experience timeouts when navigating optimization data
      • Frequency: Intermittent - Affects first requests to pagination/detail endpoints
      • Workaround Available: Yes (wait and retry)

      Additional Notes

      • Main list endpoint (offset=0) consistently works
      • Detail endpoints that initially timeout eventually succeed (200 OK) after token caching
      • Pattern is consistent across different recommendation UUIDs and pagination offsets
      • Issue appears to be plugin/backend configuration limitation, not API availability

