Hybrid Cloud Console / RHCLOUD-43092

[Stage] Intermittent 403 errors in Stage caused by 10s timeout retrieving user entitlements


    • Quality / Stability / Reliability
    • Access & Management Sprint 120, Access & Management Sprint 121

      Cost Management is seeing intermittent 403 responses in the Stage environment when retrieving user entitlements through the Gateway. This issue has been observed in both Cost Management API test automation and UI automation, and it causes test setup failures and inconsistent results.

      The error occurs randomly: most requests succeed, but a subset fails with:

      HTTP response body: {"errors":[{"detail":"Unable to retrieve user entitlements in the stage environment","meta":{"response_by":"gateway"},"status":403}]}

      The UI test automation uses an org admin user (the tester's own account), which should have access to everything. The API test automation uses a service account with the "Cloud administrator", "Cost administrator", and "User Access administrator" roles. The entitlements errors are observed in both API and UI automation, that is, with both the service account and the org admin user.

      Based on analysis of the Kibana logs, all failing requests show "entitlements_time_taken": 10, which suggests a 10-second timeout when Gateway communicates with the Entitlements service. Successful requests typically show either "entitlements_cache_hit": true or an "entitlements_time_taken" value below 1.

      Example log excerpt:

      "status":"403",
      "response_by":"gateway",
      "entitlements_cache_hit":"false",
      "entitlements_time_taken":10,
      "authorization_forwarded":"false",
      "request":"GET /api/cost-management/v1/organizations/aws/"

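The pattern described above (failures at exactly the timeout value, successes served from cache or in under a second) can be checked mechanically when triaging exported log records. The following is a minimal sketch, not the Gateway's own logic: the function name and the category labels are hypothetical, and it assumes records are available as JSON objects with the same field names as the excerpt.

```python
import json

# Assumption: the ticket reports the Gateway-to-Entitlements timeout as 10 seconds.
GATEWAY_TIMEOUT_SECONDS = 10


def classify_entitlements_record(record: dict) -> str:
    """Label one Gateway log record as 'cache_hit', 'timeout', 'lookup', or 'other'."""
    # The excerpt logs booleans as strings ("false"), so accept both forms.
    cache_hit = str(record.get("entitlements_cache_hit", "")).lower() == "true"
    if cache_hit:
        return "cache_hit"
    time_taken = record.get("entitlements_time_taken")
    if time_taken is None:
        return "other"
    status = str(record.get("status", ""))
    # A 403 whose lookup ran for the full timeout matches the failure signature.
    if status == "403" and float(time_taken) >= GATEWAY_TIMEOUT_SECONDS:
        return "timeout"
    return "lookup"


# The log excerpt from this ticket, wrapped in braces to form a JSON object.
sample = json.loads(
    '{"status":"403","response_by":"gateway","entitlements_cache_hit":"false",'
    '"entitlements_time_taken":10,"authorization_forwarded":"false",'
    '"request":"GET /api/cost-management/v1/organizations/aws/"}'
)
print(classify_entitlements_record(sample))  # prints "timeout"
```

Counting the labels over a time window gives a rough timeout rate to compare before and after any change on the Entitlements side.
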
      Impact:

      • Causes automated test jobs (API + UI) to fail intermittently.
      • Impacts Cost Management QE using Stage for validation.
      • Error rate is low overall but disruptive to automation reliability.

      Findings so far:

      • Affected users are org admins with valid entitlements.
      • Not related to subscription changes or insights-qa org accounts.
      • Confirmed by multiple users and visible in Kibana logs.
      • Gateway timeout to Entitlements service is set to 10s.
      • Appears to occur randomly, not tied to request frequency or caching.

      References:

      • Slack thread
      • Kibana examples
      • Example job failure

      Requested Action:
      Investigate intermittent timeouts between Gateway and Entitlements service in Stage. Determine if the 10s timeout threshold is too low or if there are performance or caching issues on the Entitlements side.

      Notes:

      • Cost Management team is adding retries as a temporary workaround.
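
One way such a retry workaround could look is sketched below. This is not the Cost Management team's implementation: the function names, the attempt count, and the backoff values are hypothetical. It retries only when the response is the Gateway-generated entitlements 403 quoted in this ticket, so genuine permission errors still fail immediately.

```python
import json
import time
from typing import Callable, Tuple

# Prefix of the error detail quoted in this ticket.
ENTITLEMENTS_ERROR = "Unable to retrieve user entitlements"


def is_gateway_entitlements_403(status: int, body: str) -> bool:
    """True only for the Gateway-generated entitlements 403 described here."""
    if status != 403:
        return False
    try:
        payload = json.loads(body)
    except ValueError:
        return False
    errors = payload.get("errors") if isinstance(payload, dict) else None
    if not isinstance(errors, list):
        return False
    for err in errors:
        if not isinstance(err, dict):
            continue
        meta = err.get("meta")
        from_gateway = isinstance(meta, dict) and meta.get("response_by") == "gateway"
        if from_gateway and ENTITLEMENTS_ERROR in str(err.get("detail", "")):
            return True
    return False


def call_with_retry(
    send: Callable[[], Tuple[int, str]],
    attempts: int = 3,          # hypothetical default
    base_delay: float = 1.0,    # hypothetical default, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call send() until it stops returning the entitlements 403 or attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        status, body = send()
        if attempt == attempts or not is_gateway_entitlements_403(status, body):
            return status, body
        # Exponential backoff: base_delay, then 2x, 4x, ...
        sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Here `send` is any zero-argument callable that performs the request and returns the status code and response body. Each failed attempt has already waited for the 10s Gateway timeout, so the total wall time grows quickly with the number of attempts.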

        1. image (12).png
          142 kB
          Ashley Morgan

              rh-ee-dagbay Daniel Agbay
              abaiken Ashley Morgan