Red Hat OpenShift AI Engineering / RHOAIENG-924

Unclear errors when the Dashboard receives 504 Timeout errors from the APIServer

    • Low

      Description of problem:

      As part of the large-scale testing (2000 users), we observed that when many users cannot get a notebook (because the cluster is undersized), the Dashboard shows various errors that point to an overloaded control plane.

      Actual results:

      (no message in the console, which confuses me)

      {'level': 'SEVERE', 'message': 'https://rhods-dashboard-redhat-ods-applications.apps.odsci-pr665-sutest-1643308726075002880.psap.aws.rhperfscale.org/api/status - Failed to load resource: the server responded with a status of 504 (Gateway Time-out)', 'source': 'network', 'timestamp': 1680637096177}]
      
      {'level': 'SEVERE', 'message': 'https://rhods-dashboard-redhat-ods-applications.apps.odsci-pr665-sutest-1643308726075002880.psap.aws.rhperfscale.org/api/status - Failed to load resource: the server responded with a status of 504 (Gateway Time-out)', 'source': 'network', 'timestamp': 1680637592436}
      
      • stuck waiting for resource list, with these messages in the console:
      [{'level': 'SEVERE', 'message': 'https://rhods-dashboard-redhat-ods-applications.apps.odsci-pr665-sutest-1643308726075002880.psap.aws.rhperfscale.org/ - Failed to load resource: the server responded with a status of 403 (Forbidden)', 'source': 'network', 'timestamp': 1680636423860}, {'level': 'SEVERE', 'message': 'https://rhods-dashboard-redhat-ods-applications.apps.odsci-pr665-sutest-1643308726075002880.psap.aws.rhperfscale.org/app.bundle.js 1:827699 "Error fetching notebook events" wi: Call to /api/v1/namespaces/psapuser1000/events?fieldSelector=involvedObject.kind%3DPod%2CinvolvedObject.uid%3Dd796366f-e847-47eb-9b3b-5ed776e398e5 timed out after 60000ms\n    at https://rhods-dashboard-redhat-ods-applications.apps.odsci-pr665-sutest-1643308726075002880.psap.aws.rhperfscale.org/app.bundle.js:2:121975', 'source': 'console-api', 'timestamp': 1680637037273},
      
      • see also RHODS-7872, no error message shown when the notebook pod cannot be scheduled
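
      To compare runs, the console entries above can be tallied by failure type. This is only a rough sketch: the sample entries are shortened copies of the logs above, the host name `dashboard.example` is a placeholder, and the `summarize` helper is hypothetical rather than part of the test harness.

```python
import re
from collections import Counter

# Matches the status code in the browser's "Failed to load resource" messages.
STATUS_RE = re.compile(r"responded with a status of (\d{3})")
# Matches the "timed out after <N>ms" message logged by the Dashboard bundle.
TIMEOUT_RE = re.compile(r"timed out after (\d+)ms")


def summarize(entries):
    """Count HTTP statuses and client-side timeouts in browser log entries."""
    summary = Counter()
    for entry in entries:
        message = entry.get("message", "")
        status = STATUS_RE.search(message)
        if status:
            summary[f"http_{status.group(1)}"] += 1
        elif TIMEOUT_RE.search(message):
            summary["client_timeout"] += 1
        else:
            summary["other"] += 1
    return summary


# Shortened sample entries; the host name is a placeholder, not the real route.
sample = [
    {"level": "SEVERE", "source": "network",
     "message": "https://dashboard.example/api/status - Failed to load resource: "
                "the server responded with a status of 504 (Gateway Time-out)"},
    {"level": "SEVERE", "source": "network",
     "message": "https://dashboard.example/ - Failed to load resource: "
                "the server responded with a status of 403 (Forbidden)"},
    {"level": "SEVERE", "source": "console-api",
     "message": '"Error fetching notebook events" wi: Call to '
                "/api/v1/namespaces/psapuser1000/events timed out after 60000ms"},
]

print(dict(summarize(sample)))
```

      Keeping gateway errors (504), authorization errors (403), and client-side timeouts in separate buckets makes it easier to see which symptom dominates as the user count grows.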

      Prerequisites (if any, like setup, operators/versions):

      Steps to Reproduce

      working on a reproducer

      Expected results:

      • the dashboard does not overload the APIServer when Pods cannot be scheduled. We need to work on it together to better understand what's happening and how to prevent it
        ==> moved to a dedicated ticket RHODS-7874
      • the dashboard shows better/more user-friendly errors when the APIServer returns 50x error codes
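
      One possible shape for the second expectation is sketched below. It is written in Python only for brevity and shows the mapping logic, not the Dashboard's actual code. The `friendly_error` function and the message wording are hypothetical.

```python
def friendly_error(status: int, resource: str = "this page") -> str:
    """Return a user-facing message for an HTTP status from the API server."""
    if status == 504:
        # Gateway timeout: the request never got an answer in time.
        return (f"The cluster took too long to respond while loading {resource}. "
                "It may be overloaded; please retry in a few minutes.")
    if status == 503:
        # Service unavailable: the API server is refusing or shedding requests.
        return (f"The cluster API is temporarily unavailable, so {resource} "
                "could not be loaded. Please retry in a few minutes.")
    if 500 <= status <= 599:
        # Any other server-side error: show the code for the administrator.
        return (f"The cluster reported an internal error ({status}) while "
                f"loading {resource}. Contact your administrator if it persists.")
    return f"Unexpected response ({status}) while loading {resource}."


print(friendly_error(504, "the notebook status"))
```

      The point of the sketch is that a 50x response is translated into a visible, actionable message instead of surfacing only in the browser console.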

      Reproducibility (Always/Intermittent/Only Once):

      Build Details:

      Workaround:

      Additional info:

            Unassigned Unassigned
            kpouget2 Kevin Pouget
            RHOAI Dashboard