    • Type: Story
    • Resolution: Done
    • Priority: Critical
    • OCPSTRAT-220 - Console: Customer Happiness (RFEs) for 4.14
    • HAC Infra OCP - Sprint 240

      Based on https://issues.redhat.com/browse/RFE-3775, we should extend our proxy package timeout to match the browser's timeout, which is 5 minutes.

      AC: Bump the 30-second timeout in the proxy pkg to 5 minutes.
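
      A minimal Go sketch of the kind of change the AC describes, using a generic httputil.ReverseProxy. The constant and constructor names below are illustrative assumptions, not the console's actual proxy pkg code:

          package proxy

          import (
              "net"
              "net/http"
              "net/http/httputil"
              "net/url"
              "time"
          )

          // Raise the proxy ceiling from 30 seconds to 5 minutes so the console
          // backend never gives up before the browser (~5 minutes) does.
          const proxyTimeout = 5 * time.Minute // previously 30 * time.Second

          // newReverseProxy is an illustrative constructor, not the real one.
          func newReverseProxy(target *url.URL) *httputil.ReverseProxy {
              p := httputil.NewSingleHostReverseProxy(target)
              p.Transport = &http.Transport{
                  // Give connection establishment up to 5 minutes instead of 30 seconds.
                  DialContext: (&net.Dialer{Timeout: proxyTimeout}).DialContext,
                  // ResponseHeaderTimeout is deliberately left at its zero value,
                  // which means no limit on waiting for the upstream response.
              }
              return p
          }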

            [CONSOLE-3624] Increase proxy timeout

            On a cluster without the pr12781 code, go to the Logs page; the histogram for 2 days shows up:

            The US could be closed now.


            Prepared a cluster without the code from pr12781; will show the result after a long enough time for logs to accumulate.


            Jon Jackson added a comment - edited

            yanpzhan1 anli@redhat.com Can you test without the changes from the console backend PR?  I was unable to reproduce a case where a request was timed out prematurely by the console backend without those changes, so I think they may not be necessary. I also did some further investigation:

            1. It's unlikely request contexts are canceled by the console backend:
              The only contexts we manually create are Background or TODO contexts, which have no deadlines. This means these contexts will never cancel due to timeouts. We never manually cancel request contexts.
            2. It's unlikely that the reverse proxy is timing out requests:
              The Transport.DialContext property determines how long the dialer will wait for a connection, not a response. See the documentation here. The key point is: "Once successfully connected, any expiration of the context will not affect the connection."
              There is a Transport.ResponseHeaderTimeout property that would affect the response timeout, but we don't set this property on our reverse proxy. It defaults to 0, which means our reverse proxy will not time out waiting for a response.
            3. It's unlikely that the console server is timing out requests:
              As discussed before, http.Server doesn't have a write timeout unless one is explicitly set, and we do not have this property configured.
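
            A minimal Go sketch of the three standard-library settings above (not the console's actual code), showing that their defaults impose no 30-second cutoff:

                package main

                import (
                    "context"
                    "fmt"
                    "net/http"
                    "net/http/httputil"
                    "net/url"
                )

                func main() {
                    // 1. Background/TODO contexts carry no deadline, so they never expire on their own.
                    _, hasDeadline := context.Background().Deadline()
                    fmt.Println("Background context has a deadline:", hasDeadline) // false

                    // 2. A Transport that leaves ResponseHeaderTimeout at its zero value
                    // waits indefinitely for the upstream's response headers.
                    target, _ := url.Parse("http://upstream.internal") // placeholder URL
                    proxy := httputil.NewSingleHostReverseProxy(target)
                    proxy.Transport = &http.Transport{} // ResponseHeaderTimeout: 0 => no limit

                    // 3. An http.Server with WriteTimeout left at zero never times out
                    // while writing a response.
                    srv := &http.Server{Addr: ":8080", Handler: proxy} // WriteTimeout: 0 => no limit
                    _ = srv
                }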


            Checked on the cluster containing pr12781; now the histogram for 2 days shows up:


            Jon Jackson added a comment - edited

            Okay, today I was able to create an end-to-end test case with a console dynamic plugin requesting a console-proxied service. The service was just a simple server with a 45-second response delay. The front end of the plugin is just making a request with the "consoleFetchText" function from the dynamic plugin SDK. Here is what I found:

            • Before setting the haproxy timeout annotation on the console route, I would receive a "Gateway Timeout" error after 30 seconds.
            • After setting the haproxy annotation, the request succeeded after 45 seconds. Nothing between the console frontend and backend prevented a request from taking longer than 30 seconds.
            • I extended the delay on the proxied service to 2 minutes and received a "timed out after 60000ms" error, which is the consoleFetch default.
            • I added a 3-minute timeout option to the "consoleFetchText" call on the frontend and this was where it got weird. The proxied service logs show that it sent a response after two minutes, but the frontend timed out after 3 minutes. I have a feeling the browser is dropping the connection.
            • I reduced the delay on the proxied service until I was able to get a successful response, which was at 59 seconds. So Chrome might be dropping the connection after 60 seconds without timing out the request.

            This shows that the 30-second timeout is likely coming from somewhere other than the console reverse proxy or frontend, and the changes to our backend are probably not necessary.
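
            For reference, the delayed test service described above can be approximated in a few lines of Go; this is a sketch of any slow HTTP backend, not the exact service used in the test:

                package main

                import (
                    "fmt"
                    "log"
                    "net/http"
                    "time"
                )

                func main() {
                    // Respond only after a fixed delay so each timeout in the chain
                    // (router, console proxy, consoleFetch default, browser) can be exercised.
                    const delay = 45 * time.Second

                    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
                        time.Sleep(delay)
                        fmt.Fprintf(w, "responded after %s\n", delay)
                    })
                    log.Fatal(http.ListenAndServe(":8080", nil))
                }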


            Jon Jackson added a comment -

            yanpzhan1 anli@redhat.com How can I get a test environment set up to reproduce this? It would be ideal if I could develop against a cluster that has Loki and at least 2 days of logs like the original reproducer.


            Jon Jackson added a comment -

            I was able to test these changes by creating a service on the cluster that accepts a request, delays for 45 seconds, and then responds; adding an endpoint to the console backend that proxies to that service; and finally making a request to that endpoint from the console frontend. I forgot to add the haproxy annotation to the console route at first and was getting a 504 response after 30 seconds. Once I added the haproxy annotation, the requests succeeded after the 45-second delay. I'm also working on creating a reproducer with a dynamic plugin to mimic the original case as closely as possible.
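
            A rough Go sketch of that wiring, assuming a hypothetical /test-proxy endpoint and a placeholder service URL rather than the console's real server code:

                package main

                import (
                    "log"
                    "net/http"
                    "net/http/httputil"
                    "net/url"
                )

                func main() {
                    // Placeholder address for the delayed test service; not a real service name.
                    target, err := url.Parse("http://delay-service.test.svc:8080")
                    if err != nil {
                        log.Fatal(err)
                    }

                    // Proxy /test-proxy/* to the delayed service, then request it from the
                    // frontend and observe which hop (if any) times out first.
                    mux := http.NewServeMux()
                    mux.Handle("/test-proxy/", http.StripPrefix("/test-proxy", httputil.NewSingleHostReverseProxy(target)))

                    log.Fatal(http.ListenAndServe(":9000", mux))
                }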


            Jon Jackson added a comment -

            yanpzhan1 Can you test and attach a HAR file that contains the request details? That might help narrow things down. Also, what version of Loki are you using? It needs to be at least 5.7.3 to include a fix for this bug: https://issues.redhat.com/browse/LOG-3498

            rhn-support-ocasalsa could you also maybe help with this? I'm working on testing, but it takes some work to create a reproducer for this.


            Jon Jackson added a comment -

            I suspect this isn't working as expected because the request is being proxied to a plugin, which doesn't go through the same path as the proxy code I updated. Looking into that now.

            I looked into this and I don't think this is the case. Plugins use the same proxy that I updated with an extended timeout: https://github.com/openshift/console/blob/6b0fe90f67a47cdc126985206a4f71228dc955fe/pkg/server/server.go#L624C1-L625C1


            Jon Jackson added a comment -

            I suspect this isn't working as expected because the request is being proxied to a plugin, which doesn't go through the same path as the proxy code I updated. Looking into that now.


              Jon Jackson (rh-ee-jonjacks)
              Jakub Hadvig (jhadvig@redhat.com)
              Yanping Zhang