Distributed Tracing / TRACING-4511

Tempo: Jaeger UI queries fail with a 504 Gateway Time-out depending on result size and number of traces sent to TempoStack.


    • Tracing Sprint # 258, Tracing Sprint # 260
    • Important

      tempoVersion: 2.4.1 for both customer cases I've seen so far.

      We seem to be hitting a limitation in Tempo with regard to Jaeger UI querying and the number of results it can handle.

      Unless the customer sets the Lookback and Limit Results query parameters to low values (Limit Results less than 20), a 504 is returned, as seen in the tempo-query logs:

       

      omg logs tempo-tempostack-query-frontend-8568765fcc-jvg6q -c tempo-query
      
      2024-07-26T07:59:11.561305708Z {"level":"error","ts":1721980751.561069,"caller":"app/http_handler.go:504","msg":"HTTP handler, Internal Server Error","error":"stream error: rpc error: code = Canceled desc = context canceled","stacktrace":"github.com/jaegertracing/jaeger/cmd/query/app.(*APIHandler).handleError\n\t/remote-source/jaeger/app/cmd/query/app/http_handler.go:504\ngithub.com/jaegertracing/jaeger/cmd/query/app.(*APIHandler).search\n\t/remote-source/jaeger/app/cmd/query/app/http_handler.go:259\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/lib/golang/src/net/http/server.go:2166\ngithub.com/jaegertracing/jaeger/cmd/query/app.(*APIHandler).handleFunc.traceResponseHandler.func2\n\t/remote-source/jaeger/app/cmd/query/app/http_handler.go:548\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/lib/golang/src/net/http/server.go:2166\ngo.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.WithRouteTag.func1\n\t/remote-source/jaeger/deps/gomod/pkg/mod/go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp@v0.51.0/handler.go:249\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/lib/golang/src/net/http/server.go:2166\ngo.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.(*middleware).serveHTTP\n\t/remote-source/jaeger/deps/gomod/pkg/mod/go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp@v0.51.0/handler.go:212\ngo.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.NewMiddleware.func1.1\n\t/remote-source/jaeger/deps/gomod/pkg/mod/go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp@v0.51.0/handler.go:73\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/lib/golang/src/net/http/server.go:2166\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/lib/golang/src/net/http/server.go:2166\ngithub.com/gorilla/mux.(*Router).ServeHTTP\n\t/remote-source/jaeger/deps/gomod/pkg/mod/github.com/gorilla/mux@v1.8.1/mux.go:212\ngithub.com/jaegertracing/jaeger/cmd/query/app.createHTTPServer.additionalHeadersHandler.func4\n\t/remote-source/jaeger/app/cmd/query/app/additional_headers_handler.go:28\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/lib/golang/src/net/http/server.go:2166\ngithub.com/jaegertracing/jaeger/cmd/query/app.createHTTPServer.PropagationHandler.func5\n\t/remote-source/jaeger/app/pkg/bearertoken/http.go:52\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/lib/golang/src/net/http/server.go:2166\ngithub.com/jaegertracing/jaeger/cmd/query/app.createHTTPServer.CompressHandler.CompressHandlerLevel.func6\n\t/remote-source/jaeger/deps/gomod/pkg/mod/github.com/gorilla/handlers@v1.5.1/compress.go:141\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/lib/golang/src/net/http/server.go:2166\ngithub.com/gorilla/handlers.recoveryHandler.ServeHTTP\n\t/remote-source/jaeger/deps/gomod/pkg/mod/github.com/gorilla/handlers@v1.5.1/recovery.go:78\nnet/http.serverHandler.ServeHTTP\n\t/usr/lib/golang/src/net/http/server.go:3137\nnet/http.(*conn).serve\n\t/usr/lib/golang/src/net/http/server.go:2039"} 

       

       

      We've tried multiple options with the TempoStack CRD (sketched below), including:

      https://github.com/grafana/tempo-operator/blob/main/docs/spec/tempo.grafana.com_tempostacks.yaml

      1. Increasing replicas for the query-frontend, querier, and ingester
      2. Removing any resource limits or requests so the components can use as much as they need
      3. Adjusting the defaultResultLimit value, e.g. increasing it
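
      For reference, the kind of spec changes we tried looked roughly like the fragment below (values are illustrative, not the exact ones from the cases):

      # Illustrative TempoStack spec fragment showing the knobs we changed
      spec:
        resources: {}                 # left empty: no explicit requests/limits imposed
        search:
          defaultResultLimit: 50      # raised for testing (value illustrative)
        template:
          queryFrontend:
            component:
              replicas: 2             # scaled out
          querier:
            replicas: 3               # scaled out
          ingester:
            replicas: 3               # scaled out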

      None have had any effect. All of the Tempo components are working properly and no errors are reported about ingestion or the pushing of traces. I will note that both customer cases I see this in are using OTEL to push the traces.

      Neither of these clusters is large; one of them has only 6 nodes.

      Both of these clusters are using ODF with S3 functionality. 

      Here is a TempoStack resource spec from one customer:

      spec:
        observability:
          grafana:
            instanceSelector: {}
          metrics: {}
          tracing:
            jaeger_agent_endpoint: 'localhost:6831'
        resources: {}
        search:
          defaultResultLimit: 20
          maxDuration: 0s
        managementState: Managed
        limits:
          global:
            ingestion: {}
            query:
              maxSearchDuration: 0s
        serviceAccount: tempo-tempostack
        images: {}
        template:
          compactor:
            replicas: 2
          distributor:
            component:
              replicas: 2
            tls:
              enabled: false
          gateway:
            component: {}
            enabled: false
            ingress:
              route: {}
          ingester:
            replicas: 3
          querier:
            replicas: 3
          queryFrontend:
            component:
              replicas: 2
            jaegerQuery:
              enabled: true
              ingress:
                route:
                  termination: edge
                type: route
              monitorTab:
                enabled: false
                prometheusEndpoint: ''
              servicesQueryDuration: 72h0m0s
        replicationFactor: 1
        storage:
          secret:
            name: tempo-bucket
            type: s3
          tls:
            enabled: false
        storageSize: 10Gi
        hashRing:
          memberlist: {}
        retention:
          global:
            traces: 48h0m0s 
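
      Pulling the query-related knobs out of that spec for quick reference (the trailing comments are my reading of the CRD docs, so treat them as assumptions rather than confirmed behaviour):

      spec:
        search:
          defaultResultLimit: 20      # default number of search results the Jaeger UI requests
          maxDuration: 0s             # maximum allowed search time range; 0s seems to mean no explicit cap
        limits:
          global:
            query:
              maxSearchDuration: 0s   # global query limit; when unset/0s, spec.search.maxDuration applies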

      Either way, the key takeaway is that if we go above a certain number of results we get a 504 timeout after 30 seconds. This implies either a limitation or that the TempoStack can't return the results in time, but these clusters are not large.

      Wondering if we're just hitting a limit with Tempo or some sort of bug, and how we can continue to troubleshoot this, as we have exhausted the options we can configure via the TempoStack. Would increasing the timeout on the query-frontend route that's created help?
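
      On that last point: the 30-second cutoff matches the default HAProxy router timeout on OpenShift, so one experiment could be to raise the timeout on the Jaeger UI / query-frontend route with the standard router annotation. A minimal sketch, assuming the operator-generated route is named tempo-tempostack-query-frontend and does not get reconciled back (both assumptions on my part):

      apiVersion: route.openshift.io/v1
      kind: Route
      metadata:
        name: tempo-tempostack-query-frontend         # adjust to the actual route name in the namespace
        annotations:
          haproxy.router.openshift.io/timeout: 2m     # proxy timeout for this route; the router default is 30s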

      Please let me know if there are any metrics we can gather, and if there's anything more we can adjust, even unsupported options, to test with.

      I'll attach the cases via Jira.

       

              ploffay@redhat.com Pavol Loffay
              rhn-support-cruhm Courtney Ruhm