• Analytics MVP

      It enhances the Distributed Tracing experience on OpenShift and gives our customers a better understanding of their traces.

      Relevance and Importance

      Today, the usage of Jaeger is limited to collecting traces and presenting the traditional trace-view diagram. Arguably, this delivers little perceived value for what some consider a big investment. Incrementally building analytics capabilities based on early feedback would help Red Hat steer the distributed tracing effort and unleash the full potential of traces for helping SREs.

      A. MVP Strategy

      In order to enable a quick feedback loop with users, customers, and/or internal teams to whom we show the analytics capabilities, it would be desirable to rely on an existing interactive interface, such as Jupyter Notebooks. Following this approach, we would implement the analytics supporting our initial key use cases as functions in a Notebook. These functions would serve as entry points for the key use cases we will support. Critically, they would process traces stored on the Jaeger backend. This way, the demos will involve both the actual Jaeger UI and the Notebook with analytics.
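
      To make this concrete, a Notebook entry point could start as small as the Python sketch below. It assumes the Jaeger Query service is reachable at a known URL (for example, port-forwarded from OpenShift) and calls the /api/traces HTTP endpoint that backs the Jaeger UI; that endpoint is internal rather than a stable contract, so this is a sketch under those assumptions, not a final design.

      import requests

      # Assumption: the Jaeger Query service is port-forwarded to localhost.
      JAEGER_QUERY_URL = "http://localhost:16686"

      def fetch_traces(service, start_us, end_us, limit=100):
          """Fetch raw traces for `service` in [start_us, end_us] (epoch microseconds)."""
          resp = requests.get(
              f"{JAEGER_QUERY_URL}/api/traces",
              params={"service": service, "start": start_us,
                      "end": end_us, "limit": limit},
          )
          resp.raise_for_status()
          # Each trace in "data" carries a "spans" list and a "processes" map.
          return resp.json()["data"]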


      When demonstrating to the target audience, we can explain that the function invocations they see would correspond to a UI flow yet to be designed. Similarly, each invoked function should produce useful visualizations in the Notebook. Again, as the capabilities are demonstrated, we can point out that the visualizations embedded in the Notebook could later be incorporated into one or more UI views yet to be designed.


      This approach would be agile. Not only will it enable us to get early feedback on the perceived value of the analytics capabilities, but it will also encourage user feedback on the UI views prior to or during their design.


      B. Initial Use Cases


      Below is a list of initial use cases we can aim for.

      B.1 Service-centric troubleshooting


      Given a time interval and the name of a service that is problematic, show to the user:

      1. Representative trace aggregates, including latency histograms and error rates for each span, and counts of retries/timeouts (see the sketch after this list).
      2. The critical path (the longest chain of spans) in each representative aggregate.
      3. The tag that most strongly correlates with the critical path (and the corresponding correlation coefficient), which should help "explain" the critical path. Finding which tag correlates with the critical path might help identify, for instance, that a slowdown occurs when a particular service version is invoked or when the call comes from a particular server.
      4. The histogram of calls made by the chosen service to its downstream services, highlighting how frequently each downstream service is called and statistics on durations and errors.
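
      As a rough illustration of item 1, the Python sketch below computes per-operation latency statistics and error rates from the traces returned by fetch_traces() above. The span layout (durations in microseconds, failures marked via the error=true tag convention) follows Jaeger's JSON format; which statistics to surface, and how to count retries/timeouts, are open choices, so treat this as a sketch.

      from collections import defaultdict
      import statistics

      def span_aggregates(traces):
          """Per-operation duration statistics and error rates (B.1, item 1)."""
          durations = defaultdict(list)  # operation name -> durations (us)
          errors = defaultdict(int)      # operation name -> error count

          for trace in traces:
              for span in trace["spans"]:
                  op = span["operationName"]
                  durations[op].append(span["duration"])
                  # Jaeger convention: failed spans carry an `error=true` tag.
                  if any(t["key"] == "error" and t["value"] in (True, "true")
                         for t in span.get("tags", [])):
                      errors[op] += 1

          return {
              op: {
                  "count": len(ds),
                  "p50_us": statistics.median(ds),
                  "max_us": max(ds),
                  "error_rate": errors[op] / len(ds),
              }
              for op, ds in durations.items()
          }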


      B.2 Trace aggregation (group-by operations)


      Given a time interval and a group-by parameter, show to the user:

      1. Trace aggregates, including histograms for latency and error rates for each span, and counts of retries/timeouts.

      We can choose an initial group-by operation to start with and incrementally add aggregations based on a variety of parameters.
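
      As a sketch of what the group-by could look like, the function below partitions traces by the values of a chosen span tag and reuses span_aggregates() from B.1 on each group. The tag key is an illustrative parameter, not a committed design; any span tag (an HTTP method, a version label) would work the same way.

      from collections import defaultdict

      def group_traces_by_tag(traces, tag_key):
          """Group traces by each value of `tag_key` and aggregate per group."""
          groups = defaultdict(list)
          for trace in traces:
              values = {t["value"] for span in trace["spans"]
                        for t in span.get("tags", []) if t["key"] == tag_key}
              # A trace carrying several values for the tag lands in each group.
              for value in values or {"<missing>"}:
                  groups[value].append(trace)
          return {value: span_aggregates(ts) for value, ts in groups.items()}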


      B.3 Trace aggregation diff 


      Given two aggregates produced as in B.2, highlight key differences between them. The differences might include significant changes in (1) latency distributions, (2) error rates, and (3) graph shape.
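
      A minimal sketch of such a diff, again building on span_aggregates(): it flags operations whose median latency or error rate moved beyond an illustrative threshold, and treats operations present in only one aggregate as a crude proxy for graph-shape changes. The thresholds are placeholders; real cutoffs (or proper statistical tests) would come out of the feedback loop.

      def diff_aggregates(before, after, latency_delta=0.25, error_delta=0.05):
          """Highlight key differences between two span_aggregates() results."""
          findings = []
          for op in before.keys() & after.keys():
              b, a = before[op], after[op]
              # (1) relative change in median latency beyond the threshold
              if b["p50_us"] and abs(a["p50_us"] - b["p50_us"]) / b["p50_us"] > latency_delta:
                  findings.append((op, "latency", b["p50_us"], a["p50_us"]))
              # (2) absolute change in error rate beyond the threshold
              if abs(a["error_rate"] - b["error_rate"]) > error_delta:
                  findings.append((op, "error_rate", b["error_rate"], a["error_rate"]))
          # (3) crude graph-shape signal: operations seen in only one aggregate
          for op in before.keys() ^ after.keys():
              findings.append((op, "appeared/disappeared", op in before, op in after))
          return findings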


      C. Timeline for MVP

      Following the agile methodology, as a team we should choose one use case to start with so that we can have an end-to-end implementation via Notebook. As we get feedback on the first use case, we move on to the next. As feedback accumulates, starting work on the actual UI design might be desirable.

      We can plan to have at least the first Notebook-based use case (chosen as a team) by August 1st, 2021. Use cases implemented after that should move relatively quickly, capitalizing on all initial efforts to both generate test traces and put together a foundational development and testbed environment. Note that B.1 items (1)-(4) can be thought of as individual use cases.


      How does this align with the strategy?

      • It focuses on enhancing the capabilities of the backend.
      • This can be a good attempt to bring "smart" capabilities to Distributed Tracing and collect feedback from our customers.
      • Analytics capabilities will be incredibly important for a potential Hosted Distributed Tracing service and a major differentiating capability.
