Network Observability / NETOBSERV-1740

Discrepancies with drop queries between Loki & Prometheus

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Normal
    • Loki
    • Sprint: NetObserv - Sprint 257, NetObserv - Sprint 258, NetObserv - Sprint 259, NetObserv - Sprint 260, NetObserv - Sprint 261, NetObserv - Sprint 262, NetObserv - Sprint 263, NetObserv - Sprint 264, NetObserv - Sprint 265

      The way metrics are queried differs between Loki and Prometheus when a filter is set on a drops-related label, such as the drop cause, and the two backends return different results.

      For instance, say we have a query with the filter DROP CAUSE=TCP_INVALID_SEQUENCE, and the "Drops filter" query option set to "All":

      • With Loki, the query returns the bytes/packets counters of all flows that have some TCP_INVALID_SEQUENCE drops, including the bytes/packets that were not dropped (unless the user explicitly asks for dropped bytes/packets).
      • With Prometheus, the query returns only the dropped bytes/packets counters with cause TCP_INVALID_SEQUENCE.

      As a result, Prometheus generally shows lower values than Loki.


      Second example: say we have a query with the filter DROP CAUSE!=TCP_INVALID_SEQUENCE (not equals), and the "Drops filter" query option set to "All":

      • With Loki, the query returns the bytes/packets counters of all flows that have drops not caused by TCP_INVALID_SEQUENCE. For some reason, it does not include flows that have no drops at all.
      • With Prometheus, the query returns the dropped bytes/packets counters whose cause is not TCP_INVALID_SEQUENCE.

      In this example, both Loki and Prometheus look wrong, because both ignore flows without drops.
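The two behaviors described above can be modeled with the following Python sketch. It is not NetObserv code: the `Flow` record, `loki_like`, `prometheus_like` and the made-up drop cause `OTHER_CAUSE` are hypothetical, and each function only mimics the results reported in this issue, not how the actual Loki or Prometheus queries are built.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Flow:
    packets: int                      # all packets of the flow
    drop_packets: int = 0             # packets of the flow that were dropped
    drop_cause: Optional[str] = None  # None: the flow has no drops


Match = Callable[[str], bool]


def loki_like(flows: list[Flow], match: Match, metric: str = "packets") -> int:
    """Hypothetical model of the Loki behavior reported in this issue.

    Only flows that carry a drop cause can match (flows without drops are
    skipped, even for "!=" filters), and the chosen counter of every matching
    flow is summed: all packets by default, or drop_packets when asked.
    """
    return sum(getattr(f, metric) for f in flows
               if f.drop_cause is not None and match(f.drop_cause))


def prometheus_like(flows: list[Flow], match: Match) -> int:
    """Hypothetical model of the Prometheus behavior reported in this issue.

    Only dropped packets are stored with a cause label (one series per cause),
    so traffic that was not dropped is never counted, whatever the filter.
    """
    series: dict[str, int] = {}
    for f in flows:
        if f.drop_cause is not None:
            series[f.drop_cause] = series.get(f.drop_cause, 0) + f.drop_packets
    return sum(v for cause, v in series.items() if match(cause))


def is_tcp_invalid_seq(cause: str) -> bool:
    return cause == "TCP_INVALID_SEQUENCE"


def not_tcp_invalid_seq(cause: str) -> bool:
    return not is_tcp_invalid_seq(cause)


# Hypothetical flows; "OTHER_CAUSE" is a made-up drop cause.
flows = [
    Flow(packets=50, drop_packets=4, drop_cause="TCP_INVALID_SEQUENCE"),
    Flow(packets=20, drop_packets=2, drop_cause="OTHER_CAUSE"),
    Flow(packets=30),
]
print(loki_like(flows, is_tcp_invalid_seq), prometheus_like(flows, is_tcp_invalid_seq))
print(loki_like(flows, not_tcp_invalid_seq), prometheus_like(flows, not_tcp_invalid_seq))
```

With these hypothetical flows, the equality filter gives 50 (Loki) versus 4 (Prometheus), and the not-equals filter gives 20 versus 2. In the second case neither model counts the 30 packets of the flow without drops.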


      To help reason about this issue, here is a more concrete example. Imagine we have only these three flows:

      FLOW 1 {X => Y, pkt: 30, dropPkt: 10, dropCause: CAUSE_A}
      FLOW 2 {X => Y, pkt: 40, dropPkt: 5, dropCause: CAUSE_B}
      FLOW 3 {X => Y, pkt: 30, (no drops)}
      

      Flows 1 and 2 contain a mix of dropped and non-dropped packets, while flow 3 has no drops at all.

      Translated into pseudo-metrics, here's what we get:

      metric_drops_packets{src=X,dst=Y,cause=CAUSE_A}: 10
      metric_drops_packets{src=X,dst=Y,cause=CAUSE_B}: 5
      metric_packets{src=X,dst=Y}: 100
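A minimal Python sketch of this translation, assuming flows are plain dictionaries (the field names and `to_pseudo_metrics` are hypothetical). It makes one point visible: in this model the drop cause exists only as a label of the drop metric, while metric_packets carries no cause at all, so a cause filter applied to metrics can only ever select dropped packets.

```python
from collections import defaultdict

# The three flows from the example; drop_cause None means the flow has no drops.
FLOWS = [
    {"src": "X", "dst": "Y", "pkt": 30, "drop_pkt": 10, "drop_cause": "CAUSE_A"},
    {"src": "X", "dst": "Y", "pkt": 40, "drop_pkt": 5, "drop_cause": "CAUSE_B"},
    {"src": "X", "dst": "Y", "pkt": 30, "drop_pkt": 0, "drop_cause": None},
]


def to_pseudo_metrics(flows):
    """Aggregate flow records into the two pseudo-metrics used in this issue."""
    drops = defaultdict(int)    # (src, dst, cause) -> dropped packets
    packets = defaultdict(int)  # (src, dst) -> all packets, no cause label
    for f in flows:
        packets[(f["src"], f["dst"])] += f["pkt"]
        if f["drop_cause"] is not None:
            drops[(f["src"], f["dst"], f["drop_cause"])] += f["drop_pkt"]
    return dict(drops), dict(packets)


drops, packets = to_pseudo_metrics(FLOWS)
print(drops)    # {('X', 'Y', 'CAUSE_A'): 10, ('X', 'Y', 'CAUSE_B'): 5}
print(packets)  # {('X', 'Y'): 100}
```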

      Now for different queries, here's expectation vs actual:

      Query:            Expected:     Actual Loki:    Actual Prometheus:
      (no filter)       100           100             100
      cause=CAUSE_A     10            30 (or 10)*     10
      cause=CAUSE_*     15            70 (or 15)*     15
      cause!=CAUSE_A    90            40 (or 5)*      5
      cause!=CAUSE_*    85            0 (or 0)*       0
      
      *: Values in parentheses are what Loki returns when dropBytes/Packets is explicitly set as the metric to fetch in the UI options.
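The Expected column can be derived from the pseudo-metrics alone if a negated filter is read as the complement of the positive one: "cause!=P" means all packets except those dropped with a cause matching P. That reading is an assumption inferred from the values in the table. The sketch below (hypothetical names, with Python's `fnmatch` standing in for the CAUSE_* wildcard) checks it against each row.

```python
from fnmatch import fnmatchcase
from typing import Optional

# Pseudo-metrics from the example: dropped packets per cause, and all packets.
DROPS_BY_CAUSE = {"CAUSE_A": 10, "CAUSE_B": 5}
TOTAL_PACKETS = 100


def expected(pattern: Optional[str] = None, negate: bool = False) -> int:
    """Expected packet count for cause=pattern, or cause!=pattern if negate.

    Assumption: "cause != P" is the complement of "cause = P", i.e. all
    packets except those dropped with a cause matching P.
    """
    if pattern is None:  # no filter: every packet counts
        return TOTAL_PACKETS
    matched = sum(v for cause, v in DROPS_BY_CAUSE.items()
                  if fnmatchcase(cause, pattern))
    return TOTAL_PACKETS - matched if negate else matched


# (label shown in the table, cause pattern, negated?)
QUERIES = [
    ("(no filter)", None, False),
    ("cause=CAUSE_A", "CAUSE_A", False),
    ("cause=CAUSE_*", "CAUSE_*", False),
    ("cause!=CAUSE_A", "CAUSE_A", True),
    ("cause!=CAUSE_*", "CAUSE_*", True),
]
for label, pattern, negate in QUERIES:
    print(f"{label:<16}{expected(pattern, negate)}")
```

Running it prints 100, 10, 15, 90 and 85, the values of the Expected column. In this model, a negated filter needs both pseudo-metrics: the matching drop series subtracted from metric_packets.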


      Assignee: Unassigned
      Reporter: Joel Takvorian (jtakvori)
      Votes: 0
      Watchers: 1