  1. Network Observability
  2. NETOBSERV-1148

Mitigation for Loki ResourceExhausted error

    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • netobserv-1.1, netobserv-1.3, netobserv-1.2, netobserv-1.4
    • Loki
    • NetObserv - Sprint 243, NetObserv - Sprint 244

      Loki may return a "ResourceExhausted" error, especially on ingestion, when the batches we send are too big.

      Loki's default gRPC message size limit is around 4 MB.
      Our default batch size in the CRD is 100 KiB (a bit low, but safe), whereas the ALM example uses 10 MB, which puts flow ingestion at risk of triggering this ResourceExhausted error.

      The Loki operator sets a much higher value ([100MB|https://github.com/grafana/loki/blob/main/operator/internal/manifests/internal/config/loki-config.yaml#L422] for all sizes), which is unlikely to be reached.
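
To make the comparison above concrete, here is a minimal sketch that checks each client batch size against each server limit. The figures come from this issue; reading them as binary units (KiB, MiB) and all names in the code are assumptions made for illustration, not actual netobserv or Loki settings.

```python
# Sizes quoted in this issue, in bytes. Binary units are an assumption.
KIB = 1024
MIB = 1024 * KIB

LOKI_DEFAULT_GRPC_LIMIT = 4 * MIB     # "around 4 MB" per the issue
LOKI_OPERATOR_GRPC_LIMIT = 100 * MIB  # value set by the Loki operator

# Client-side batch sizes mentioned in the issue (labels are illustrative).
BATCH_SIZES = {
    "CRD default": 100 * KIB,
    "ALM example": 10 * MIB,
}


def at_risk(batch_size: int, grpc_limit: int) -> bool:
    """Return True when a full batch could reach or exceed the gRPC limit."""
    return batch_size >= grpc_limit


if __name__ == "__main__":
    limits = {
        "Loki default": LOKI_DEFAULT_GRPC_LIMIT,
        "Loki operator": LOKI_OPERATOR_GRPC_LIMIT,
    }
    for name, size in BATCH_SIZES.items():
        for label, limit in limits.items():
            status = "at risk" if at_risk(size, limit) else "ok"
            print(f"{name} ({size} B) vs {label} limit ({limit} B): {status}")
```

Under these assumptions, only the ALM example batch size combined with Loki's default limit is flagged as at risk.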

      We should:
      1. Document this in troubleshooting: users need to configure Loki (server side) and netobserv (client side) so that the two settings match, especially if they are not using the Loki operator.
      2. As part of, or as a follow-up to, NETOBSERV-764: set a default that depends on the Loki installation method, i.e. use a different default batch size for monolithic (such as 3.5MB) and keep 10MB for the others.
      3. Use the same defaults in the CRD and in the ALM example.

      Note that we need a safety margin on the batch size, because actual batches may exceed the configured value.
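
The per-installation default and the margin check could look roughly like the sketch below. The installation-method names, the function names, and the 10% margin are hypothetical and chosen only for illustration; the 3.5MB and 10MB values are the ones proposed in this issue, read here as decimal megabytes (an assumption).

```python
MB = 1_000_000  # decimal megabytes; an assumption about the units in the issue

# Hypothetical installation-method keys; not actual netobserv identifiers.
DEFAULT_BATCH_SIZE = {
    "monolithic": int(3.5 * MB),
}
FALLBACK_BATCH_SIZE = 10 * MB  # kept for every other installation method


def default_batch_size(install_method: str) -> int:
    """Pick a default batch size based on how Loki was installed."""
    return DEFAULT_BATCH_SIZE.get(install_method, FALLBACK_BATCH_SIZE)


def has_margin(batch_size: int, grpc_limit: int, margin: float = 0.1) -> bool:
    """True when batch_size leaves at least `margin` of the limit as headroom.

    The 10% default margin is an arbitrary illustrative value.
    """
    return batch_size <= grpc_limit * (1 - margin)


if __name__ == "__main__":
    loki_default_limit = 4 * 1024 * 1024      # assumed 4 MiB server default
    loki_operator_limit = 100 * 1024 * 1024   # assumed 100 MiB operator value
    print(has_margin(default_batch_size("monolithic"), loki_default_limit))
    print(has_margin(default_batch_size("operator"), loki_operator_limit))
```

The point of the margin parameter is that the configured batch size is only a target, so the check compares it against a reduced limit rather than the limit itself.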

      See the Slack thread: https://redhat-internal.slack.com/archives/C02939DP5L5/p1688571940676889

              Joel Takvorian (jtakvori)
              Nathan Weinberg