Network Observability / NETOBSERV-1470

netobserv-ebpf-agent performance degradation between 1.5 and 1.4.2

    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • netobserv-1.5-candidate
    • Component: eBPF
    • Sprints: NetObserv - Sprint 248, NetObserv - Sprint 249, NetObserv - Sprint 250, NetObserv - Sprint 251, NetObserv - Sprint 252
    • Important

      Background

      While working on the solution to NETOBSERV-1458 on Jan 26, I discovered that eBPF agent memory usage for the 120-node cluster-density-v2 test was 94% higher than in the 1.4.2 baseline run from Nov 21.

      While cluster-density-v2 had not been run since then, we have been running the 25-node node-density-heavy test weekly. The last successful baseline of that test was Jan 8; the Jan 15 run showed stable memory, and the Jan 22 run showed an increase that was fixed by mmahmoud@redhat.com the following day. As such, we had no indicator at this smaller scale of a memory increase of this severity.
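
      The percentage figures used throughout this report (the 94% above and the per-run numbers in the table below) are simple deltas against the 1.4.2 baseline; several rows also note that the run processed more flows, which softens the raw number. A minimal sketch of both the raw comparison and a per-flow normalization, using made-up values rather than the actual run data:

      package main

      import "fmt"

      // pctOver reports how much higher current is than baseline, in percent.
      func pctOver(current, baseline float64) float64 {
          return (current - baseline) / baseline * 100
      }

      func main() {
          // Hypothetical peak eBPF agent memory (MiB) and flows processed;
          // the real figures come from the perf runs described in this report.
          baselineMem, currentMem := 400.0, 900.0
          baselineFlows, currentFlows := 1_000_000.0, 1_600_000.0

          fmt.Printf("raw memory increase:      %.2f%%\n", pctOver(currentMem, baselineMem))
          // Normalizing by flows processed is one way to read the rows that
          // handled substantially more traffic than the baseline run did.
          fmt.Printf("per-flow memory increase: %.2f%%\n",
              pctOver(currentMem/currentFlows, baselineMem/baselineFlows))
      }

      With these hypothetical inputs the raw increase is 125% while the per-flow increase is roughly 41%, which is the sense in which "more flows were processed" makes a row less severe than it looks.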

      Attempted Solutions Thus Far

      Attempt | PR | Result
      Add flag to PacketDrop to account for RHEL 9.3 behavior | netobserv-ebpf-agent/pull/258 | 109.57% increase over 1.4.2
      Removed some Loki labels | network-observability-operator/pull/552 | 125.28% increase over 1.4.2, but 61.70% more flows were processed, so the number is less severe than it looks
      Reduce maxGlobalStreamsPerTenant from 200000 to 150000 | N/A | 88.98% increase over 1.4.2
      Reduce overall number of Loki streams | network-observability-operator/pull/554 | 102.56% increase over 1.4.2, but 71.37% more flows were processed, so the number is less severe than it looks
      Rerun of the above test at the request of mmahmoud@redhat.com | network-observability-operator/pull/554 | 99.06% increase over 1.4.2, but 69.55% more flows were processed, so the number is less severe than it looks
      PR image + eBPF pod memory limit raised from the 800Mi default to 2000Mi | network-observability-operator/pull/554 | 107.47% increase over 1.4.2, but 101.23% more flows were processed
      Run with default settings using bundle 104 | N/A | 107.35% increase over 1.4.2, but 1.50% fewer flows
      Reran 1.4.2 | N/A | 0.79% increase over the original baseline with 1.64% more flows, essentially the same (had issues with FLP auth against Loki, so no flows were written to Loki)
      Run with default settings using bundle 107 | N/A | 130.67% increase over 1.4.2, but only 0.17% more flows
      Bundle 107 rerun with pprof (see the pprof sketch after this table) | N/A | 103.52% increase over 1.4.2, but 0.04% fewer flows
      Lower default kafkaBatchSize, eBPF memory limit of 1600Mi | network-observability-operator/pull/566 | 24.34% decrease in eBPF memory usage, but 133.54% increase in Loki memory usage
      Lower default kafkaBatchSize, eBPF memory limit of 800Mi | network-observability-operator/pull/566 | 25.61% decrease in eBPF memory usage, but 155.27% increase in Loki memory usage
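
      The bundle 107 rerun above was done with pprof enabled so heap profiles could be pulled from the agent. As a point of reference only, and not necessarily how netobserv-ebpf-agent actually wires it up (the port and the enablement mechanism here are assumptions), this is the standard way a Go service exposes pprof endpoints:

      package main

      import (
          "log"
          "net/http"
          _ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
      )

      func main() {
          // Expose pprof on a side port; a heap profile can then be fetched with:
          //   go tool pprof http://<agent-pod-ip>:6060/debug/pprof/heap
          go func() {
              log.Println(http.ListenAndServe("localhost:6060", nil))
          }()

          select {} // stand-in for the agent's real flow-processing loop
      }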

       

            Assignee: Unassigned
            Reporter: Nathan Weinberg (nweinber1)
