Network Observability / NETOBSERV-1954

Agent underestimates traffic volume in stressed scenarios

    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • Fix Version/s: netobserv-1.8
    • Affects Version/s: netobserv-1.5, netobserv-1.6, netobserv-1.7
    • Component/s: eBPF
    • Sprint: NetObserv - Sprint 261, NetObserv - Sprint 262

      Under stress, the agent reports a much lower workload throughput than the cAdvisor metrics do.
      While there can be different reasons for that (dropped flows, deduplication, ...), I found a case that actually looks like a bug in the agent code: https://github.com/netobserv/netobserv-ebpf-agent/blob/main/pkg/flow/tracer_map.go#L149-L154

      Flows are ignored on purpose when they are older than a past eviction timestamp. But this logic is flawed, because the flow lookup from the map is neither batched nor atomic: each flow is read and deleted one by one. Each flow should therefore have its own last eviction timestamp, whereas using a single timestamp, as is done today, ends up eliminating too many flows.
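
      To illustrate the problem described above, here is a minimal Go sketch. It is not the agent's code: the flow type, the evictWithGlobalCutoff and evictAll helpers, and all timestamps and byte counts are hypothetical, chosen only to show how a single cutoff shared by all flows can discard legitimate traffic when entries are read and deleted one at a time.

```go
package main

import "fmt"

// flow is a hypothetical, simplified stand-in for one per-flow map entry.
type flow struct {
	key   string
	endNs uint64 // monotonic timestamp of the last packet seen for this entry
	bytes uint64 // bytes accumulated in this entry
}

// evictWithGlobalCutoff mimics filtering on a single shared eviction timestamp:
// any entry whose endNs is not newer than cutoff is treated as stale and dropped.
// It returns the bytes that were kept and the cutoff to use for the next pass
// (the latest endNs seen so far).
func evictWithGlobalCutoff(entries []flow, cutoff uint64) (kept, next uint64) {
	next = cutoff
	for _, f := range entries {
		if f.endNs <= cutoff {
			continue // assumed to be a leftover from a previous pass
		}
		kept += f.bytes
		if f.endNs > next {
			next = f.endNs
		}
	}
	return kept, next
}

// evictAll keeps every entry, which corresponds to removing the timestamp check.
func evictAll(entries []flow) (kept uint64) {
	for _, f := range entries {
		kept += f.bytes
	}
	return kept
}

func main() {
	// Pass 1: A is read and deleted first (last packet at t=100),
	// B is read and deleted later (last packet at t=200).
	pass1 := []flow{
		{key: "A", endNs: 100, bytes: 1000},
		{key: "B", endNs: 200, bytes: 1000},
	}
	// While pass 1 was still busy with B, A received new packets at t=150,
	// creating a fresh entry that will only be seen during pass 2.
	pass2 := []flow{
		{key: "A", endNs: 150, bytes: 500},
		{key: "B", endNs: 300, bytes: 1000},
	}

	k1, cutoff := evictWithGlobalCutoff(pass1, 0)
	k2, _ := evictWithGlobalCutoff(pass2, cutoff)
	fmt.Println("with shared cutoff:", k1+k2)

	fmt.Println("without cutoff:    ", evictAll(pass1)+evictAll(pass2))
}
```

      In this toy scenario the shared cutoff becomes 200 after the first pass, so the new entry for A (t=150) is discarded in the second pass even though it carries traffic that was never reported. A per-flow cutoff, or no cutoff at all, would keep it.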

      I'm also questioning whether the assertion "eBPF hashmap values are not zeroed when the entry is removed" is actually true. Maybe we can just remove those lines.

      FWIW, testing with hey-ho:

      ./hey-ho.sh -r 3 -d 3 -z 10m -n 2 -q 50 -p -b -y 

      and comparing metrics with:

      sum(rate(netobserv_workload_ingress_bytes_total{DstK8S_Namespace=~"gallery.*",DstK8S_OwnerType="Deployment",SrcK8S_OwnerType="Deployment"}[2m]))
      
      sum(rate(container_network_receive_bytes_total{job="kubelet", metrics_path="/metrics/cadvisor", cluster="", namespace=~"gallery.+"}[2m])) 

      When these 3 lines are removed, the two metrics finally align perfectly.

              Joel Takvorian (jtakvori)
              Mehul Modi