-
Story
-
Resolution: Done
-
Major
-
OCPSTRAT-1207 - Improve Network Observability Operator performance with latest eBPF enhancements (bpfman, Tcx hook latest kernel & RHEL9.4)
-
NetObserv - Sprint 251, NetObserv - Sprint 252
I don't have a reproducer here; this is a purely theoretical issue, found by reading the code.
Our BPF program works, roughly, by trying to update the flow maps when new packets are received and, if that fails, sending the one-packet flow to user space via ringbuf.
However, as commented here, that doesn't work when the map update failure occurs on an already existing flow: https://github.com/netobserv/netobserv-ebpf-agent/blob/main/bpf/flows.c#L94-L96
In this case, the packet is just dropped, which leads to under-estimated metrics (bps, pps, ...).
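To make the three possible outcomes explicit, here is a minimal plain-C model of the per-packet decision described above. It is only a sketch that runs in user space, not the actual BPF code: the type and function names (flow_metrics, packet_outcome, handle_packet) are hypothetical, and the map update result is passed in as a boolean rather than coming from a real BPF map.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the per-flow counters. */
struct flow_metrics {
    uint64_t packets;
    uint64_t bytes;
};

enum packet_outcome {
    PKT_AGGREGATED,   /* counted in the flow map */
    PKT_SENT_RINGBUF, /* forwarded to user space as a one-packet flow */
    PKT_DROPPED       /* counted nowhere: the case this issue is about */
};

/*
 * Models the decision taken for one packet.
 *   flow_exists: the flow is already present in the map
 *   update_ok:   the map update (or insertion) succeeded
 */
static enum packet_outcome handle_packet(struct flow_metrics *flow,
                                         bool flow_exists, bool update_ok,
                                         uint64_t pkt_len)
{
    if (update_ok) {
        flow->packets += 1;
        flow->bytes += pkt_len;
        return PKT_AGGREGATED;
    }
    if (!flow_exists) {
        /* New flow: nothing was aggregated yet, so the ringbuf fallback applies. */
        return PKT_SENT_RINGBUF;
    }
    /* Existing flow and failed update: the packet is lost for the metrics. */
    return PKT_DROPPED;
}
```

In this model, every PKT_DROPPED outcome is one packet (and pkt_len bytes) missing from the flow's counters, which is where the under-estimation comes from.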
While we could certainly try not to drop those packets (e.g. by creating an ad-hoc one-packet flow to forward via the ring buffer), this may increase the CPU load on the agent, which is already dropping flows because it is too busy, so it is perhaps not desirable.
What we can do, however, is use a global variable to count these drops and expose it to user space, which would add it to the drops Prometheus metric. That way, people at least know when these drops happen and can try to further optimize the agent config to prevent them.
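A minimal sketch of the counter idea, again as plain C that runs in user space rather than real BPF code. All names are hypothetical (dropped_on_update_failure, count_update_failure_drop, drops_collector, collect_new_drops). It assumes the BPF side increments a global atomically, and that the agent periodically reads it and adds only the difference since the previous read to its monotonic drops metric.

```c
#include <stdint.h>

/*
 * Hypothetical global counter. In a real BPF program this would be a global
 * variable (or a map entry) that user space can read; here it is a plain
 * variable so the sketch can run anywhere.
 */
static volatile uint64_t dropped_on_update_failure = 0;

/* "BPF side": called where the update of an existing flow fails. */
static void count_update_failure_drop(void)
{
    /* GCC/Clang atomic builtin, so concurrent increments are not lost. */
    __sync_fetch_and_add(&dropped_on_update_failure, 1);
}

/* "User-space side": remembers the last value it has seen. */
struct drops_collector {
    uint64_t last_seen;
};

/*
 * Returns the number of drops recorded since the previous call.
 * The caller would add this delta to its drops metric.
 */
static uint64_t collect_new_drops(struct drops_collector *c)
{
    /* Adding 0 is used here as an atomic read of the current value. */
    uint64_t now = __sync_fetch_and_add(&dropped_on_update_failure, 0);
    uint64_t delta = now - c->last_seen;
    c->last_seen = now;
    return delta;
}
```

Reporting a delta instead of the raw value keeps the user-space metric monotonic without ever resetting the counter on the BPF side. The increment is a single atomic add, so the extra work on the packet path stays small compared with forwarding a one-packet flow.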