-
Story
-
Resolution: Done
-
Undefined
-
Quality / Stability / Reliability
-
False
-
-
False
-
5
-
NetObserv - Sprint 276, NetObserv - Sprint 277, NetObserv - Sprint 282, NetObserv - Sprint 283
Run various tests using the QE setups (perf-scale NDH/CD, reliability cluster...) and collect alerting data in order to refine the default thresholds, the PromQL queries, etc.
Areas to improve:
Impact of sampling: drop or DNS alerts, for example, can be skewed by sampling. See also this Slack thread: https://redhat-internal.slack.com/archives/C02939DP5L5/p1756911383230359
Bug created: https://issues.redhat.com/browse/NETOBSERV-2613
Some DNS errors happen frequently but aren't a real problem (at most, a performance issue): with k8s resolution, a domain like `myservice.svc` is looked up in several forms (as is, as `myservice.svc.cluster.local`, etc.), which regularly triggers DNS "domain not found" (NXDomain) responses. Not sure how to tackle that. See also: https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/ , https://medium.com/@GiteshWadhwa/optimizing-dns-resolution-in-kubernetes-best-practices-for-coredns-performance-e3f6ed041bbb
Maybe create 2 different alert templates: one for NXDomain with "info" severity only and a message telling how to optimize, and another for all other codes. Status: DONE
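To make the two-template idea concrete, here is a minimal sketch of routing a DNS response code to one of two alert templates. All names (`AlertTemplate`, `DNSNxDomain`, `DNSErrors`, `template_for_rcode`) are hypothetical and not taken from the NetObserv code base, and the "warning" severity for non-NXDomain codes is an assumption, since the note above only fixes the NXDomain severity to "info".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AlertTemplate:
    name: str       # hypothetical template name
    severity: str   # "info", "warning" or "critical"
    message: str


# NXDomain is often a side effect of search-domain expansion, so it only gets "info".
NXDOMAIN_TEMPLATE = AlertTemplate(
    name="DNSNxDomain",
    severity="info",
    message="Frequent NXDomain responses; consider using fully qualified names.",
)

# Assumption: every other error code shares one template with "warning" severity.
OTHER_ERRORS_TEMPLATE = AlertTemplate(
    name="DNSErrors",
    severity="warning",
    message="DNS errors other than NXDomain detected.",
)


def template_for_rcode(rcode: str) -> Optional[AlertTemplate]:
    """Return the template matching a DNS response code name, or None for NoError."""
    normalized = rcode.strip().lower()
    if normalized == "noerror":
        return None
    if normalized == "nxdomain":
        return NXDOMAIN_TEMPLATE
    return OTHER_ERRORS_TEMPLATE
```

For example, `template_for_rcode("NXDomain")` selects the info-only template, while `template_for_rcode("ServFail")` falls through to the template covering all other codes.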
I think the score can be improved by changing how severity impacts it: for instance, we could say that critical alerts have a range of [0, 6], warning [4, 8] and info [6, 10] (as an example).
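The overlapping ranges could be sketched as follows. This is only an illustration under two assumptions that the note above does not state: a higher score means a healthier state, and each alert carries a hypothetical `badness` value in [0, 1] (0 = barely over its threshold, 1 = as bad as it gets) that is mapped linearly onto the range of its severity.

```python
# Example ranges from the proposal above: (worst score, best score) per severity.
SEVERITY_RANGES = {
    "critical": (0.0, 6.0),
    "warning": (4.0, 8.0),
    "info": (6.0, 10.0),
}


def alert_score(severity: str, badness: float) -> float:
    """Map an alert onto its severity range; `badness` is clamped to [0, 1]."""
    low, high = SEVERITY_RANGES[severity]
    clamped = min(max(badness, 0.0), 1.0)
    # badness 0 gives the top of the range, badness 1 gives the bottom.
    return high - clamped * (high - low)
```

With these ranges, a severe warning (`alert_score("warning", 1.0)`, i.e. 4.0) scores worse than a mild critical alert (`alert_score("critical", 0.0)`, i.e. 6.0), which is the effect the overlap is meant to allow.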