- Bug
- Resolution: Done
- Critical
- netobserv-1.2
- False
- None
- False
- NetObserv - Sprint 234
- Important
During recent 1.2 PerfScale testing we have repeatedly observed that, once a cluster is under load, FLP pods exhibit the following behavior:
- Large groups of pods are deleted (not restarted) and immediately recreated. Because the pods are replaced rather than restarted, these events do not show up as restarts. Video for reference: flp.webm
- We have seen two different effects on flows. In some cases flows continue to be processed (likely due to the presence of Kafka in these tests). In other cases, nodes hosting both LokiStack and FLP resources go into NotReady state and flows are dropped.
We initially thought this was due to cert reloads, but that does not seem to be the case: the behavior does not occur over time when a cluster is not under load. In a dedicated test on a similar environment with no traffic generated, the FLP pod behavior was not observed and the pods remained stable. The working theory right now is that the issue is related to cluster resources/allocation.
Since the issue has been reproduced multiple times, I'm opening this bug to serve as a tracker while we collect more data and try to identify the root cause of this behavior.
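For data collection, one possible way to tell pod replacement apart from in-place restarts is to compare pod UIDs between two snapshots, since a recreated pod gets a new UID while a restarted container keeps the pod UID and only increments its restart count. The sketch below is only an illustration of that comparison: `PodState` and `diff_snapshots` are hypothetical names, the pod names in the example are made up, and it assumes the snapshots were already extracted from pod listings (`metadata.uid` and `status.containerStatuses[].restartCount`).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PodState:
    """Minimal view of one pod at snapshot time (hypothetical helper type)."""
    name: str
    uid: str            # metadata.uid, changes when a pod is recreated
    restart_count: int  # sum of containerStatuses[].restartCount


def diff_snapshots(before, after):
    """Classify pods as deleted, created, or restarted in place."""
    before_by_uid = {p.uid: p for p in before}
    after_by_uid = {p.uid: p for p in after}

    # UID present before but gone now: the pod object was deleted.
    deleted = sorted(p.name for uid, p in before_by_uid.items()
                     if uid not in after_by_uid)
    # UID present now but not before: a new pod object was created.
    created = sorted(p.name for uid, p in after_by_uid.items()
                     if uid not in before_by_uid)
    # Same UID with a higher restart count: containers restarted in place.
    restarted = sorted(p.name for uid, p in after_by_uid.items()
                       if uid in before_by_uid
                       and p.restart_count > before_by_uid[uid].restart_count)
    return {"deleted": deleted, "created": created, "restarted": restarted}


if __name__ == "__main__":
    # Made-up pod names and UIDs, for illustration only.
    before = [PodState("flp-aaa", "uid-1", 0), PodState("flp-bbb", "uid-2", 0)]
    after = [PodState("flp-aaa", "uid-1", 0), PodState("flp-ccc", "uid-3", 0)]
    print(diff_snapshots(before, after))
```

A large "deleted" and "created" set with an empty "restarted" set between two snapshots would match the replacement pattern described above.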
Previous discussions relating to this bug:
- relates to: NETOBSERV-684 Watch TLS certs & reload (Closed)
- split from: NETOBSERV-902 QE: Run performance tests for 1.2 release (Closed)