Network Observability / NETOBSERV-963

FLP pods are destroyed and recreated under load

    • NetObserv - Sprint 234
    • Important

      During recent 1.2 PerfScale testing we have repeatedly observed that, once a cluster is under load, FLP pods exhibit the following behavior:

      1. Large groups of pods are deleted (not restarted) and immediately recreated. These events do not show up as restarts, because the pods are being replaced rather than restarted. Video for reference: flp.webm
      2. We have seen two different effects on flows. In some cases flows continue to be processed (likely due to the presence of Kafka in these tests). In other cases, nodes hosting both LokiStack and FLP resources go into the NotReady state and flows are dropped.
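
To make the "replaced, not restarted" distinction in item 1 concrete, here is a minimal sketch of how two snapshots of the FLP pod list could be compared. It assumes each pod is reduced to its name, its `metadata.uid`, and a container restart count. The `PodSnapshot` type, the `classify_changes` helper, and the sample pod names are hypothetical and only illustrate the idea; they are not taken from the test tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PodSnapshot:
    """Hypothetical, simplified view of one pod at one point in time."""
    name: str
    uid: str            # Kubernetes assigns a new UID to every new pod object
    restart_count: int  # container restarts within the same pod object


def classify_changes(before, after):
    """Split the differences between two pod snapshots into deleted,
    created, and restarted pods.

    A replaced pod shows up as one deleted UID plus one created UID;
    a restarted pod keeps its UID and only its restart count grows.
    """
    before_by_uid = {p.uid: p for p in before}
    after_by_uid = {p.uid: p for p in after}

    deleted = sorted(p.name for uid, p in before_by_uid.items()
                     if uid not in after_by_uid)
    created = sorted(p.name for uid, p in after_by_uid.items()
                     if uid not in before_by_uid)
    restarted = sorted(
        p.name for uid, p in after_by_uid.items()
        if uid in before_by_uid
        and p.restart_count > before_by_uid[uid].restart_count
    )
    return {"deleted": deleted, "created": created, "restarted": restarted}


# Made-up sample data: one pod replaced, one untouched, one restarted.
before = [
    PodSnapshot("flp-a", "uid-1", 0),
    PodSnapshot("flp-b", "uid-2", 0),
    PodSnapshot("flp-c", "uid-3", 1),
]
after = [
    PodSnapshot("flp-a2", "uid-4", 0),
    PodSnapshot("flp-b", "uid-2", 0),
    PodSnapshot("flp-c", "uid-3", 2),
]
result = classify_changes(before, after)
print(result)
```

In this sample, only `flp-c` would be counted by a restart-based check. The replacement of `flp-a` by `flp-a2` is visible only through the changed UID, which matches the observation that the FLP pod churn does not show up as restarts.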

      We initially thought this was due to cert reloads, but that does not seem to be the case: the behavior does not occur over time when a cluster is not under load. In a dedicated test with a similar environment where no traffic was generated, the FLP pod behavior was not observed and the pods remained stable. The working theory right now is that the issue is related to cluster resources/allocation.

      Since the issue has been reproduced multiple times, I'm opening this bug to serve as a tracker while we collect more data and try to identify the root cause of this behavior.

      Previous discussions relating to this bug:

            jtakvori Joel Takvorian
            nweinber1 Nathan Weinberg
            Nathan Weinberg Nathan Weinberg