NETOBSERV-730 (Network Observability)

NO Controller and Kafka pods crashing in large-scale deployment

    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • Kafka
    • NetObserv - Sprint 229
    • Important

      Ran Test Bed 3 with node-density-heavy using the following modifications to the Loki limits:

        limits:
          global:
            ingestion:
              ingestionBurstSize: 100
              ingestionRate: 500
              maxGlobalStreamsPerTenant: 50000 

      Flows were processed initially, but after about 25 minutes the NO Controller and Kafka Zookeeper pods began failing.

      The NO Controller had the following event occur multiple times. My suspicion is that this is due to exceeding the Pod memory limits, as rhn-support-memodi previously observed in a different cluster.

      The Kafka Zookeeper pods began failing shortly after the NO Controller, but I am not sure why. I did see the following error when inspecting the Zookeeper pods:

        Warning  Unhealthy               58m                 kubelet                  Readiness probe errored: rpc error: code = NotFound desc = container is not created or running: checking if PID of e12b296239a1f3cf92903744ef2ea01b0a94e2b9237def6faccb52fe856ca0e7 is running failed: container process not found 
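
This message indicates the readiness probe was attempted while the container was not running, so on its own it does not say why the container stopped. To see how the probe failures are spread across the Zookeeper pods, the events could be tallied per pod. A minimal sketch, assuming the standard core/v1 Event fields (`type`, `reason`, `involvedObject.name`, `count`) as returned by `oc get events -o json`; the function name, pod names, and counts in the sample are hypothetical:

```python
from collections import Counter


def unhealthy_counts(event_list):
    """Count Warning/Unhealthy events per involved object name."""
    counts = Counter()
    for ev in event_list.get("items", []):
        if ev.get("type") == "Warning" and ev.get("reason") == "Unhealthy":
            name = ev.get("involvedObject", {}).get("name", "<unknown>")
            # `count` may be missing on some events; treat that as one occurrence.
            counts[name] += ev.get("count") or 1
    return counts


# Hypothetical sample, shaped like `oc get events -o json` output.
sample_events = {
    "items": [
        {"type": "Warning", "reason": "Unhealthy", "count": 3,
         "involvedObject": {"kind": "Pod", "name": "example-zookeeper-0"}},
        {"type": "Warning", "reason": "Unhealthy",
         "involvedObject": {"kind": "Pod", "name": "example-zookeeper-1"}},
        {"type": "Normal", "reason": "Pulled", "count": 1,
         "involvedObject": {"kind": "Pod", "name": "example-zookeeper-0"}},
    ]
}

for pod_name, n in sorted(unhealthy_counts(sample_events).items()):
    print(pod_name, n)
```

If the failures turn out to be concentrated on pods that also restarted, the container's previous termination state is probably a better lead than the probe message itself.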

      As you can see in the above chart, I canceled the workload, and eventually all Zookeeper pods recovered and flows began processing again. However, the NO Controller remains unstable.
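
One way to check the memory-limit suspicion for the NO Controller is to look at the last termination state of each container in the pod's JSON (for example from `oc get pod <name> -o json`). A minimal sketch, assuming the standard Kubernetes pod status fields (`status.containerStatuses[].lastState.terminated.reason`); the function name, container names, and restart counts in the sample are hypothetical and not taken from this cluster:

```python
import json


def oom_killed_containers(pod):
    """Return (container name, restart count) for each container whose
    last termination reason was OOMKilled."""
    hits = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        terminated = cs.get("lastState", {}).get("terminated") or {}
        if terminated.get("reason") == "OOMKilled":
            hits.append((cs["name"], cs.get("restartCount", 0)))
    return hits


# Hypothetical sample, shaped like `oc get pod <name> -o json` output.
sample_pod = json.loads("""
{
  "status": {
    "containerStatuses": [
      {"name": "manager", "restartCount": 7,
       "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}},
      {"name": "kube-rbac-proxy", "restartCount": 0, "lastState": {}}
    ]
  }
}
""")

print(oom_killed_containers(sample_pod))
```

An empty result would not rule out memory pressure entirely, since `lastState` only records the most recent termination, but a non-empty one would support the suspicion directly.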

        1. flowcollector.yaml
          6 kB
          Nathan Weinberg
        2. image-2022-11-30-14-52-33-949.png
          26 kB
          Nathan Weinberg
        3. image-2022-11-30-14-52-45-414.png
          22 kB
          Nathan Weinberg
        4. inspect.local.6443639870937166607.tar.xz
          316 kB
          Nathan Weinberg
        5. inspect.local.8236614351318825367.tar.xz
          1.31 MB
          Nathan Weinberg

            jtakvori Joel Takvorian
            nweinber1 Nathan Weinberg