OpenShift Logging / LOG-1059

fluentd pod OOMing


    • Sprint: Logging (Core) - Sprint 198

      Running ROSA CloudWatch Logging PerfScale tests per https://docs.google.com/document/d/10vv_SVC7fUammkvdn6-05bGrzdAtXxiJ_Z6QMzfUL4A/edit#

       

      I am seeing fluentd OOM-loop about every 15 minutes when running:

      220 log-generator pods (all on a single node); single container per pod

      250 messages per minute per pod

      512-byte message size
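
      In aggregate that is 220 × 250 = 55,000 messages per minute (~917 per second) of 512-byte payloads, i.e. roughly 28 MB of raw message data per minute, all emitted from a single node.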

       

      Snippet from the fluentd pod describe output:

      ```
      State:          Running
        Started:      Tue, 02 Feb 2021 16:34:42 +0000
      Last State:     Terminated
        Reason:       OOMKilled
        Exit Code:    137
        Started:      Tue, 02 Feb 2021 16:18:32 +0000
        Finished:     Tue, 02 Feb 2021 16:34:38 +0000
      Ready:          True
      Restart Count:  2
      Limits:
        memory:  736Mi
      Requests:
        cpu:     100m
        memory:  736Mi
      ```
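
      The memory climb is easy to watch while the generators run; a minimal sketch, assuming the cluster metrics API is available so oc adm top returns data:

      ```
      # Sample collector memory usage every 30s; the fluentd pods run in openshift-logging.
      while true; do
        oc -n openshift-logging adm top pods | grep fluentd
        sleep 30
      done
      ```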

       

      To replicate the issue (assumes a cluster with the cluster logging add-on installed):

      Create the log-generator namespace
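
      For example:

      ```
      oc create namespace log-generator
      ```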

      Pick a worker node with room for 220 additional pods (i.e. fewer than 30 pods currently running on the host)
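
      One way to check the current pod count on a candidate node (a sketch; the node name below is the one used in the Job manifests):

      ```
      oc get pods --all-namespaces --field-selector spec.nodeName=ip-10-0-224-163 --no-headers | wc -l
      ```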

      Apply the first Job:

      ```
      apiVersion: batch/v1
      kind: Job
      metadata:
        name: log-generator
        namespace: log-generator
      spec:
        parallelism: 110
        completions: 110
        template:
          metadata:
            labels:
              name: log-generator
          spec:
            nodeSelector:
              kubernetes.io/hostname: ip-10-0-224-163
            containers:
            - image: quay.io/dry923/log_generator
              name: log-generator
              command: ["/usr/bin/python3", "/log_generator.py"]
              args: ["--size", "512", "--duration", "60", "--messages-per-minute", "250"]
              imagePullPolicy: Never
            restartPolicy: Never
      ```
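
      For example, with the manifest above saved locally (the filename is illustrative):

      ```
      oc apply -f log-generator-job.yaml
      ```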

      Sleep 30 seconds (the OpenShift QPS limit will cause image pull back-offs if you try to deploy all 220 pods at once; you will still see some, but they recover quickly)
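
      Before applying the second Job, a quick way to confirm the first batch has settled (a sketch; it lists any pods that are not yet Running):

      ```
      oc -n log-generator get pods --field-selector=status.phase!=Running
      ```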

      Apply the second Job:

      ```
      apiVersion: batch/v1
      kind: Job
      metadata:
        name: log-generator2
        namespace: log-generator
      spec:
        parallelism: 110
        completions: 110
        template:
          metadata:
            labels:
              name: log-generator2
          spec:
            nodeSelector:
              kubernetes.io/hostname: ip-10-0-224-163
            containers:
            - image: quay.io/dry923/log_generator
              name: log-generator
              command: ["/usr/bin/python3", "/log_generator.py"]
              args: ["--size", "512", "--duration", "60", "--messages-per-minute", "250"]
              imagePullPolicy: Never
            restartPolicy: Never
      ```
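
      Once both Jobs are applied, a quick sanity check that all 220 generator pods are up (a sketch; counts Running pods in the namespace):

      ```
      oc -n log-generator get pods --field-selector=status.phase=Running --no-headers | wc -l
      # expected: 220
      ```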

       

      CPU usage for the fluentd pod is ~80%, and its memory climbs from ~350 MB to the 736Mi cap over 15-20 minutes, at which point the pod is OOM-killed.
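
      For reference, the collector memory limit is set through the ClusterLogging custom resource; a minimal sketch of the relevant stanza for raising it (assumes the default instance CR in openshift-logging; the 1Gi value is only an example, not a validated sizing for this load):

      ```
      apiVersion: logging.openshift.io/v1
      kind: ClusterLogging
      metadata:
        name: instance
        namespace: openshift-logging
      spec:
        collection:
          logs:
            type: fluentd
            fluentd:
              resources:
                limits:
                  memory: 1Gi
                requests:
                  cpu: 100m
                  memory: 1Gi
      ```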

        Attachments:
          1. image-2021-02-02-12-22-09-102.png (45 kB, Russell Zaleski)
          2. image-2021-02-02-12-25-21-021.png (33 kB, Russell Zaleski)
          3. image-2021-02-02-12-26-09-073.png (46 kB, Russell Zaleski)
          4. noname (3 kB, Alan Conway)

