Data Foundation Bugs / DFBUGS-938

ceph-csi-controller-manager pods OOMKilled


    • Type: Bug
    • Resolution: Done-Errata
    • Priority: Critical
    • odf-4.18
    • odf-4.17
    • ceph-csi-operator
    • None
    • False
    • None
    • False
    • Committed
    • ?
    • ?
    • 4.18.0-102
    • Committed
    • Cause: On installing ODF, ceph-csi-controller-manager tries to cache all ConfigMaps in the cluster
      Consequence: the ceph-csi-controller-manager pod gets OOMKilled
      Fix: the cache is scoped to only the namespace where the ceph-csi-controller-manager pod is running (see the sketch after this field list)
      Result: stable memory usage by the pod, which is no longer OOMKilled
    • Bug Fix
    • Proposed
    • None
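
      The fix described in the doc text above amounts to restricting the operator's informer cache to its own namespace. Below is a minimal controller-runtime sketch of that idea, assuming a controller-runtime v0.16+ cache.Options API; the POD_NAMESPACE variable and the openshift-storage fallback are illustrative assumptions, not taken from the ceph-csi-operator source.

      package main

      import (
          "os"

          corev1 "k8s.io/api/core/v1"
          ctrl "sigs.k8s.io/controller-runtime"
          "sigs.k8s.io/controller-runtime/pkg/cache"
          "sigs.k8s.io/controller-runtime/pkg/client"
      )

      func main() {
          // Assumption: the pod's namespace is exposed via the downward API.
          ns := os.Getenv("POD_NAMESPACE")
          if ns == "" {
              ns = "openshift-storage" // illustrative fallback only
          }

          // Without any scoping, the manager's cache would watch ConfigMaps
          // cluster-wide, so memory grows with the number of ConfigMaps in the
          // cluster. Scoping the ConfigMap informer to the operator's own
          // namespace keeps memory usage bounded.
          mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
              Cache: cache.Options{
                  ByObject: map[client.Object]cache.ByObject{
                      &corev1.ConfigMap{}: {
                          Namespaces: map[string]cache.Config{ns: {}},
                      },
                  },
              },
          })
          if err != nil {
              panic(err)
          }

          // Controllers would be registered here before starting the manager.
          if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
              panic(err)
          }
      }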

      Description of problem:

      The ceph-csi-controller-manager pod keeps getting OOMKilled.
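
      For reference, "OOMKilled" is the termination reason the kubelet records on the container status. A small client-go sketch like the one below (the namespace and pod-name filter are assumptions) can list the affected pods and their restart counts.

      package main

      import (
          "context"
          "fmt"
          "strings"

          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/client-go/kubernetes"
          ctrl "sigs.k8s.io/controller-runtime"
      )

      func main() {
          // Assumed namespace; ODF typically runs in openshift-storage.
          const ns = "openshift-storage"

          clientset := kubernetes.NewForConfigOrDie(ctrl.GetConfigOrDie())

          pods, err := clientset.CoreV1().Pods(ns).List(context.TODO(), metav1.ListOptions{})
          if err != nil {
              panic(err)
          }
          for _, pod := range pods.Items {
              if !strings.Contains(pod.Name, "ceph-csi-controller-manager") {
                  continue
              }
              for _, cs := range pod.Status.ContainerStatuses {
                  // A container killed for exceeding its memory limit reports
                  // "OOMKilled" as its last termination reason.
                  if t := cs.LastTerminationState.Terminated; t != nil && t.Reason == "OOMKilled" {
                      fmt.Printf("%s/%s: last termination OOMKilled, %d restarts\n",
                          pod.Name, cs.Name, cs.RestartCount)
                  }
              }
          }
      }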

       

      The OCP platform infrastructure and deployment type

      The cluster is bare metal, installed with the Assisted Installer, and has been upgraded since 4.11.

       

      The ODF deployment type (Internal, External, Internal-Attached (LSO), Multicluster, DR, Provider, etc):

      Internal ODF using 3 worker nodes and NVMe devices

       

      The version of all relevant components (OCP, ODF, RHCS, ACM whichever is applicable):

      cephcsi-operator.v4.17.0-rhodf
      
      

       

      Does this issue impact your ability to continue to work with the product?

      I don't know 

       

      Is there any workaround available to the best of your knowledge?

      We tried to bump the limits for the pod, based on https://access.redhat.com/solutions/7002548, but even at 30x the original values it still fails.
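
      Purely as an illustration of what bumping the limits looks like with client-go (the KCS article above describes the supported procedure, which is not reproduced here), a sketch such as the following raises the memory limit on a Deployment's containers; the Deployment name, namespace, and the 4Gi value are assumptions, and such edits may be reverted by OLM or the operator.

      package main

      import (
          "context"

          corev1 "k8s.io/api/core/v1"
          "k8s.io/apimachinery/pkg/api/resource"
          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/client-go/kubernetes"
          ctrl "sigs.k8s.io/controller-runtime"
      )

      func main() {
          // Assumed names; adjust to the actual Deployment and namespace.
          const ns = "openshift-storage"
          const deployName = "ceph-csi-controller-manager"

          clientset := kubernetes.NewForConfigOrDie(ctrl.GetConfigOrDie())
          ctx := context.TODO()

          deploy, err := clientset.AppsV1().Deployments(ns).Get(ctx, deployName, metav1.GetOptions{})
          if err != nil {
              panic(err)
          }

          // Raise the memory limit on every container in the pod template.
          for i := range deploy.Spec.Template.Spec.Containers {
              c := &deploy.Spec.Template.Spec.Containers[i]
              if c.Resources.Limits == nil {
                  c.Resources.Limits = corev1.ResourceList{}
              }
              c.Resources.Limits[corev1.ResourceMemory] = resource.MustParse("4Gi")
          }

          if _, err := clientset.AppsV1().Deployments(ns).Update(ctx, deploy, metav1.UpdateOptions{}); err != nil {
              panic(err)
          }
      }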

       

      Can this issue be reproduced? If so, please provide the hit rate

      We don't have another cluster to verify it. We were hit by this on our production cluster. 

       

       

      Can this issue be reproduced from the UI?

      If this is a regression, please provide more details to justify this:

      Steps to Reproduce:

      1.

      2.

      3.

      The exact date and time when the issue was observed, including timezone details:

       

      Actual results:

       

       

      Expected results:

       

      Logs collected and log location:

       

      Additional info:

       

              Leela Gangavarapu
              Rabin Yasharzadehe
              Oded Viner
              Votes: 0
              Watchers: 24

                Created:
                Updated:
                Resolved: