
[Backport to 4.18.z][GSS]ocs-metrics-exporter OOMKills


    • Bug
    • Resolution: Done
    • Major
    • odf-4.18.12
    • odf-4.18
    • ceph-monitoring
    • Customer is unable to upgrade their clusters as a result of this bug.
    • Committed
    • x86_64
    • 4.18.12-1.konflux
    • Committed
    • Important
    • Proposed

      Description of problem - Provide a detailed description of the issue encountered, including logs/command-output snippets and screenshots if the issue is observed in the UI:

      Customer states that after upgrading their clusters from ODF 4.17.7 to 4.18.6, the ocs-metrics-exporter pod continuously crashes due to OOMKills.

      Cluster #1:
      OCP Cluster ID: 0e7d9d45-756a-44dd-b971-b3e1262b66e6
      Ceph Cluster ID: d9e2be69-3519-45dc-9c6e-c0577afcfdbb
      OCP Version: 4.18.15
      ODF Version: 4.18.6
      
      $ oc get pod ocs-metrics-exporter-6f945875f-mpgcz -o json | jq -c '.status.containerStatuses[] | {name: .name, restarts: .restartCount, exitCode: .lastState.terminated.exitCode, reason: .lastState.terminated.reason}'
      {"name":"kube-rbac-proxy-main","restarts":109,"exitCode":137,"reason":"OOMKilled"}
      {"name":"kube-rbac-proxy-self","restarts":23,"exitCode":137,"reason":"OOMKilled"}
      {"name":"ocs-metrics-exporter","restarts":0,"exitCode":null,"reason":null}

       

      Cluster #2:
      OCP Cluster ID: 4daaa5fd-b76f-45eb-bb35-930f7abd41ef
      Ceph Cluster ID: d0a89bd3-b57e-4b44-91d0-412a63cf0492
      OCP Version: 4.18.15
      ODF Version: 4.18.6
      
      $ oc get pod ocs-metrics-exporter-6f945875f-6nc4l -o json | jq -c '.status.containerStatuses[] | {name: .name, restarts: .restartCount, exitCode: .lastState.terminated.exitCode, reason: .lastState.terminated.reason}'
      {"name":"kube-rbac-proxy-main","restarts":2,"exitCode":137,"reason":"OOMKilled"}
      {"name":"kube-rbac-proxy-self","restarts":7,"exitCode":137,"reason":"OOMKilled"}
      {"name":"ocs-metrics-exporter"," 


      I have attempted to increase resources for this deployment, but the changes did not persist. The ocs-metrics-exporter pod is unstable in both clusters.
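
      For reference, this is the kind of direct change that was attempted (a minimal sketch only, assuming the default openshift-storage namespace; the 256Mi limit is a placeholder, not the value actually applied):

      $ oc -n openshift-storage get deploy ocs-metrics-exporter -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{": "}{.resources}{"\n"}{end}'
      $ oc -n openshift-storage set resources deploy/ocs-metrics-exporter -c kube-rbac-proxy-main --limits=memory=256Mi
      # Per the report above, edits made directly to the deployment did not persist and were reverted.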

       

      The OCP platform infrastructure and deployment type (AWS, Bare Metal, VMware, etc. Please clarify if it is platform agnostic deployment), (IPI/UPI):

      VMware

      The ODF deployment type (Internal, External, Internal-Attached (LSO), Multicluster, DR, Provider, etc):

      Internal-Attached (LSO)

      Does this issue impact your ability to continue to work with the product?

      Yes. The customer states: "Lost metrics and visibility for storage metrics."

      Is there any workaround available to the best of your knowledge?

      No. I attempted to set the do-not-reconcile annotation on the deployment and increase the resources, but these changes were reverted. Attempting to modify it via the storagecluster CR was unsuccessful as well.
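
      For reference, this is the kind of StorageCluster override that was attempted (a minimal sketch; the storagecluster name ocs-storagecluster, the openshift-storage namespace, the ocs-metrics-exporter resource key, and the memory values are assumptions for illustration, and per the report this did not take effect):

      $ oc -n openshift-storage patch storagecluster ocs-storagecluster --type merge -p '{"spec":{"resources":{"ocs-metrics-exporter":{"requests":{"memory":"128Mi"},"limits":{"memory":"256Mi"}}}}}'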

       

      Can this issue be reproduced? If so, please provide the hit rate

      Customer has this issue occurring in two separate clusters.

      Can this issue be reproduced from the UI?

      N/A

      Actual results:

      Unable to allocate more resources to the ocs-metrics-exporter deployment; the kube-rbac-proxy containers continue to be OOMKilled.

      Expected results:

      The ocs-metrics-exporter pod should not be OOMKilled, or it should be possible to increase the resources allocated to this deployment via the storagecluster CR.

      Logs collected and log location:

      Case #04188878

              dkamboj@redhat.com Divyansh Kamboj
              rhn-support-rlaberin Ryan Laberinto
              Thotakura Chaitanya