Bug | Resolution: Done | Major | odf-4.18 | x86_64 | 4.18.12-1.konflux | Committed | Important | Proposed
Description of problem - Provide a detailed description of the issue encountered, including logs/command-output snippets and screenshots if the issue is observed in the UI:
Customer stated that after upgrading their clusters from ODF 4.17.7 to 4.18.6, the ocs-metrics-exporter pod continuously crashes due to OOMKills.
Cluster #1:
OCP Cluster ID: 0e7d9d45-756a-44dd-b971-b3e1262b66e6
Ceph Cluster ID: d9e2be69-3519-45dc-9c6e-c0577afcfdbb
OCP Version: 4.18.15
ODF Version: 4.18.6

$ oc get pod ocs-metrics-exporter-6f945875f-mpgcz -o json | jq -c '.status.containerStatuses[] | {name: .name, restarts: .restartCount, exitCode: .lastState.terminated.exitCode, reason: .lastState.terminated.reason}'
{"name":"kube-rbac-proxy-main","restarts":109,"exitCode":137,"reason":"OOMKilled"}
{"name":"kube-rbac-proxy-self","restarts":23,"exitCode":137,"reason":"OOMKilled"}
{"name":"ocs-metrics-exporter","restarts":0,"exitCode":null,"reason":null}
Cluster #2:
OCP Cluster ID: 4daaa5fd-b76f-45eb-bb35-930f7abd41ef
Ceph Cluster ID: d0a89bd3-b57e-4b44-91d0-412a63cf0492
OCP Version: 4.18.15
ODF Version: 4.18.6

$ oc get pod ocs-metrics-exporter-6f945875f-6nc4l -o json | jq -c '.status.containerStatuses[] | {name: .name, restarts: .restartCount, exitCode: .lastState.terminated.exitCode, reason: .lastState.terminated.reason}'
{"name":"kube-rbac-proxy-main","restarts":2,"exitCode":137,"reason":"OOMKilled"}
{"name":"kube-rbac-proxy-self","restarts":7,"exitCode":137,"reason":"OOMKilled"}
{"name":"ocs-metrics-exporter","
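The memory limits currently applied to the OOMKilled sidecars can be checked with something like the following (sketch only; the openshift-storage namespace is assumed, as it is not shown in the output above):

# namespace assumed; prints each container name and its configured resources
$ oc -n openshift-storage get deployment ocs-metrics-exporter -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{"\t"}{.resources}{"\n"}{end}'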
I've attempted to increase resources for this deployment, but none of the changes I made persisted. The ocs-metrics-exporter pod is unstable in both clusters.
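For example, a direct change along these lines (sketch only; container name taken from the restart output above, openshift-storage namespace assumed) is rolled back by the operator shortly after it is applied:

# namespace assumed; edits the deployment directly, which the operator reconciles away
$ oc -n openshift-storage set resources deployment/ocs-metrics-exporter -c kube-rbac-proxy-main --limits=memory=256Mi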
The OCP platform infrastructure and deployment type (AWS, Bare Metal, VMware, etc. Please clarify if it is platform agnostic deployment), (IPI/UPI):
VMware
The ODF deployment type (Internal, External, Internal-Attached (LSO), Multicluster, DR, Provider, etc):
Internal-Attached (LSO)
Does this issue impact your ability to continue to work with the product?
Yes. Customer states: "Lost metrics and visibility for storage metrics."
Is there any workaround available to the best of your knowledge?
No. I attempted to set the do-not-reconcile annotation on the deployment and increase the resources, but these changes were reverted. Attempting to modify it via the storagecluster CR was unsuccessful as well.
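For reference, a storagecluster-level override would look roughly like the following (sketch only; ocs-storagecluster is the default CR name, and whether spec.resources honors an ocs-metrics-exporter key is an assumption, not confirmed):

# CR name and resources key are assumptions; this did not take effect in practice
$ oc -n openshift-storage patch storagecluster ocs-storagecluster --type merge -p '{"spec":{"resources":{"ocs-metrics-exporter":{"limits":{"memory":"512Mi"},"requests":{"memory":"256Mi"}}}}}'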
Can this issue be reproduced? If so, please provide the hit rate
Customer has this issue occurring in two separate clusters.
Can this issue be reproduced from the UI?
N/A
Actual results:
Unable to allocate more resources to the ocs-metrics-exporter deployment.
Expected results:
The ocs-metrics-exporter should not be getting OOMKilled, or it should be possible to increase the resources allocated to this deployment via the storagecluster CR.
Logs collected and log location:
Case #04188878