Type: Bug
Priority: Normal
Severity: Moderate
Resolution: Done
Fix Version: odf-4.17
Fixed in Version: 4.17.10-3
Release Note Type: Release Note Not Required
Description of problem - Provide a detailed description of the issue encountered, including logs/command-output snippets and screenshots if the issue is observed in the UI:
We came in this morning to thousands of alerts on our cluster, as well as over 6k pods in the openshift-storage project: various rook-ceph-exporter pods repeatedly trying to start on a given node.
$ oc get pods | grep -c rook-ceph-exporter<hostname>
5122
$ oc get pods | grep -c rook-ceph-crashcollector<hostname>
1522
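A quick way to see how those pods break down by state (my suggestion for triage, just counting the STATUS column of 'oc get pods'):
$ oc get pods -n openshift-storage --no-headers | awk '{print $3}' | sort | uniq -c | sort -rn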
This node was a host we were testing memory evictions on yesterday by creating large-memory VMs (hundreds of GB). As initial recovery steps, I stopped 4 of the large VMs and changed their memory profiles to use only 8 GB if they restart. I'll attach a screenshot showing their original memory request sizes for review. I then drained and rebooted the wn52 host, but the large number of Ceph-related pods still exists.
The OCP platform infrastructure and deployment type (AWS, Bare Metal, VMware, etc. Please clarify if it is platform agnostic deployment), (IPI/UPI):
OCP 4.18
The ODF deployment type (Internal, External, Internal-Attached (LSO), Multicluster, DR, Provider, etc):
Internal
The version of all relevant components (OCP, ODF, RHCS, ACM whichever is applicable):
$ omc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.18.5    True        False         40h     Cluster version is 4.18.5

$ omc get csv
NAME                                      DISPLAY                            VERSION        REPLACES                                            PHASE
cephcsi-operator.v4.17.5-rhodf            CephCSI operator                   4.17.5-rhodf   cephcsi-operator.v4.17.4-rhodf                      Succeeded
cluster-logging.v6.1.3                    Red Hat OpenShift Logging          6.1.3          cluster-logging.v6.1.2                              Succeeded
cluster-observability-operator.v1.0.0     Cluster Observability Operator     1.0.0          cluster-observability-operator.0.4.1                Succeeded
devworkspace-operator.v0.32.1             DevWorkspace Operator              0.32.1         devworkspace-operator.v0.31.2                       Succeeded
loki-operator.v6.1.3                      Loki Operator                      6.1.3          loki-operator.v6.1.2                                Succeeded
mcg-operator.v4.17.5-rhodf                NooBaa Operator                    4.17.5-rhodf   mcg-operator.v4.17.4-rhodf                          Succeeded
node-healthcheck-operator.v0.9.0          Node Health Check Operator         0.9.0          node-healthcheck-operator.v0.8.2                    Succeeded
node-maintenance-operator.v5.4.0          Node Maintenance Operator          5.4.0          node-maintenance-operator.v5.3.1                    Installing
ocs-client-operator.v4.17.5-rhodf         OpenShift Data Foundation Client   4.17.5-rhodf   ocs-client-operator.v4.17.4-rhodf                   Succeeded
ocs-operator.v4.17.5-rhodf                OpenShift Container Storage        4.17.5-rhodf   ocs-operator.v4.17.4-rhodf                          Succeeded
odf-csi-addons-operator.v4.17.5-rhodf     CSI Addons                         4.17.5-rhodf   odf-csi-addons-operator.v4.17.4-rhodf               Succeeded
odf-operator.v4.17.5-rhodf                OpenShift Data Foundation          4.17.5-rhodf   odf-operator.v4.17.4-rhodf                          Succeeded
odf-prometheus-operator.v4.17.5-rhodf     Prometheus Operator                4.17.5-rhodf   odf-prometheus-operator.v4.17.4-rhodf               Succeeded
openshift-gitops-operator.v1.15.1         Red Hat OpenShift GitOps           1.15.1         openshift-gitops-operator.v1.15.0-0.1738074324.p    Succeeded
recipe.v4.17.5-rhodf                      Recipe                             4.17.5-rhodf   recipe.v4.17.4-rhodf                                Succeeded
rook-ceph-operator.v4.17.5-rhodf          Rook-Ceph                          4.17.5-rhodf   rook-ceph-operator.v4.17.4-rhodf                    Succeeded
self-node-remediation.v0.10.0             Self Node Remediation Operator     0.10.0         self-node-remediation.v0.9.0                        Succeeded
web-terminal.v1.12.1                      Web Terminal                       1.12.1         web-terminal.v1.11.1                                Succeeded
Does this issue impact your ability to continue to work with the product?
Yes, this many pods slows down simple operations like 'oc get pods'.
Is there any workaround available to the best of your knowledge?
No
Can this issue be reproduced? If so, please provide the hit rate
Yes, 100%
Can this issue be reproduced from the UI?
If this is a regression, please provide more details to justify this:
Steps to Reproduce:
- Configure soft memory eviction in the kubelet configuration (a minimal KubeletConfig sketch follows this list)
- I used `memory.available: 10%`, but any percentage should work provided you can consume enough memory that the node drops below that threshold.
- Wait for MCO / MCP to roll out the updated configuration to all nodes
- Create some pods to generate load and consume enough memory resources to trigger the soft evictions.
- As a note, I used Linux VMs running stress-ng for my testing, but you should be able to produce the same results by spinning up any stress-test pods that consume the available memory (see the example pod after this list).
- At this point it is mostly a matter of waiting for the node to enter MemoryPressure, remain there past the evictionSoftGracePeriod, and watching the kubelet log (see the command below). Eventually you'll see the rook-ceph pods getting evicted and rescheduled over and over again.
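For reference, a minimal sketch of the kubelet soft-eviction configuration from the first step, assuming it targets the worker pool; the KubeletConfig name and the grace period are illustrative placeholders, only the 10% threshold comes from my setup:
$ cat <<'EOF' | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: memory-soft-eviction          # illustrative name
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
  kubeletConfig:
    evictionSoft:
      memory.available: "10%"
    evictionSoftGracePeriod:
      memory.available: "1m30s"       # illustrative grace period
EOF
The MCO then rolls this out to the selected pool, which is the MCO / MCP rollout the second step waits on.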
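For the load-generation step, a hedged example of a stress-test pod pinned to the node under test; the pod name, the polinux/stress image, and the 8G figure are assumptions for illustration rather than what I actually ran (I used stress-ng inside VMs):
$ cat <<'EOF' | oc create -f -
apiVersion: v1
kind: Pod
metadata:
  name: memory-hog                    # illustrative name
spec:
  nodeName: <hostname>                # pin to the node under test
  restartPolicy: Never
  containers:
  - name: stress
    image: polinux/stress             # any stress image should do
    command: ["stress"]
    # Allocate 8G and hold it (--vm-hang 0 = never free) so the node's
    # memory.available falls below the soft-eviction threshold.
    args: ["--vm", "1", "--vm-bytes", "8G", "--vm-hang", "0"]
EOF
Run several of these (or raise --vm-bytes), since the pod has no memory limit and keeps its allocation until the node reports MemoryPressure.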
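To watch the evictions in the last step, something along these lines should work (the node name is a placeholder):
$ oc adm node-logs <hostname> -u kubelet | grep -iE 'evict|memorypressure'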
The exact date and time when the issue was observed, including timezone details:
Actual results:
Expected results:
Logs collected and log location:
Additional info: