Type: Bug
Priority: Normal
Severity: Moderate
Resolution: Done
Fix Version: odf-4.17
Fixed in Version: 4.17.10-3
Release Note Type: Release Note Not Required
Description of problem - Provide a detailed description of the issue encountered, including logs/command-output snippets and screenshots if the issue is observed in the UI:
We came in this morning to thousands of alerts on our cluster, as well as over 6k pods in the openshift-storage project: various rook-ceph-exporter pods repeatedly trying to start on a given node.
$ oc get pods | grep -c rook-ceph-exporter<hostname>
5122
$ oc get pods | grep -c rook-ceph-crashcollector<hostname>
1522
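A quick way to see how those pods break down by state (my suggestion for triage, just counting the STATUS column of 'oc get pods'):
$ oc get pods -n openshift-storage --no-headers | awk '{print $3}' | sort | uniq -c | sort -rn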
This node was a host we were testing memory evictions on yesterday by creating large-memory VMs (hundreds of GB). As initial recovery steps, I stopped 4 of the large VMs and changed their memory profiles to use only 8 GB if they restart. I'll attach a screenshot showing their original memory request sizes for review. I then drained and rebooted the wn52 host, but the large number of Ceph-related pods still exists.
The OCP platform infrastructure and deployment type (AWS, Bare Metal, VMware, etc. Please clarify if it is platform agnostic deployment), (IPI/UPI):
OCP 4.18
The ODF deployment type (Internal, External, Internal-Attached (LSO), Multicluster, DR, Provider, etc):
Internal
The version of all relevant components (OCP, ODF, RHCS, ACM whichever is applicable):
$ omc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.18.5    True        False         40h     Cluster version is 4.18.5

$ omc get csv
NAME                                      DISPLAY                            VERSION        REPLACES                                            PHASE
cephcsi-operator.v4.17.5-rhodf            CephCSI operator                   4.17.5-rhodf   cephcsi-operator.v4.17.4-rhodf                      Succeeded
cluster-logging.v6.1.3                    Red Hat OpenShift Logging          6.1.3          cluster-logging.v6.1.2                              Succeeded
cluster-observability-operator.v1.0.0     Cluster Observability Operator     1.0.0          cluster-observability-operator.0.4.1                Succeeded
devworkspace-operator.v0.32.1             DevWorkspace Operator              0.32.1         devworkspace-operator.v0.31.2                       Succeeded
loki-operator.v6.1.3                      Loki Operator                      6.1.3          loki-operator.v6.1.2                                Succeeded
mcg-operator.v4.17.5-rhodf                NooBaa Operator                    4.17.5-rhodf   mcg-operator.v4.17.4-rhodf                          Succeeded
node-healthcheck-operator.v0.9.0          Node Health Check Operator         0.9.0          node-healthcheck-operator.v0.8.2                    Succeeded
node-maintenance-operator.v5.4.0          Node Maintenance Operator          5.4.0          node-maintenance-operator.v5.3.1                    Installing
ocs-client-operator.v4.17.5-rhodf         OpenShift Data Foundation Client   4.17.5-rhodf   ocs-client-operator.v4.17.4-rhodf                   Succeeded
ocs-operator.v4.17.5-rhodf                OpenShift Container Storage        4.17.5-rhodf   ocs-operator.v4.17.4-rhodf                          Succeeded
odf-csi-addons-operator.v4.17.5-rhodf     CSI Addons                         4.17.5-rhodf   odf-csi-addons-operator.v4.17.4-rhodf               Succeeded
odf-operator.v4.17.5-rhodf                OpenShift Data Foundation          4.17.5-rhodf   odf-operator.v4.17.4-rhodf                          Succeeded
odf-prometheus-operator.v4.17.5-rhodf     Prometheus Operator                4.17.5-rhodf   odf-prometheus-operator.v4.17.4-rhodf               Succeeded
openshift-gitops-operator.v1.15.1         Red Hat OpenShift GitOps           1.15.1         openshift-gitops-operator.v1.15.0-0.1738074324.p    Succeeded
recipe.v4.17.5-rhodf                      Recipe                             4.17.5-rhodf   recipe.v4.17.4-rhodf                                Succeeded
rook-ceph-operator.v4.17.5-rhodf          Rook-Ceph                          4.17.5-rhodf   rook-ceph-operator.v4.17.4-rhodf                    Succeeded
self-node-remediation.v0.10.0             Self Node Remediation Operator     0.10.0         self-node-remediation.v0.9.0                        Succeeded
web-terminal.v1.12.1                      Web Terminal                       1.12.1         web-terminal.v1.11.1                                Succeeded
Does this issue impact your ability to continue to work with the product?
Yes, this many pods slows down simple operations like 'oc get pods'.
Is there any workaround available to the best of your knowledge?
No
Can this issue be reproduced? If so, please provide the hit rate
Yes, 100%
Can this issue be reproduced from the UI?
If this is a regression, please provide more details to justify this:
Steps to Reproduce:
- Configure soft memory eviction in the kubelet configuration (a minimal KubeletConfig sketch follows this list)
- I used `memory.available: 10%`, but any percentage should work provided you can consume enough memory that the node drops below that threshold.
- Wait for MCO / MCP to roll out the updated configuration to all nodes
- Create some pods to generate load and consume enough memory resources to trigger the soft evictions.
- As a note, I used Linux VMs running stress-ng for my testing, but you should be able to produce the same results by spinning up any stress-test pods that consume the available memory (see the example pod after this list).
- At this point it is mostly a matter of waiting for the node to enter MemoryPressure, remain there past the evictionSoftGracePeriod, and watching the kubelet log (see the command below). Eventually you'll see the rook-ceph pods getting evicted and rescheduled over and over again.
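For reference, a minimal sketch of the kubelet soft-eviction configuration from the first step, assuming it targets the worker pool; the KubeletConfig name and the grace period are illustrative placeholders, only the 10% threshold comes from my setup:
$ cat <<'EOF' | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: memory-soft-eviction          # illustrative name
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
  kubeletConfig:
    evictionSoft:
      memory.available: "10%"
    evictionSoftGracePeriod:
      memory.available: "1m30s"       # illustrative grace period
EOF
The MCO then rolls this out to the selected pool, which is the MCO / MCP rollout the second step waits on.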
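For the load-generation step, a hedged example of a stress-test pod pinned to the node under test; the pod name, the polinux/stress image, and the 8G figure are assumptions for illustration rather than what I actually ran (I used stress-ng inside VMs):
$ cat <<'EOF' | oc create -f -
apiVersion: v1
kind: Pod
metadata:
  name: memory-hog                    # illustrative name
spec:
  nodeName: <hostname>                # pin to the node under test
  restartPolicy: Never
  containers:
  - name: stress
    image: polinux/stress             # any stress image should do
    command: ["stress"]
    # Allocate 8G and hold it (--vm-hang 0 = never free) so the node's
    # memory.available falls below the soft-eviction threshold.
    args: ["--vm", "1", "--vm-bytes", "8G", "--vm-hang", "0"]
EOF
Run several of these (or raise --vm-bytes), since the pod has no memory limit and keeps its allocation until the node reports MemoryPressure.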
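To watch the evictions in the last step, something along these lines should work (the node name is a placeholder):
$ oc adm node-logs <hostname> -u kubelet | grep -iE 'evict|memorypressure'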
The exact date and time when the issue was observed, including timezone details:
Actual results:
Expected results:
Logs collected and log location:
Additional info: