Data Foundation Bugs / DFBUGS-1917

[GSS] Rook-ceph-exporter pods keep spinning up


    • Type: Bug
    • Resolution: Done-Errata
    • Priority: Normal
    • Fix Version/s: odf-4.19
    • Affects Version/s: odf-4.17
    • Component/s: ocs-operator
    • Fixed in Build: 4.19.0-49.konflux
    • Release Note Type: Release Note Not Required
    • Severity: Moderate

       

      Description of problem - Provide a detailed description of the issue encountered, including logs/command-output snippets and screenshots if the issue is observed in the UI:

      We came in this morning to thousands of alerts on our cluster, as well as over 6k pods in the openshift-storage project, most of them duplicate rook-ceph-exporter and rook-ceph-crashcollector pods being started for a single node.

      $ oc get pods | grep -c rook-ceph-exporter-<hostname>
      5122
      $ oc get pods | grep -c rook-ceph-crashcollector-<hostname>
      1522
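
      For completeness, one way to see how the duplicates are spread across nodes (an illustrative command, not taken from the case data; the NODE column is field 7 in `oc get pods -o wide`):

      $ oc get pods -n openshift-storage -o wide | awk '/rook-ceph-exporter/ {print $7}' | sort | uniq -c | sort -rn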

      This node was a host we were testing memory evictions on yesterday by creating large-memory VMs (hundreds of GB). I have therefore taken some initial recovery steps: I stopped the 4 large VMs and changed their memory profiles to use only 8 GB if they restart. I'll attach a screenshot showing their original memory request sizes for review. I then drained and rebooted the wn52 host, but the large number of Ceph-related pods still exists.
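
      For reference, a sketch of the drain/uncordon and a quick check that the node's MemoryPressure condition has cleared; the exact flags I used weren't captured above, so treat these as standard oc commands with <hostname> as a placeholder:

      $ oc adm drain <hostname> --ignore-daemonsets --delete-emptydir-data
      $ oc adm uncordon <hostname>
      $ oc get node <hostname> -o jsonpath='{.status.conditions[?(@.type=="MemoryPressure")].status}'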

      The OCP platform infrastructure and deployment type (AWS, Bare Metal, VMware, etc. Please clarify if it is platform agnostic deployment), (IPI/UPI):

      ODF 4.18

      The ODF deployment type (Internal, External, Internal-Attached (LSO), Multicluster, DR, Provider, etc):

      Internal

       

      The version of all relevant components (OCP, ODF, RHCS, ACM whichever is applicable):

      $ omc get clusterversion
      NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.18.5    True        False         40h     Cluster version is 4.18.5
      $ omc get csv
      NAME                                    DISPLAY                            VERSION        REPLACES                                           PHASE
      cephcsi-operator.v4.17.5-rhodf          CephCSI operator                   4.17.5-rhodf   cephcsi-operator.v4.17.4-rhodf                     Succeeded
      cluster-logging.v6.1.3                  Red Hat OpenShift Logging          6.1.3          cluster-logging.v6.1.2                             Succeeded
      cluster-observability-operator.v1.0.0   Cluster Observability Operator     1.0.0          cluster-observability-operator.0.4.1               Succeeded
      devworkspace-operator.v0.32.1           DevWorkspace Operator              0.32.1         devworkspace-operator.v0.31.2                      Succeeded
      loki-operator.v6.1.3                    Loki Operator                      6.1.3          loki-operator.v6.1.2                               Succeeded
      mcg-operator.v4.17.5-rhodf              NooBaa Operator                    4.17.5-rhodf   mcg-operator.v4.17.4-rhodf                         Succeeded
      node-healthcheck-operator.v0.9.0        Node Health Check Operator         0.9.0          node-healthcheck-operator.v0.8.2                   Succeeded
      node-maintenance-operator.v5.4.0        Node Maintenance Operator          5.4.0          node-maintenance-operator.v5.3.1                   Installing
      ocs-client-operator.v4.17.5-rhodf       OpenShift Data Foundation Client   4.17.5-rhodf   ocs-client-operator.v4.17.4-rhodf                  Succeeded
      ocs-operator.v4.17.5-rhodf              OpenShift Container Storage        4.17.5-rhodf   ocs-operator.v4.17.4-rhodf                         Succeeded
      odf-csi-addons-operator.v4.17.5-rhodf   CSI Addons                         4.17.5-rhodf   odf-csi-addons-operator.v4.17.4-rhodf              Succeeded
      odf-operator.v4.17.5-rhodf              OpenShift Data Foundation          4.17.5-rhodf   odf-operator.v4.17.4-rhodf                         Succeeded
      odf-prometheus-operator.v4.17.5-rhodf   Prometheus Operator                4.17.5-rhodf   odf-prometheus-operator.v4.17.4-rhodf              Succeeded
      openshift-gitops-operator.v1.15.1       Red Hat OpenShift GitOps           1.15.1         openshift-gitops-operator.v1.15.0-0.1738074324.p   Succeeded
      recipe.v4.17.5-rhodf                    Recipe                             4.17.5-rhodf   recipe.v4.17.4-rhodf                               Succeeded
      rook-ceph-operator.v4.17.5-rhodf        Rook-Ceph                          4.17.5-rhodf   rook-ceph-operator.v4.17.4-rhodf                   Succeeded
      self-node-remediation.v0.10.0           Self Node Remediation Operator     0.10.0         self-node-remediation.v0.9.0                       Succeeded
      web-terminal.v1.12.1                    Web Terminal                       1.12.1         web-terminal.v1.11.1                               Succeeded 

       

      Does this issue impact your ability to continue to work with the product?

      Yes. This many pods slows down even simple operations like 'oc get pods'.

       

      Is there any workaround available to the best of your knowledge?

      No

       

      Can this issue be reproduced? If so, please provide the hit rate

      Yes, 100%

       

      Can this issue be reproduced from the UI?

       

      If this is a regression, please provide more details to justify this:

       

      Steps to Reproduce:

      1. Configure a memory soft-eviction kubelet configuration (see the sketch after this list).
         • I used `memory.available: 10%`, but any percentage should work provided you can consume enough memory that the node drops below this percentage of available memory.
      2. Wait for the MCO / MCP to roll out the updated configuration to all nodes.
      3. Create some pods to generate load and consume enough memory to trigger the soft evictions (a minimal example also follows this list).
         • As a note, I used Linux VMs running stress-ng for my testing, but you should be able to produce the same results by spinning up any stress-test pods that consume the available memory.
      4. At this point it was mostly a matter of waiting for the node to go into a MemoryPressure state, remain there past the evictionSoftGracePeriod threshold, and watching the kubelet log. Eventually you'll see the rook-ceph pods getting evicted and rescheduled over and over again.
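
      For step 1, a minimal sketch of a KubeletConfig that enables the 10% memory.available soft eviction. It assumes the default worker MachineConfigPool label; the resource name and the 1m30s grace period are illustrative, not taken from the case:

      $ cat kubelet-soft-eviction.yaml
      apiVersion: machineconfiguration.openshift.io/v1
      kind: KubeletConfig
      metadata:
        name: worker-soft-eviction            # hypothetical name
      spec:
        machineConfigPoolSelector:
          matchLabels:
            pools.operator.machineconfiguration.openshift.io/worker: ""
        kubeletConfig:
          evictionSoft:
            memory.available: "10%"
          evictionSoftGracePeriod:
            memory.available: "1m30s"         # illustrative grace period
      $ oc apply -f kubelet-soft-eviction.yaml

      For step 3, one possible stand-in for the stress-ng VMs: a pod that fills a memory-backed emptyDir (tmpfs), which counts against the node's available memory. The pod name, image, 8 GiB size, and node pinning are all illustrative:

      $ cat mem-hog.yaml
      apiVersion: v1
      kind: Pod
      metadata:
        name: mem-hog                         # hypothetical name
      spec:
        restartPolicy: Never
        nodeName: <hostname>                  # pin to the node under test
        containers:
        - name: fill
          image: registry.access.redhat.com/ubi9/ubi
          # write 8 GiB of zeros into tmpfs, then hold it until the pod is deleted
          command: ["sh", "-c", "dd if=/dev/zero of=/hog/fill bs=1M count=8192 && sleep infinity"]
          volumeMounts:
          - name: hog
            mountPath: /hog
        volumes:
        - name: hog
          emptyDir:
            medium: Memory
      $ oc apply -f mem-hog.yaml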

       

      The exact date and time when the issue was observed, including timezone details:

       

       

      Actual results:

       

       

      Expected results:

       

      Logs collected and log location:

       

      Additional info:

       
       

              Assignee: Parth Arora (paarora@redhat.com)
              Reporter: Kelson White (rhn-support-kelwhite)
              QA Contact: Vishakha Kathole