Data Foundation Bugs / DFBUGS-397

[2314239] [MDS] "HEALTH_WARN" with "1 clients failing to respond to capability release" on Ceph Version 17.2.6-246.el9cp


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Version: odf-4.14
    • Component: ceph/CephFS/x86

      Description of problem (please be as detailed as possible and provide log snippets):

      This case was originally opened because the customer was hitting slow ops as a result of SELinux relabeling. We fixed that and were monitoring before closing the case. The customer reported that everything was working well, but just as we were about to close the case they hit the "HEALTH_WARN" with "1 clients failing to respond to capability release" issue.

      Upon further research, the customer should not be hitting this issue: they are on Ceph version 17.2.6-246.el9cp (6.1z7 / 6.1.7), and from what I am tracking the fix was implemented in 6.1z4.

      The good news is that the customer states that storage is working fine, and we have tracked the issue down to one specific workload (a very high file-count workload):

      Name: pvc-d272cca3-9f94-423c-ad37-f4993caf2949
      Labels: <none>
      Annotations: pv.kubernetes.io/provisioned-by: openshift-storage.cephfs.csi.ceph.com
      volume.kubernetes.io/provisioner-deletion-secret-name: rook-csi-cephfs-provisioner
      volume.kubernetes.io/provisioner-deletion-secret-namespace: openshift-storage
      Finalizers: [kubernetes.io/pv-protection]
      StorageClass: vivo-cephfs-selinux-relabel
      Status: Bound
      Claim: osb-fscustomer-esteira2/osb-fscustomer-esteira2-pvc <------ WORKLOAD
      Reclaim Policy: Delete
      Access Modes: RWX
      VolumeMode: Filesystem
      Capacity: 30Gi
      Node Affinity: <none>
      Message:
      Source:
      Type: CSI (a Container Storage Interface (CSI) volume source)
      Driver: openshift-storage.cephfs.csi.ceph.com
      FSType:
      VolumeHandle: 0001-0011-openshift-storage-0000000000000001-aab3519a-5c94-4874-abb1-6cb90ececb6a
      ReadOnly: false
      VolumeAttributes: clusterID=openshift-storage
      fsName=ocs-storagecluster-cephfilesystem
      kernelMountOptions=context="system_u:object_r:container_file_t:s0"
      storage.kubernetes.io/csiProvisionerIdentity=1726082784973-9392-openshift-storage.cephfs.csi.ceph.com
      subvolumeName=csi-vol-aab3519a-5c94-4874-abb1-6cb90ececb6a
      subvolumePath=/volumes/csi/csi-vol-aab3519a-5c94-4874-abb1-6cb90ececb6a/05ed12dd-b991-4bf0-9b9e-35611555e642
      Events: <none>
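
      For reference, the PV details above can be pulled with standard oc commands (the PVC and PV names come from the customer environment shown above):

      ~~~
      # Standard oc commands that produce the PV details shown above.
      $ oc get pvc -n osb-fscustomer-esteira2 osb-fscustomer-esteira2-pvc    # shows the bound PV name
      $ oc describe pv pvc-d272cca3-9f94-423c-ad37-f4993caf2949              # the output captured above
      ~~~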

      Specifically, two operations (Issue A and Issue B) cause this issue:

      A. A PVC clone operation, or copying PVC content from a pod to a folder on the node.

      B. Creating 5 pods (osb-fscustomer-server1, osb-fscustomer-server2, osb-fscustomer-server3, osb-fscustomer-server4, osb-fscustomer-server5) that all mount osb-fscustomer-esteira2-pvc (a minimal sketch of this follows below).
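
      A minimal, hypothetical sketch of operation B, using the namespace, PVC, and pod names from this report; the image, command, and mount path are placeholders, not the customer's actual pod spec:

      ~~~
      # Hypothetical reproduction of operation B: five pods mounting the same RWX CephFS PVC.
      # Namespace/PVC/pod names are from this report; image, command, and mount path are placeholders.
      for i in 1 2 3 4 5; do
      cat <<EOF | oc apply -n osb-fscustomer-esteira2 -f -
      apiVersion: v1
      kind: Pod
      metadata:
        name: osb-fscustomer-server${i}
      spec:
        containers:
        - name: app
          image: registry.access.redhat.com/ubi9/ubi-minimal
          command: ["sleep", "infinity"]
          volumeMounts:
          - name: data
            mountPath: /data
        volumes:
        - name: data
          persistentVolumeClaim:
            claimName: osb-fscustomer-esteira2-pvc
      EOF
      done
      ~~~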

      Since we're able to successfully reproduce this issue, the customer has been given a data collection process that should give Engineering all the logs/data needed.

      Version of all relevant components (if applicable):

      OCP:
      NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
      version 4.14.33 True False 49d Cluster version is 4.14.33

      ODF:
      NAME DISPLAY VERSION REPLACES PHASE
      argocd-operator.v0.12.0 Argo CD 0.12.0 argocd-operator.v0.11.0 Succeeded
      cluster-logging.v5.9.6 Red Hat OpenShift Logging 5.9.6 cluster-logging.v5.9.5 Succeeded
      loki-operator.v5.9.6 Loki Operator 5.9.6 loki-operator.v5.9.5 Succeeded
      mcg-operator.v4.14.10-rhodf NooBaa Operator 4.14.10-rhodf mcg-operator.v4.14.9-rhodf Succeeded
      ocs-operator.v4.14.10-rhodf OpenShift Container Storage 4.14.10-rhodf ocs-operator.v4.14.9-rhodf Succeeded
      odf-csi-addons-operator.v4.14.10-rhodf CSI Addons 4.14.10-rhodf odf-csi-addons-operator.v4.14.9-rhodf Succeeded
      odf-operator.v4.14.10-rhodf OpenShift Data Foundation 4.14.10-rhodf odf-operator.v4.14.9-rhodf Succeeded
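
      For reference, the OCP and operator versions above can be collected with standard commands (the CSV listing assumes the operators are visible in the openshift-storage namespace; adjust the namespace if needed):

      ~~~
      # Standard commands for the version listings above.
      $ oc get clusterversion                  # OCP version table
      $ oc get csv -n openshift-storage        # installed operator CSVs (ODF, OCS, MCG, etc.)
      ~~~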

      Ceph:

      {
          "mon": { "ceph version 17.2.6-246.el9cp (0f65af2d95ce0936640f6ccd6a4825dce6237e4f) quincy (stable)": 3 },
          "mgr": { "ceph version 17.2.6-246.el9cp (0f65af2d95ce0936640f6ccd6a4825dce6237e4f) quincy (stable)": 1 },
          "osd": { "ceph version 17.2.6-246.el9cp (0f65af2d95ce0936640f6ccd6a4825dce6237e4f) quincy (stable)": 3 },
          "mds": { "ceph version 17.2.6-246.el9cp (0f65af2d95ce0936640f6ccd6a4825dce6237e4f) quincy (stable)": 2 },
          "rgw": { "ceph version 17.2.6-246.el9cp (0f65af2d95ce0936640f6ccd6a4825dce6237e4f) quincy (stable)": 1 },
          "overall": { "ceph version 17.2.6-246.el9cp (0f65af2d95ce0936640f6ccd6a4825dce6237e4f) quincy (stable)": 10 }
      }
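
      The JSON above matches the output format of ceph versions, which can be run from the toolbox pod:

      ~~~
      $ oc exec -n openshift-storage deployment/rook-ceph-tools -- ceph versions
      ~~~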

      Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

      Yes, the customer wants to roll this out in production and has flagged this issue as crucial to fix given their current timelines. Below is the customer's statement:

      "This issue is impacting a major project with Telefonica Brazil. If this is a bug, we urgently need a Bugzilla report created as soon as possible to facilitate a timely resolution."

      Is there any workaround available to the best of your knowledge?

      Yes, the following process clears the warning, but operations A and B still do not succeed:

      1. Run the following command to capture the client ID and the name of the MDS holding on to the caps:

      $ oc exec -n openshift-storage deployment/rook-ceph-tools -- ceph health detail

      2. Run the following command to capture the session:

      $ oc exec -n openshift-storage deployment/rook-ceph-tools -- ceph tell mds.ocs-storagecluster-cephfilesystem:0 session ls > active-mds-session-ls.txt

      3. Search for the <client.ID> in the active-mds-session-ls.txt file. When you find that client's session, scroll down until you see the csi-vol information and copy the csi-vol-xxx-xxx name up to (but not including) the /. Then run the following command:

      $ oc get pv | awk 'NR>1 {print $1}' | while read it; do oc describe pv ${it}; echo " "; done > pv.out
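
      As a possible shortcut (not part of the documented workaround): each CephFS PV records its subvolume name under spec.csi.volumeAttributes.subvolumeName (visible in the PV output earlier in this report), so jq can map the csi-vol name straight to its PV and bound claim. This is a sketch that assumes jq is available and that the field paths match your PV objects:

      ~~~
      # Hypothetical shortcut: map a csi-vol name directly to its PV and bound claim.
      # Assumes jq is available; the subvolume name below is the one from the PV shown in this report.
      $ oc get pv -o json | jq -r --arg vol "csi-vol-aab3519a-5c94-4874-abb1-6cb90ececb6a" \
          '.items[] | select(.spec.csi.volumeAttributes.subvolumeName? == $vol)
           | "\(.metadata.name)  \(.spec.claimRef.namespace)/\(.spec.claimRef.name)"'
      ~~~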

      4. Search pv.out for the csi-vol name you found in the client session. The PV that references it belongs to the problematic workload; scale that workload down.

      5. Once all pods for that workload have terminated, delete the `rook-ceph-mds-ocs-storagecluster-cephfilesystem-a` pod:

      ~~~
      $ oc delete pod -n openshift-storage rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-<pod-suffix>
      ~~~
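
      The full pod name for the delete above can be found by listing the MDS pods first (a sketch; app=rook-ceph-mds is the usual Rook label, but confirm it matches your deployment):

      ~~~
      # Sketch: list the MDS pods to get the full name of the "-a" pod.
      $ oc get pods -n openshift-storage -l app=rook-ceph-mds
      ~~~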

      6. After a few minutes, Ceph returns to a healthy state.
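
      Recovery can be confirmed from the toolbox pod with the same commands used in step 1:

      ~~~
      $ oc exec -n openshift-storage deployment/rook-ceph-tools -- ceph -s
      $ oc exec -n openshift-storage deployment/rook-ceph-tools -- ceph health detail
      ~~~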

      Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

      Is this issue reproducible?

      Yes

      If this is a regression, please provide more details to justify this:

      Steps to Reproduce:
      1. Perform a PVC clone operation, or copy PVC content from a pod to a folder on the node (a minimal clone sketch is included after this list).

      or

      2. Create 5 pods (osb-fscustomer-server1, osb-fscustomer-server2, osb-fscustomer-server3, osb-fscustomer-server4, osb-fscustomer-server5) that all mount osb-fscustomer-esteira2-pvc.
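
      A minimal, hypothetical sketch of the PVC clone in step 1, using the source PVC, namespace, and storage class from this report; the clone name is a placeholder and the size simply matches the 30Gi source:

      ~~~
      # Hypothetical reproduction of step 1: clone the CephFS PVC via a PVC dataSource.
      # Source PVC, namespace, and storage class are from this report; the clone name is a placeholder.
      cat <<EOF | oc apply -n osb-fscustomer-esteira2 -f -
      apiVersion: v1
      kind: PersistentVolumeClaim
      metadata:
        name: osb-fscustomer-esteira2-pvc-clone
      spec:
        storageClassName: vivo-cephfs-selinux-relabel
        accessModes:
        - ReadWriteMany
        resources:
          requests:
            storage: 30Gi
        dataSource:
          kind: PersistentVolumeClaim
          name: osb-fscustomer-esteira2-pvc
      EOF
      ~~~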

      Additional info:

      I will put the data collection steps given to the customer in the private comment. Once collected, we'll set needinfo.

              Venky Shankar (vshankar@redhat.com)
              Craig Wayman (rhn-support-crwayman)
              Elad Ben Aharon