- Bug
- Resolution: Unresolved
- Critical
- None
- odf-4.14
- None
Description of problem (please be as detailed as possible and provide log snippets):
This case was originally opened because the customer was hitting slow ops as a result of SELinux relabeling. We fixed that and were monitoring before case closure. The customer reported everything was working well, but just as we were about to close the case they hit "HEALTH_WARN" with the "1 clients failing to respond to capability release" issue.
Upon further research, the customer should not be hitting this issue: they are on Ceph version 17.2.6-246.el9cp (6.1z7 / 6.1.7), and from what I am tracking the fix was implemented in 6.1z4.
The good news is that the customer states storage is working fine, and we have tracked this down to one specific workload (a very high file count workload):
~~~
Name:            pvc-d272cca3-9f94-423c-ad37-f4993caf2949
Labels:          <none>
Annotations:     pv.kubernetes.io/provisioned-by: openshift-storage.cephfs.csi.ceph.com
                 volume.kubernetes.io/provisioner-deletion-secret-name: rook-csi-cephfs-provisioner
                 volume.kubernetes.io/provisioner-deletion-secret-namespace: openshift-storage
Finalizers:      [kubernetes.io/pv-protection]
StorageClass:    vivo-cephfs-selinux-relabel
Status:          Bound
Claim:           osb-fscustomer-esteira2/osb-fscustomer-esteira2-pvc   <------ WORKLOAD
Reclaim Policy:  Delete
Access Modes:    RWX
VolumeMode:      Filesystem
Capacity:        30Gi
Node Affinity:   <none>
Message:
Source:
    Type:              CSI (a Container Storage Interface (CSI) volume source)
    Driver:            openshift-storage.cephfs.csi.ceph.com
    FSType:
    VolumeHandle:      0001-0011-openshift-storage-0000000000000001-aab3519a-5c94-4874-abb1-6cb90ececb6a
    ReadOnly:          false
    VolumeAttributes:  clusterID=openshift-storage
                       fsName=ocs-storagecluster-cephfilesystem
                       kernelMountOptions=context="system_u:object_r:container_file_t:s0"
                       storage.kubernetes.io/csiProvisionerIdentity=1726082784973-9392-openshift-storage.cephfs.csi.ceph.com
                       subvolumeName=csi-vol-aab3519a-5c94-4874-abb1-6cb90ececb6a
                       subvolumePath=/volumes/csi/csi-vol-aab3519a-5c94-4874-abb1-6cb90ececb6a/05ed12dd-b991-4bf0-9b9e-35611555e642
Events:          <none>
~~~
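For reference, the PV description above was presumably gathered with a standard describe call against the PV name shown above:
~~~
$ oc describe pv pvc-d272cca3-9f94-423c-ad37-f4993caf2949
~~~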
Specifically, two operations trigger this issue:
A. A PVC clone operation, or copying the PVC contents from a pod to a local folder on the node.
B. Creating 5 pods (osb-fscustomer-server1 through osb-fscustomer-server5) that all mount osb-fscustomer-esteira2-pvc (a pod sketch follows this list).
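To illustrate reproduction path B, here is a minimal sketch of what the five pods could look like. Only the PVC name, pod names, and namespace come from this report; the image and container command are placeholders for illustration:
~~~
# Hypothetical sketch: 5 pods sharing the same RWX CephFS PVC.
# Image and pod command are assumptions, not the customer's actual workload.
for i in 1 2 3 4 5; do
cat <<EOF | oc apply -n osb-fscustomer-esteira2 -f -
apiVersion: v1
kind: Pod
metadata:
  name: osb-fscustomer-server${i}
spec:
  containers:
  - name: app
    image: registry.access.redhat.com/ubi9/ubi-minimal
    command: ["sleep", "infinity"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: osb-fscustomer-esteira2-pvc
EOF
done
~~~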
Since we're able to successfully reproduce this issue, the customer has been given a data collection process that should give Engineering all the logs/data needed.
Version of all relevant components (if applicable):
OCP:
~~~
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.33   True        False         49d     Cluster version is 4.14.33
~~~
ODF:
~~~
NAME                                     DISPLAY                       VERSION         REPLACES                                PHASE
argocd-operator.v0.12.0                  Argo CD                       0.12.0          argocd-operator.v0.11.0                 Succeeded
cluster-logging.v5.9.6                   Red Hat OpenShift Logging     5.9.6           cluster-logging.v5.9.5                  Succeeded
loki-operator.v5.9.6                     Loki Operator                 5.9.6           loki-operator.v5.9.5                    Succeeded
mcg-operator.v4.14.10-rhodf              NooBaa Operator               4.14.10-rhodf   mcg-operator.v4.14.9-rhodf              Succeeded
ocs-operator.v4.14.10-rhodf              OpenShift Container Storage   4.14.10-rhodf   ocs-operator.v4.14.9-rhodf              Succeeded
odf-csi-addons-operator.v4.14.10-rhodf   CSI Addons                    4.14.10-rhodf   odf-csi-addons-operator.v4.14.9-rhodf   Succeeded
odf-operator.v4.14.10-rhodf              OpenShift Data Foundation     4.14.10-rhodf   odf-operator.v4.14.9-rhodf              Succeeded
~~~
Ceph:
~~~
{
    "mon": ,
    "mgr": ,
    "osd": ,
    "mds": ,
    "rgw": ,
    "overall":
}
~~~
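For reference, a sketch of the commands presumably used to collect the version information above; the exact invocations are an assumption (the Ceph output is typically gathered through the rook-ceph-tools deployment, as in the workaround below):
~~~
# Assumed collection commands for the OCP, ODF, and Ceph versions above.
$ oc get clusterversion
$ oc get csv -n openshift-storage
$ oc exec -n openshift-storage deployment/rook-ceph-tools -- ceph versions
~~~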
Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes. The customer wants to roll this out to production and has raised this issue as crucial to fix given their current timelines. Below is the customer's statement:
"This issue is impacting a major project with Telefonica Brazil. If this is a bug, we urgently need a Bugzilla report created as soon as possible to facilitate a timely resolution."
Is there any workaround available to the best of your knowledge?
Yes, the following process clears the HEALTH_WARN, but operations A and B still do not succeed:
1. Run the following command to capture the client ID and the MDS name holding on to the caps:
~~~
$ oc exec -n openshift-storage deployment/rook-ceph-tools -- ceph health detail
~~~
2. Run the following command to capture the session list:
~~~
$ oc exec -n openshift-storage deployment/rook-ceph-tools -- ceph tell mds.ocs-storagecluster-cephfilesystem:0 session ls > active-mds-session-ls.txt
~~~
3. Search for the "<client.ID>" from step 1 in active-mds-session-ls.txt. In that client's session, scroll down until you see the csi-vol information and copy the csi-vol-xxx-xxx string up to (but not including) the /. Then run the following command to dump all PV descriptions:
~~~
$ oc get pv | awk 'NR>1 {print $1}' | while read it; do oc describe pv ${it}; echo " "; done > pv.out
~~~
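If it helps, a hedged one-liner for pulling the csi-vol identifier straight out of the session dump; this assumes the subvolume path appears within roughly 40 lines of the client id in the JSON dump, so treat it as a convenience rather than part of the official procedure:
~~~
# <client.ID> is the numeric id reported by "ceph health detail" in step 1.
$ grep -A 40 '"id": <client.ID>' active-mds-session-ls.txt | grep -o 'csi-vol-[0-9a-f-]*' | head -n 1
~~~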
4. Search for that csi-vol in pv.out. The PV that references it is your problematic workload; note its Claim (namespace/PVC) and scale that workload down.
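As an illustration only, the lookup and scale-down might look like the following; the csi-vol id, workload name, and namespace are placeholders to be filled in from steps 3 and 4:
~~~
# Placeholder values: substitute the csi-vol id from step 3 and the
# workload/namespace found via the PV's Claim field.
$ grep -B 20 'csi-vol-<id>' pv.out | grep Claim:
$ oc scale deployment/<workload> -n <namespace> --replicas=0
~~~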
5. Once all pods for that workload have terminated, delete the `rook-ceph-mds-ocs-storagecluster-cephfilesystem-a` pod:
~~~
$ oc delete pod -n openshift-storage rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-<pod-suffix>
~~~
6. After a few minutes, Ceph returns to a healthy state.
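To confirm the warning has cleared, the health can be re-checked with the same command as in step 1:
~~~
$ oc exec -n openshift-storage deployment/rook-ceph-tools -- ceph health detail
~~~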
Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
Can this issue be reproduced?
Yes
If this is a regression, please provide more details to justify this:
Steps to Reproduce:
1. Perform a PVC clone operation, or copy the PVC contents from a pod to a local folder on the node (see the clone sketch after these steps).
or
2. Create 5 pods (osb-fscustomer-server1 through osb-fscustomer-server5) that all mount osb-fscustomer-esteira2-pvc.
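A minimal sketch of the PVC clone from reproduction step 1, assuming a CSI clone via dataSource. Only the source PVC, namespace, StorageClass, and 30Gi capacity come from this report; the clone name is a placeholder:
~~~
# Hypothetical clone manifest; the clone name is an assumption for illustration.
cat <<EOF | oc apply -n osb-fscustomer-esteira2 -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: osb-fscustomer-esteira2-pvc-clone
spec:
  storageClassName: vivo-cephfs-selinux-relabel
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 30Gi
  dataSource:
    kind: PersistentVolumeClaim
    name: osb-fscustomer-esteira2-pvc
EOF
~~~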
Additional info:
I will put the data collection steps given to the customer in the private comment. Once collected, we'll set needinfo.