Bug
Resolution: Unresolved
Description of problem - Provide a detailed description of the issue encountered, including logs/command-output snippets and screenshots if the issue is observed in the UI:
This is coming from https://issues.redhat.com/browse/DFBUGS-858
Current status is that the Ceph filesystem is damaged:
sh-5.1$ ceph -s
  cluster:
    id:     27eddc2c-03dd-42e2-8672-b52e6b2d3aa9
    health: HEALTH_ERR
            mons are allowing insecure global_id reclaim
            1 filesystem is degraded
            1 filesystem is offline
            1 mds daemon damaged
            36 daemons have recently crashed

  services:
    mon: 3 daemons, quorum b,c,e (age 58m)
    mgr: a(active, since 37m), standbys: b
    mds: 0/1 daemons up, 1 standby
    osd: 12 osds: 12 up (since 88m), 12 in (since 3d)

  data:
    volumes: 0/1 healthy, 1 recovering; 1 damaged
    pools:   12 pools, 281 pgs
    objects: 46.07M objects, 4.4 TiB
    usage:   14 TiB used, 34 TiB / 48 TiB avail
    pgs:     279 active+clean
             2   active+clean+scrubbing+deep
and the MDS daemons are not able to become active (they start as standby).
Tried to follow:
https://access.redhat.com/solutions/6889971

sh-5.1$ ceph mds repaired ocp7-storagecluster-cephfilesystem:0

But that did not help; the FS remains damaged.
Then tried https://access.redhat.com/solutions/6123271,
but the first command failed:
sh-5.1$ ceph tell mds.ocp7-storagecluster-cephfilesystem:0 damage ls
terminate called after throwing an instance of 'std::out_of_range'
  what():  map::at
Aborted (core dumped)
Also tried to run a repair, but got the same error:
sh-5.1$ ceph tell mds.ocp7-storagecluster-cephfilesystem:0 scrub start / force repair
terminate called after throwing an instance of 'std::out_of_range'
  what():  map::at
Aborted (core dumped)
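For reference, before attempting further repairs it can help to capture the full damage/crash context. A minimal diagnostic sketch, run from the toolbox shell shown above; these are standard Ceph CLI commands, and the crash-ID step is illustrative:

```shell
# Overall health with the specific reasons behind HEALTH_ERR
ceph health detail

# Per-filesystem MDS states (up:standby, damaged, etc.)
ceph fs status ocp7-storagecluster-cephfilesystem

# List the recently crashed daemons (36 were reported by `ceph -s`)
ceph crash ls

# Then inspect a specific crash, using an ID from `ceph crash ls`:
# ceph crash info <crash-id>
```

This does not repair anything; it only records the MDS/crash state so the damage can be analyzed even though `damage ls` itself aborts.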
The OCP platform infrastructure and deployment type (AWS, Bare Metal, VMware, etc. Please clarify if it is platform agnostic deployment), (IPI/UPI):
VMware
The ODF deployment type (Internal, External, Internal-Attached (LSO), Multicluster, DR, Provider, etc):
ODF 4.15.7 Internal mode
The version of all relevant components (OCP, ODF, RHCS, ACM whichever is applicable):
Does this issue impact your ability to continue to work with the product?
Yes
Is there any workaround available to the best of your knowledge?
No
Can this issue be reproduced? If so, please provide the hit rate
Not sure; the problem started when the customer deleted three mons and their PVCs by mistake.
Can this issue be reproduced from the UI?
N/A
If this is a regression, please provide more details to justify this:
Steps to Reproduce:
see above
The exact date and time when the issue was observed, including timezone details:
15 Nov
Actual results:
Cannot repair the CephFS filesystem.
Expected results:
The CephFS filesystem is repaired and healthy.
Logs collected and log location:
Additional info: