- Bug
- Resolution: Unresolved
- Critical
- None
- odf-4.13
- None
Description of problem (please be as detailed as possible and provide log snippets):
One of the following two commands caused a severe, cluster-wide storage outage:
$ ceph --admin-daemon /var/run/ceph/<ceph-mds>.asok status
$ ceph tell mds.<mds-name> status
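For context, in an ODF cluster the "tell" form is normally issued from the rook-ceph-tools (toolbox) pod, while the admin-socket form has to be run inside the MDS pod that owns the socket. Below is a minimal sketch of both invocations; the namespace and pod labels are ODF/Rook defaults and are assumptions here, not values confirmed from this cluster.
# Sketch only -- namespace and labels are assumed ODF/Rook defaults; adjust for the actual cluster.
$ NS=openshift-storage
# Variant 1: "ceph tell" from the toolbox pod (reaches the MDS over the network)
$ TOOLS_POD=$(oc -n "$NS" get pod -l app=rook-ceph-tools -o name | head -n1)
$ oc -n "$NS" exec "$TOOLS_POD" -- ceph tell mds.<mds-name> status
# Variant 2: the admin-socket form, run inside the MDS pod itself (pick the pod for the target MDS)
$ MDS_POD=$(oc -n "$NS" get pod -l app=rook-ceph-mds -o name | head -n1)
$ oc -n "$NS" exec "$MDS_POD" -- ceph --admin-daemon /var/run/ceph/<ceph-mds>.asok status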
One of the above two commands, run either by the customer or by something in ODF, caused the MDSs to crash and then spend an excessive amount of time in journal replay, which took Ceph-backed workloads down and made them difficult to bring back up. The crash occurred while the MDS was handling the status admin socket command:
2024-05-27T13:09:42.318488158Z debug -8> 2024-05-27T13:09:42.222+0000 7f7dbf659640 5 mds.0.log _submit_thread 39083975766566~2061 : EUpdate openc [metablob 0x1005b7abe3a, 2 dirs]
2024-05-27T13:09:42.318488158Z debug -7> 2024-05-27T13:09:42.222+0000 7f7dc6667640 4 mds.0.server handle_client_request client_request(client.37787558:180979905 create #0x10012d40363/_7zlb6.nvd 2024-05-27T13:09:40.965503+0000 caller_uid=1001000000, caller_gid=0
) v4
2024-05-27T13:09:42.318488158Z debug -6> 2024-05-27T13:09:42.222+0000 7f7dbf659640 5 mds.0.log _submit_thread 39083975768647~2589 : EUpdate openc [metablob 0x10012d40362, 2 dirs]
2024-05-27T13:09:42.318507432Z debug -5> 2024-05-27T13:09:42.222+0000 7f7dc6667640 3 mds.0.server handle_client_session client_session(request_renewcaps seq 324635) from client.38893383
2024-05-27T13:09:42.318507432Z debug -4> 2024-05-27T13:09:42.222+0000 7f7dc6667640 3 mds.0.server handle_client_session client_session(request_renewcaps seq 324635) from client.38893386
2024-05-27T13:09:42.318507432Z debug -3> 2024-05-27T13:09:42.222+0000 7f7dc6667640 3 mds.0.server handle_client_session client_session(request_renewcaps seq 324635) from client.38893389
2024-05-27T13:09:42.318527336Z debug -2> 2024-05-27T13:09:42.222+0000 7f7dc6667640 3 mds.0.server handle_client_session client_session(request_renewcaps seq 324635) from client.39148864
2024-05-27T13:09:42.318527336Z debug -1> 2024-05-27T13:09:42.222+0000 7f7dc6667640 4 mds.0.server handle_client_request client_request(client.39148870:24017996 getattr AsLsXsFs #0x100b88c1123 2024-05-27T13:09:40.311495+0000 caller_uid=1000980000, caller_gid=501
) v4
2024-05-27T13:09:42.318527336Z debug 0> 2024-05-27T13:09:42.222+0000 7f7dc866b640 -1 *** Caught signal (Segmentation fault) **
2024-05-27T13:09:42.318527336Z in thread 7f7dc866b640 thread_name:admin_socket
2024-05-27T13:09:42.318527336Z
2024-05-27T13:09:42.318527336Z ceph version 17.2.6-170.el9cp (59bbeb8815ec3aeb3c8bba1e1866f8f6729eb840) quincy (stable)
2024-05-27T13:09:42.318527336Z 1: /lib64/libc.so.6(+0x54db0) [0x7f7dcb503db0]
2024-05-27T13:09:42.318527336Z 2: (MDSDaemon::dump_status(ceph::Formatter*)+0x2f6) [0x557b1b442826]
2024-05-27T13:09:42.318527336Z 3: (MDSDaemon::asok_command(std::basic_string_view<char, std::char_traits<char> >, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > >, std::less<void>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > > > > > const&, ceph::Formatter*, ceph::buffer::v15_2_0::list const&, std::function<void (int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v15_2_0::list&)>)+0x590) [0x557b1b443f40]
2024-05-27T13:09:42.318527336Z 4: ceph-mds(+0x12f7f8) [0x557b1b4447f8]
2024-05-27T13:09:42.318527336Z 5: (AdminSocket::execute_command(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, ceph::buffer::v15_2_0::list const&, std::function<void (int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v15_2_0::list&)>)+0x57a) [0x7f7dcbc56d0a]
2024-05-27T13:09:42.318527336Z 6: (AdminSocket::execute_command(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, ceph::buffer::v15_2_0::list const&, std::ostream&, ceph::buffer::v15_2_0::list*)+0x11a) [0x7f7dcbc577aa]
2024-05-27T13:09:42.318527336Z 7: (AdminSocket::do_accept()+0x2b6) [0x7f7dcbc5a976]
2024-05-27T13:09:42.318527336Z 8: (AdminSocket::entry()+0x488) [0x7f7dcbc5b7a8]
2024-05-27T13:09:42.318527336Z 9: /lib64/libstdc++.so.6(+0xdb924) [0x7f7dcb88b924]
2024-05-27T13:09:42.318527336Z 10: /lib64/libc.so.6(+0x9f802) [0x7f7dcb54e802]
2024-05-27T13:09:42.318527336Z 11: /lib64/libc.so.6(+0x3f450) [0x7f7dcb4ee450]
2024-05-27T13:09:42.318527336Z NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
The cluster has since recovered; however, someone or something (the customer or an ODF resource) ran the admin socket command or $ ceph tell mds.<name> status, which caused this crash. The expected behavior is that the MDS should not crash while handling the status command, so this points to a bug in how the status command is handled.
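For reference, the following is a rough sketch of how the crash and the subsequent journal replay can be confirmed from the toolbox pod; the filesystem name shown is the ODF default and is an assumption here, so substitute the actual name from "ceph fs ls".
# Sketch only -- run from the toolbox pod; the filesystem name is an assumed ODF default.
$ ceph crash ls                       # the segfault above should appear as a recorded crash
$ ceph crash info <crash-id>          # full backtrace and metadata for that crash
$ ceph fs status ocs-storagecluster-cephfilesystem   # per-rank MDS state, e.g. up:replay vs. up:active
$ ceph health detail                  # MDS-related health warnings while replay is in progress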
Version of all relevant components (if applicable):
OCP:
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.13.24   True        False         15d     Cluster version is 4.13.24
ODF:
NAME                                    DISPLAY                       VERSION        REPLACES                                 PHASE
mcg-operator.v4.13.7-rhodf              NooBaa Operator               4.13.7-rhodf   mcg-operator.v4.12.11-rhodf              Succeeded
ocs-operator.v4.13.7-rhodf              OpenShift Container Storage   4.13.7-rhodf   ocs-operator.v4.12.11-rhodf              Succeeded
odf-csi-addons-operator.v4.13.7-rhodf   CSI Addons                    4.13.7-rhodf   odf-csi-addons-operator.v4.12.11-rhodf   Succeeded
odf-operator.v4.13.7-rhodf              OpenShift Data Foundation     4.13.7-rhodf   odf-operator.v4.12.11-rhodf              Succeeded
Ceph:
{
  "mon": ,
  "mgr": ,
  "osd": ,
  "mds": ,
  "overall":
}
Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
This was a heavily escalated production cluster. Even after we were able to get the MDSs stable, so many workloads were affected that it took a lot of manual intervention in those namespaces (deleting pods, scaling workloads, etc.) to get them back online again.
Is there any workaround available to the best of your knowledge?
No
Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
4
Additional info:
(See Private Comment)