-
Bug
-
Resolution: Unresolved
-
Normal
-
None
-
None
-
3
-
False
-
-
False
-
?
-
tripleo-ansible-3.3.1-17.1.20250728124106.8debef3.el9ost
-
None
-
-
-
-
Pending Verification, Storage Integration Sprint 5
-
2
-
Moderate
ganesha-nfs exhausts pids_limit on startup in large environments with many shares/clients. This should be fixed with this commit.
Jul 04 20:38:13 overcloud-controller-0 ceph-nfs-pacemaker[10754]: 04/07/2025 20:38:13 : epoch 68681f95 : overcloud-controller-0 : ganesha.nfsd-1[main] main :MAIN :EVENT :ganesha.nfsd Starting: Ganesha Version 5.7-2
Jul 04 20:38:13 overcloud-controller-0 systemd[1]: Started Cluster Controlled ceph-nfs@pacemaker.
Jul 04 20:38:14 overcloud-controller-0 pacemaker-controld[4581]: notice: Result of start operation for ceph-nfs on overcloud-controller-0: ok
Jul 04 20:38:14 overcloud-controller-0 pacemaker-controld[4581]: notice: Requesting local execution of monitor operation for ceph-nfs on overcloud-controller-0
Jul 04 20:38:14 overcloud-controller-0 pacemaker-controld[4581]: notice: Result of monitor operation for ceph-nfs on overcloud-controller-0: ok
Jul 04 20:38:14 overcloud-controller-0 ceph-nfs-pacemaker[10754]: 04/07/2025 20:38:14 : epoch 68681f95 : overcloud-controller-0 : ganesha.nfsd-1[main] nfs_set_param_from_conf :NFS STARTUP :EVENT :Configuration file successfully parsed
Jul 04 20:38:14 overcloud-controller-0 ceph-nfs-pacemaker[10754]: 04/07/2025 20:38:14 : epoch 68681f95 : overcloud-controller-0 : ganesha.nfsd-1[main] monitoring_init :NFS STARTUP :EVENT :Init monitoring at 0.0.0.0:9587
Jul 04 20:38:14 overcloud-controller-0 ceph-nfs-pacemaker[10754]: 04/07/2025 20:38:14 : epoch 68681f95 : overcloud-controller-0 : ganesha.nfsd-1[main] fsal_init_fds_limit :MDCACHE LRU :EVENT :Setting the system-imposed limit on FDs to 1048576.
Jul 04 20:38:14 overcloud-controller-0 ceph-nfs-pacemaker[10754]: 04/07/2025 20:38:14 : epoch 68681f95 : overcloud-controller-0 : ganesha.nfsd-1[main] init_server_pkgs :NFS STARTUP :EVENT :Initializing ID Mapper.
Jul 04 20:38:14 overcloud-controller-0 ceph-nfs-pacemaker[10754]: 04/07/2025 20:38:14 : epoch 68681f95 : overcloud-controller-0 : ganesha.nfsd-1[main] init_server_pkgs :NFS STARTUP :EVENT :ID Mapper successfully initialized.
Jul 04 20:38:14 overcloud-controller-0 ceph-nfs-pacemaker[10754]: 04/07/2025 20:38:14 : epoch 68681f95 : overcloud-controller-0 : ganesha.nfsd-1[main] rados_kv_init :CLIENT ID :EVENT :Rados kv store init done
Jul 04 20:38:14 overcloud-controller-0 ceph-nfs-pacemaker[10754]: 04/07/2025 20:38:14 : epoch 68681f95 : overcloud-controller-0 : ganesha.nfsd-1[main] nfs_start_grace :STATE :EVENT :NFS Server Now IN GRACE, duration 90
Jul 04 20:38:14 overcloud-controller-0 ceph-nfs-pacemaker[10754]: 04/07/2025 20:38:14 : epoch 68681f95 : overcloud-controller-0 : ganesha.nfsd-1[main] nfs_start_grace :STATE :EVENT :grace reload client info completed from backend
Jul 04 20:38:14 overcloud-controller-0 ceph-nfs-pacemaker[10754]: 04/07/2025 20:38:14 : epoch 68681f95 : overcloud-controller-0 : ganesha.nfsd-1[main] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
Jul 04 20:42:38 overcloud-controller-0 ceph-nfs-pacemaker[10754]: 04/07/2025 20:42:38 : epoch 68681f95 : overcloud-controller-0 : ganesha.nfsd-1[main] reclaim_reset :FSAL :EVENT :start_reclaim failed: No such file or directory
Jul 04 20:42:38 overcloud-controller-0 ceph-nfs-pacemaker[10754]: 04/07/2025 20:42:38 : epoch 68681f95 : overcloud-controller-0 : ganesha.nfsd-1[main] reclaim_reset :FSAL :EVENT :start_reclaim failed: No such file or directory
Jul 04 20:42:38 overcloud-controller-0 ceph-nfs-pacemaker[10754]: 04/07/2025 20:42:38 : epoch 68681f95 : overcloud-controller-0 : ganesha.nfsd-1[main] reclaim_reset :FSAL :EVENT :start_reclaim failed: No such file or directory
Jul 04 20:42:38 overcloud-controller-0 ceph-nfs-pacemaker[10754]: 04/07/2025 20:42:38 : epoch 68681f95 : overcloud-controller-0 : ganesha.nfsd-1[main] reclaim_reset :FSAL :EVENT :start_reclaim failed: No such file or directory
Jul 04 20:42:38 overcloud-controller-0 ceph-nfs-pacemaker[10754]: 04/07/2025 20:42:38 : epoch 68681f95 : overcloud-controller-0 : ganesha.nfsd-1[main] reclaim_reset :FSAL :EVENT :start_reclaim failed: No such file or directory
Jul 04 20:42:38 overcloud-controller-0 ceph-nfs-pacemaker[10754]: 04/07/2025 20:42:38 : epoch 68681f95 : overcloud-controller-0 : ganesha.nfsd-1[main] reclaim_reset :FSAL :EVENT :start_reclaim failed: No such file or directory
Jul 04 20:42:38 overcloud-controller-0 ceph-nfs-pacemaker[10754]: 04/07/2025 20:42:38 : epoch 68681f95 : overcloud-controller-0 : ganesha.nfsd-1[main] reclaim_reset :FSAL :EVENT :start_reclaim failed: No such file or directory
Jul 04 20:42:38 overcloud-controller-0 ceph-nfs-pacemaker[10754]: Thread::try_create(): pthread_create failed with error 11/builddir/build/BUILD/ceph-18.2.1/src/common/Thread.cc: In function 'void Thread::create(const char*, size_t)' thread 7fd7ffbaa200 time 2025-07-04T20:42:38.198162+0200
Jul 04 20:42:38 overcloud-controller-0 ceph-nfs-pacemaker[10754]: /builddir/build/BUILD/ceph-18.2.1/src/common/Thread.cc: 165: FAILED ceph_assert(ret == 0)
Jul 04 20:42:38 overcloud-controller-0 ceph-nfs-pacemaker[10754]: ceph version 18.2.1-262.el9cp (4857b2aad4c3aaa8ff58e0b60396fa6ab731f9ff) reef (stable)
Jul 04 20:42:38 overcloud-controller-0 ceph-nfs-pacemaker[10754]: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x12e) [0x7fd7ff1cf1c9]
Jul 04 20:42:38 overcloud-controller-0 ceph-nfs-pacemaker[10754]: 2: /usr/lib64/ceph/libceph-common.so.2(+0x16e387) [0x7fd7ff1cf387]
Jul 04 20:42:38 overcloud-controller-0 ceph-nfs-pacemaker[10754]: 3: (Thread::create(char const*, unsigned long)+0xbc) [0x7fd7ff2c1c8c]
Jul 04 20:42:38 overcloud-controller-0 ceph-nfs-pacemaker[10754]: 4: /lib64/libcephfs.so.2(+0xfb182) [0x7fd7fe1dd182]
Jul 04 20:42:38 overcloud-controller-0 ceph-nfs-pacemaker[10754]: 5: /lib64/libcephfs.so.2(+0x4c593) [0x7fd7fe12e593]
Jul 04 20:42:38 overcloud-controller-0 ceph-nfs-pacemaker[10754]: 6: /usr/lib64/ganesha/libfsalceph.so(+0xe3fb) [0x7fd7fecf73fb]
Jul 04 20:42:38 overcloud-controller-0 ceph-nfs-pacemaker[10754]: 7: /lib64/libganesha_nfsd.so.5.7(+0x138bbd) [0x7fd800d8abbd]
Jul 04 20:42:38 overcloud-controller-0 ceph-nfs-pacemaker[10754]: 8: /lib64/libganesha_nfsd.so.5.7(+0xa74cb) [0x7fd800cf94cb]
Jul 04 20:42:38 overcloud-controller-0 ceph-nfs-pacemaker[10754]: 9: /lib64/libganesha_nfsd.so.5.7(+0x6b142) [0x7fd800cbd142]
Jul 04 20:42:38 overcloud-controller-0 ceph-nfs-pacemaker[10754]: 10: /lib64/libganesha_nfsd.so.5.7(+0x6adc9) [0x7fd800cbcdc9]
Jul 04 20:42:38 overcloud-controller-0 ceph-nfs-pacemaker[10754]: 11: load_config_from_parse()
Jul 04 20:42:38 overcloud-controller-0 ceph-nfs-pacemaker[10754]: 12: ReadExports()
Jul 04 20:42:38 overcloud-controller-0 ceph-nfs-pacemaker[10754]: 13: main()
Jul 04 20:42:38 overcloud-controller-0 ceph-nfs-pacemaker[10754]: 14: /lib64/libc.so.6(+0x29590) [0x7fd800a72590]
Jul 04 20:42:38 overcloud-controller-0 ceph-nfs-pacemaker[10754]: 15: __libc_start_main()
Jul 04 20:42:38 overcloud-controller-0 ceph-nfs-pacemaker[10754]: 16: _start()
Jul 04 20:42:38 overcloud-controller-0 kernel: traps: ganesha.nfsd[10756] general protection fault ip:7fd800a71898 sp:7ffd55e37680 error:0 in libc.so.6[7fd800a71000+175000]
Jul 04 20:42:59 overcloud-controller-0 systemd-coredump[143507]: Resource limits disable core dumping for process 10756 (ganesha.nfsd).
Jul 04 20:42:59 overcloud-controller-0 systemd-coredump[143507]: Process 10756 (ganesha.nfsd) of user 0 dumped core.
The kernel dmesg has messages like this:
[19310.828096] cgroup: fork rejected by pids controller in /machine.slice/libpod-b27a9b998a20cab853cedf9e238718fb2da9960f9265ba8c04e9294901c106c7.scope/container
[19612.533615] cgroup: fork rejected by pids controller in /machine.slice/libpod-58195961de592f477e86955b7a45509424a8d6765f1bfdcfcfb2edbd6d7cd282.scope/container
[19914.005506] cgroup: fork rejected by pids controller in /machine.slice/libpod-da0321c18ead8689836a72f0833d919e1debb08b256e98e508819111f0caa918.scope/container
This suggests that resource limits are imposed on the ganesha service container. The Ceph FSAL cannot create a new thread from libcephfs (pthread_create fails with error 11, EAGAIN; stack item 3) and then asserts because it cannot proceed without that thread. libcephfs therefore does not appear to be malfunctioning; it is being starved of the resources it requires.
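To confirm the diagnosis above on an affected node, the following is a minimal Python sketch, not part of the fix itself. It extracts the cgroup paths from the "fork rejected by pids controller" messages and reads the pids controller counters for a cgroup. The function names are hypothetical, and reading pids.current and pids.max assumes a cgroup v2 hierarchy (commonly mounted at /sys/fs/cgroup, which is also an assumption here).

```python
import re
from pathlib import Path
from typing import List, Optional, Tuple

# Matches kernel messages of the form seen in this report:
# "cgroup: fork rejected by pids controller in /machine.slice/libpod-<id>.scope/container"
FORK_REJECTED = re.compile(r"cgroup: fork rejected by pids controller in (\S+)")


def rejected_cgroups(dmesg_text: str) -> List[str]:
    """Return the cgroup paths named in 'fork rejected' messages, in order."""
    return FORK_REJECTED.findall(dmesg_text)


def pids_usage(cgroup_dir: Path) -> Tuple[int, Optional[int]]:
    """Read (pids.current, pids.max) from a cgroup v2 directory.

    pids.max holds either a number or the literal string 'max' (no limit),
    which is returned here as None.
    """
    current = int((cgroup_dir / "pids.current").read_text().strip())
    raw_max = (cgroup_dir / "pids.max").read_text().strip()
    return current, (None if raw_max == "max" else int(raw_max))
```

On a live system, a path returned by rejected_cgroups() could be joined onto the cgroup mount point, for example Path("/sys/fs/cgroup") / path.lstrip("/"), and passed to pids_usage() to see how close the container is to its limit. The libpod scopes in the messages above may no longer exist once the container has been restarted.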