Type: Bug
Resolution: Unresolved
Priority: Major
Affects Version: 6.16.0
Description of problem:
When the Library and a few other lifecycle environments are added to a Capsule for content syncing, and several repo sync, CV publish, etc. tasks are performed concurrently on the Satellite, many capsule sync tasks get generated. Because of this, we very frequently see pulpcore-api dumping core on the Capsule, along with some SELinux denials being reported.
For a Capsule with enough available memory this may not lead to any task failures, but a Capsule that does not have enough memory may end up killing the pulpcore worker shortly after the pulpcore-api core dump, resulting in the entire capsule sync failing.
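For reference, the pile-up of capsule sync tasks can be watched from the Satellite while the concurrent syncs/publishes are running. A minimal hammer sketch (the task label below is the one Katello normally uses; adjust the search if it differs on your version):

# capsule sync tasks currently running on the Satellite
hammer task list --search 'label = Actions::Katello::CapsuleContent::Sync and state = running'

# a rough running count of them
hammer --no-headers --output csv task list --search 'label = Actions::Katello::CapsuleContent::Sync and state = running' | wc -l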
How reproducible:
Always under specific circumstances
Is this issue a regression from an earlier version:
It seems so, as I cannot reproduce the same behavior on 6.15.
Steps to Reproduce:
1. Install a Satellite 6.16 (and import a manifest) and a Capsule 6.16 (a quick sizing/manifest sketch follows the bullets).
* The Satellite should have 20 GB of RAM and 4 vCPUs
* The Capsule should have 12 GB of RAM and 2 or 4 vCPUs
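A rough sketch of the sizing check and manifest import mentioned in step 1 (the organization name and manifest path are placeholders):

# confirm the boxes match the sizing above
free -g; nproc

# import the subscription manifest into the organization
hammer subscription upload --organization "MyOrg" --file ./manifest.zip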
2. Follow https://access.redhat.com/solutions/2139111 on the Capsule to ensure that core dumping is properly configured at the systemd level (see the sketch below).
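The linked KCS article is the authoritative reference for step 2; roughly, it comes down to raising the core resource limit for systemd services so systemd-coredump can store the dumps. A hedged sketch (the drop-in file name is an example):

# the "Resource limits disable core dumping" message seen below means LimitCORE is 0 for the service
mkdir -p /etc/systemd/system.conf.d
cat > /etc/systemd/system.conf.d/90-coredump.conf <<'EOF'
[Manager]
DefaultLimitCORE=infinity
EOF

systemctl daemon-reexec
systemctl restart pulpcore-api pulpcore-content "pulpcore-worker@*"

# verify the limit actually applied
systemctl show pulpcore-api -p LimitCORE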
3. Create a lifecycle environment path on the Satellite that looks like Library --> QA --> Prod
4. Associate all of those lifecycle environments with the target Capsule (see the hammer sketch below)
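A minimal hammer sketch for steps 3 and 4 (organization and capsule names are placeholders, and the exact option names of "capsule content add-lifecycle-environment" vary a bit between versions, so check its --help):

# build the Library --> QA --> Prod path
hammer lifecycle-environment create --organization "MyOrg" --name "QA" --prior "Library"
hammer lifecycle-environment create --organization "MyOrg" --name "Prod" --prior "QA"

# make the capsule carry content for all three environments
for env in Library QA Prod; do
  hammer capsule content add-lifecycle-environment \
    --organization "MyOrg" --name "capsule616.example.com" --lifecycle-environment "$env"
done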
5. Enable the following repos on the Satellite and sync them as concurrently as possible (a hammer sketch follows the list).
Red Hat CodeReady Linux Builder for RHEL 8 x86_64 RPMs 8
Red Hat CodeReady Linux Builder for RHEL 9 x86_64 RPMs 9
Red Hat Discovery 1 for RHEL 8 x86_64 RPMs
Red Hat Discovery 1 for RHEL 9 x86_64 RPMs
Red Hat Enterprise Linux 7 Server - Extended Life Cycle Support RPMs x86_64
Red Hat Enterprise Linux 7 Server - Extras RPMs x86_64
Red Hat Enterprise Linux 7 Server Kickstart x86_64 7.9
Red Hat Enterprise Linux 7 Server RPMs x86_64 7Server
Red Hat Enterprise Linux 8 for x86_64 - AppStream - Extended Update Support RPMs 8.8
Red Hat Enterprise Linux 8 for x86_64 - AppStream Kickstart 8.10
Red Hat Enterprise Linux 8 for x86_64 - AppStream Kickstart 8.9
Red Hat Enterprise Linux 8 for x86_64 - AppStream RPMs 8
Red Hat Enterprise Linux 8 for x86_64 - AppStream RPMs 8.10
Red Hat Enterprise Linux 8 for x86_64 - AppStream RPMs 8.8
Red Hat Enterprise Linux 8 for x86_64 - AppStream RPMs 8.9
Red Hat Enterprise Linux 8 for x86_64 - BaseOS - Extended Update Support RPMs 8.8
Red Hat Enterprise Linux 8 for x86_64 - BaseOS Kickstart 8.10
Red Hat Enterprise Linux 8 for x86_64 - BaseOS Kickstart 8.9
Red Hat Enterprise Linux 8 for x86_64 - BaseOS RPMs 8
Red Hat Enterprise Linux 8 for x86_64 - BaseOS RPMs 8.10
Red Hat Enterprise Linux 8 for x86_64 - BaseOS RPMs 8.8
Red Hat Enterprise Linux 8 for x86_64 - BaseOS RPMs 8.9
Red Hat Enterprise Linux 9 for x86_64 - AppStream - Extended Update Support RPMs 9.4
Red Hat Enterprise Linux 9 for x86_64 - AppStream Kickstart 9.3
Red Hat Enterprise Linux 9 for x86_64 - AppStream Kickstart 9.4
Red Hat Enterprise Linux 9 for x86_64 - AppStream RPMs 9
Red Hat Enterprise Linux 9 for x86_64 - AppStream RPMs 9.3
Red Hat Enterprise Linux 9 for x86_64 - AppStream RPMs 9.4
Red Hat Enterprise Linux 9 for x86_64 - BaseOS - Extended Update Support RPMs 9.4
Red Hat Enterprise Linux 9 for x86_64 - BaseOS Kickstart 9.3
Red Hat Enterprise Linux 9 for x86_64 - BaseOS Kickstart 9.4
Red Hat Enterprise Linux 9 for x86_64 - BaseOS RPMs 9
Red Hat Enterprise Linux 9 for x86_64 - BaseOS RPMs 9.3
Red Hat Enterprise Linux 9 for x86_64 - BaseOS RPMs 9.4
Red Hat Satellite Client 6 for RHEL 7 Server - ELS RPMS x86_64
Red Hat Satellite Client 6 for RHEL 8 x86_64 RPMs
Red Hat Satellite Client 6 for RHEL 9 x86_64 RPMs
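A hedged hammer sketch for step 5; the product/repository-set names below are just one example from the list (match them to the actual repository set names), and --async is what lets the syncs run concurrently:

# enable one repo set (repeat per repo in the list above)
hammer repository-set enable --organization "MyOrg" \
  --product "Red Hat Enterprise Linux for x86_64" \
  --name "Red Hat Enterprise Linux 8 for x86_64 - BaseOS (RPMs)" \
  --releasever "8" --basearch "x86_64"

# kick off syncs for every enabled repo without waiting for each one to finish
for id in $(hammer --no-headers --output csv repository list --organization "MyOrg" --fields Id); do
  hammer repository synchronize --id "$id" --async
done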
6. Create, publish and promote these CVs one after another (a hammer sketch follows the list).
RHEL7 --> All RHEL 7 repos added, published and promoted to all lifecycles
RHEL8 --> All RHEL 8 non-EUS repos added, published and promoted to all lifecycles
RHEL9 --> All RHEL 9 non-EUS repos added, published and promoted to all lifecycles
RHEL8-EUS --> All RHEL 8 EUS repos added, published and promoted to all lifecycles
RHEL9-EUS --> All RHEL 9 EUS repos added, published and promoted to all lifecycles
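A sketch of step 6 for one of the CVs (RHEL8); the repository/product names and the version number are examples:

hammer content-view create --organization "MyOrg" --name "RHEL8"

# attach each RHEL 8 non-EUS repo (repeat per repo)
hammer content-view add-repository --organization "MyOrg" --name "RHEL8" \
  --product "Red Hat Enterprise Linux for x86_64" \
  --repository "Red Hat Enterprise Linux 8 for x86_64 - BaseOS RPMs 8"

hammer content-view publish --organization "MyOrg" --name "RHEL8"

# promote the freshly published version through the path (adjust --version to the one just published)
for env in QA Prod; do
  hammer content-view version promote --organization "MyOrg" --content-view "RHEL8" \
    --version "1.0" --to-lifecycle-environment "$env"
done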
7. Go to the Capsule server and run the following check:
# egrep "dumped core|SELinux is preventing" /var/log/messages -A10 -B10
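With core dumps enabled, the same evidence can also be pulled from journald/auditd directly:

# core dumps collected by systemd-coredump for pulpcore-api
coredumpctl list pulpcore-api

# stack trace of the most recent pulpcore-api dump
coredumpctl info pulpcore-api

# the SELinux denials behind the setroubleshoot messages
ausearch -m AVC -ts recent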
Actual behavior:
Without coredump properly configured, we very frequently see something like this:
Sep 9 12:05:35 capsule616 systemd[1]: Created slice system-systemd\x2dcoredump.slice.
Sep 9 12:05:35 capsule616 systemd[1]: Started Process Core Dump (PID 15853/UID 0).
Sep 9 12:05:35 capsule616 pulpcore-api[15576]: pulp [c6ae74c6-46bc-43c9-aa9d-8ec1637ce2d9]: - - [09/Sep/2024:06:35:35 +0000] "GET /pulp/api/v3/tasks/0191d57d-e93d-764b-88da-638df4daf2be/ HTTP/1.1" 200 629 "-" "OpenAPI-Generator/3.49.17/ruby"
Sep 9 12:05:35 capsule616 systemd-coredump[15854]: Resource limits disable core dumping for process 9864 (pulpcore-api).
Sep 9 12:05:35 capsule616 systemd-coredump[15854]: Process 9864 (pulpcore-api) of user 988 dumped core.
Sep 9 12:05:35 capsule616 systemd[1]: systemd-coredump@0-15853-0.service: Succeeded.
Sep 9 12:05:35 capsule616 pulpcore-api[9832]: [2024-09-09 06:35:35 +0000] [9832] [ERROR] Worker (pid:9864) was sent code 134!
Sep 9 12:05:35 capsule616 pulpcore-api[15862]: [2024-09-09 06:35:35 +0000] [15862] [INFO] Booting worker with pid: 15862
Sep 9 12:05:35 capsule616 pulpcore-api[15862]: pulp [None]: pulpcore.app.entrypoint:INFO: Api App '15862@capsule616.example.com' failed to write a heartbeat to the database, sleeping for '45.0' seconds.
--
Sep 9 12:06:12 capsule616 pulpcore-api[9867]: [2024-09-09 06:36:12 +0000] [9867] [INFO] Autorestarting worker after current request.
Sep 9 12:06:12 capsule616 pulpcore-api[9867]: pulp [c6ae74c6-46bc-43c9-aa9d-8ec1637ce2d9]: - - [09/Sep/2024:06:36:12 +0000] "GET /pulp/api/v3/tasks/0191d576-402a-7922-89d1-914cb48b07e3/ HTTP/1.1" 200 629 "-" "OpenAPI-Generator/3.49.17/ruby"
Sep 9 12:06:12 capsule616 pulpcore-api[9867]: [2024-09-09 06:36:12 +0000] [9867] [INFO] Worker exiting (pid: 9867)
Sep 9 12:06:12 capsule616 systemd[1]: Started Process Core Dump (PID 16055/UID 0).
Sep 9 12:06:12 capsule616 systemd-coredump[16056]: Resource limits disable core dumping for process 9867 (pulpcore-api).
Sep 9 12:06:12 capsule616 systemd-coredump[16056]: Process 9867 (pulpcore-api) of user 988 dumped core.
Sep 9 12:06:13 capsule616 pulpcore-api[9832]: [2024-09-09 06:36:12 +0000] [9832] [ERROR] Worker (pid:9867) was sent code 134!
Sep 9 12:06:13 capsule616 pulpcore-api[16057]: [2024-09-09 06:36:12 +0000] [16057] [INFO] Booting worker with pid: 16057
Sep 9 12:06:13 capsule616 pulpcore-api[16057]: pulp [None]: pulpcore.app.entrypoint:INFO: Api App '16057@capsule616.example.com' failed to write a heartbeat to the database, sleeping for '45.0' seconds.
Sep 9 12:06:13 capsule616 systemd[1]: systemd-coredump@1-16055-0.service: Succeeded.
Sep 9 12:06:13 capsule616 pulpcore-api[16057]: pulp [c6ae74c6-46bc-43c9-aa9d-8ec1637ce2d9]: - - [09/Sep/2024:06:36:13 +0000] "GET /pulp/api/v3/tasks/0191d57e-e082-73aa-9326-5b07ace2e022/ HTTP/1.1" 200 629 "-" "OpenAPI-Generator/3.49.17/ru
With coredump configured properly, we see something like this:
Sep 13 17:46:09 capsule616 systemd[1]: Listening on Pulp Content App socket.
Sep 13 17:46:09 capsule616 systemd[1]: Starting Pulp Content App...
Sep 13 17:46:09 capsule616 systemd[1]: Started Process Core Dump (PID 138194/UID 0).
Sep 13 17:46:10 capsule616 systemd-coredump[138195]: Process 137796 (pulpcore-api) of user 988 dumped core.#012#012Stack trace of thread 137796:#012#0 0x00007f222cbde52f raise (libc.so.6)#012#1 0x00007f222cbb1e65 abort (libc.so.6)#012#2 0x00007f2219090907 ffi_closure_free.cold.16 (libffi.so.6)#012#3 0x00007f22192a008c CThunkObject_dealloc (_ctypes.cpython-311-x86_64-linux-gnu.so)#012#4 0x00007f222dafa5c7 dict_dealloc (libpython3.11.so.1.0)#012#5 0x00007f22192aa1a1 PyCData_clear (_ctypes.cpython-311-x86_64-linux-gnu.so)#012#6 0x00007f221929f7c0 PyCFuncPtr_dealloc (_ctypes.cpython-311-x86_64-linux-gnu.so)#012#7 0x00007f222db50388 subtype_dealloc (libpython3.11.so.1.0)#012#8 0x00007f222dae763c free_keys_object (libpython3.11.so.1.0)#012#9 0x00007f222dae7c79 dict_tp_clear (libpython3.11.so.1.0)#012#10 0x00007f222db0e234 gc_collect_main (libpython3.11.so.1.0)#012#11 0x00007f222dc52df0 _PyGC_CollectNoFail (libpython3.11.so.1.0)#012#12 0x00007f222dbbd7e3 finalize_modules (libpython3.11.so.1.0)#012#13 0x00007f222dc60400 Py_FinalizeEx (libpython3.11.so.1.0)#012#14 0x00007f222dc604f8 Py_Exit (libpython3.11.so.1.0)#012#15 0x00007f222dc6053b handle_system_exit (libpython3.11.so.1.0)#012#16 0x00007f222dc60578 PyErr_PrintEx (libpython3.11.so.1.0)#012#17 0x00007f222dac4eee _PyRun_SimpleFileObject.cold.3748 (libpython3.11.so.1.0)#012#18 0x00007f222dc6ba14 _PyRun_AnyFileObject (libpython3.11.so.1.0)#012#19 0x00007f222dc6bd1e Py_RunMain (libpython3.11.so.1.0)#012#20 0x00007f222dc6c9e9 Py_BytesMain (libpython3.11.so.1.0)#012#21 0x00007f222cbca7e5 __libc_start_main (libc.so.6)#012#22 0x00005556843ec74e _start (python3.11)
Sep 13 17:46:10 capsule616 systemd[1]: systemd-coredump@67-138194-0.service: Succeeded.
Sep 13 17:46:10 capsule616 systemd[1]: pulpcore-api.service: Main process exited, code=dumped, status=6/ABRT
Sep 13 17:46:10 capsule616 systemd[1]: pulpcore-api.service: Failed with result 'core-dump'.
Sep 13 17:46:10 capsule616 systemd[1]: Stopped Pulp API Server.
Sep 13 17:46:10 capsule616 systemd[1]: pulpcore-api.socket: Succeeded.
Sep 13 17:46:10 capsule616 systemd[1]: Closed Pulp API Server socket.
Sep 13 17:46:10 capsule616 systemd[1]: Stopping Pulp API Server socket.
Sep 13 17:46:10 capsule616 systemd[1]: Starting Pulp API Server socket.
Sep 13 17:46:10 capsule616 systemd[1]: Listening on Pulp API Server socket.
Sep 13 17:46:10 capsule616 systemd[1]: Starting Pulp API Server...
Sep 13 17:46:11 capsule616 pulpcore-content[138191]: [2024-09-13 12:16:11 +0000] [138191] [INFO] Starting gunicorn 22.0.0
Sep 13 17:46:11 capsule616 pulpcore-content[138191]: [2024-09-13 12:16:11 +0000] [138191] [INFO] Listening at: unix:/run/pulpcore-content.sock (138191)
Sep 13 17:46:11 capsule616 pulpcore-content[138191]: [2024-09-13 12:16:11 +0000] [138191] [INFO] Using worker: aiohttp.GunicornWebWorker
Sep 13 17:46:11 capsule616 systemd[1]: Started Pulp Content App.
Sep 13 17:46:11 capsule616 pulpcore-content[138212]: [2024-09-13 12:16:11 +0000] [138212] [INFO] Booting worker with pid: 138212
Sep 13 17:46:11 capsule616 pulpcore-content[138220]: [2024-09-13 12:16:11 +0000] [138220] [INFO] Booting worker with pid: 138220
Sep 13 17:46:11 capsule616 /SetroubleshootPrivileged.py[138161]: failed to retrieve rpm info for /var/lib/selinux/targeted/active/modules/400/pulpcore
Sep 13 17:46:11 capsule616 setroubleshoot[138124]: SELinux is preventing /usr/bin/python3.11 from write access on the file memfd:libffi. For complete SELinux messages run: sealert -l a13c222c-5722-4c71-b3f0-570e358d83c9
Sep 13 17:46:11 capsule616 setroubleshoot[138124]: SELinux is preventing /usr/bin/python3.11 from write access on the file memfd:libffi.#012#012***** Plugin catchall (100. confidence) suggests **************************#012#012If you believe that python3.11 should be allowed write access on the memfd:libffi file by default.#012Then you should report this as a bug.#012You can generate a local policy module to allow this access.#012Do#012allow this access for now by executing:#012# ausearch -c 'pulpcore-api' --raw | audit2allow -M my-pulpcoreapi#012# semodule -X 300 -i my-pulpcoreapi.pp#012
Sep 13 17:46:12 capsule616 pulpcore-content[138228]: [2024-09-13 12:16:12 +0000] [138228] [INFO] Booting worker with pid: 138228
Sep 13 17:46:12 capsule616 pulpcore-content[138229]: [2024-09-13 12:16:12 +0000] [138229] [INFO] Booting worker with pid: 138229
Sep 13 17:46:12 capsule616 pulpcore-content[138231]: [2024-09-13 12:16:12 +0000] [138231] [INFO] Booting worker with pid: 138231
Sep 13 17:46:12 capsule616 /SetroubleshootPrivileged.py[138161]: failed to retrieve rpm info for /var/lib/selinux/targeted/active/modules/400/pulpcore
Sep 13 17:46:12 capsule616 setroubleshoot[138124]: SELinux is preventing /usr/bin/python3.11 from map access on the file /memfd:libffi (deleted). For complete SELinux messages run: sealert -l e9a8d660-8c95-4ee9-b966-d98fa04a677a
Sep 13 17:46:12 capsule616 setroubleshoot[138124]: SELinux is preventing /usr/bin/python3.11 from map access on the file /memfd:libffi (deleted).#012#012***** Plugin restorecon (92.2 confidence) suggests ************************#012#012If you want to fix the label. #012/memfd:libffi (deleted) default label should be etc_runtime_t.#012Then you can run restorecon. The access attempt may have been stopped due to insufficient permissions to access a parent directory in which case try to change the following command accordingly.#012Do#012# /sbin/restorecon -v /memfd:libffi (deleted)#012#012***** Plugin catchall_boolean (7.83 confidence) suggests ******************#012#012If you want to allow domain to can mmap files#012Then you must tell SELinux about this by enabling the 'domain_can_mmap_files' boolean.#012#012Do#012setsebool -P domain_can_mmap_files 1#012#012***** Plugin catchall (1.41 confidence) suggests **************************#012#012If you believe that python3.11 should be allowed map access on the memfd:libffi (deleted) file by default.#012Then you should report this as a bug.#012You can generate a local policy module to allow this access.#012Do#012allow this access for now by executing:#012# ausearch -c 'pulpcore-api' --raw | audit2allow -M my-pulpcoreapi#012# semodule -X 300 -i my-pulpcoreapi.pp#012
Sep 13 17:46:12 capsule616 /SetroubleshootPrivileged.py[138161]: failed to retrieve rpm info for /var/lib/selinux/targeted/active/modules/400/pulpcore
Sep 13 17:46:12 capsule616 setroubleshoot[138124]: SELinux is preventing /usr/bin/python3.11 from map access on the file /memfd:libffi (deleted). For complete SELinux messages run: sealert -l e9a8d660-8c95-4ee9-b966-d98fa04a677a
Sep 13 17:46:12 capsule616 setroubleshoot[138124]: SELinux is preventing /usr/bin/python3.11 from map access on the file /memfd:libffi (deleted).#012#012***** Plugin restorecon (92.2 confidence) suggests ************************#012#012If you want to fix the label. #012/memfd:libffi (deleted) default label should be etc_runtime_t.#012Then you can run restorecon. The access attempt may have been stopped due to insufficient permissions to access a parent directory in which case try to change the following command accordingly.#012Do#012# /sbin/restorecon -v /memfd:libffi (deleted)#012#012***** Plugin catchall_boolean (7.83 confidence) suggests ******************#012#012If you want to allow domain to can mmap files#012Then you must tell SELinux about this by enabling the 'domain_can_mmap_files' boolean.#012#012Do#012setsebool -P domain_can_mmap_files 1#012#012***** Plugin catchall (1.41 confidence) suggests **************************#012#012If you believe that python3.11 should be allowed map access on the memfd:libffi (deleted) file by default.#012Then you should report this as a bug.#012You can generate a local policy module to allow this access.#012Do#012allow this access for now by executing:#012# ausearch -c 'pulpcore-api' --raw | audit2allow -M my-pulpcoreapi#012# semodule -X 300 -i my-pulpcoreapi.pp#012
Sep 13 17:46:12 capsule616 pulpcore-api[138205]: [2024-09-13 12:16:12 +0000] [138205] [INFO] Starting gunicorn 22.0.0
Sep 13 17:46:12 capsule616 pulpcore-api[138205]: [2024-09-13 12:16:12 +0000] [138205] [INFO] Listening at: unix:/run/pulpcore-api.sock (138205)
Sep 13 17:46:12 capsule616 pulpcore-api[138205]: [2024-09-13 12:16:12 +0000] [138205] [INFO] Using worker: pulpcore.app.entrypoint.PulpApiWorker
Sep 13 17:46:12 capsule616 systemd[1]: Started Pulp API Server.
Sep 13 17:46:12 capsule616 pulpcore-api[138246]: [2024-09-13 12:16:12 +0000] [138246] [INFO] Booting worker with pid: 138246
Sep 13 17:46:13 capsule616 pulpcore-api[138247]: [2024-09-13 12:16:12 +0000] [138247] [INFO] Booting worker with pid: 138247
Sep 13 17:46:13 capsule616 pulpcore-api[138247]: pulp [None]: pulpcore.app.entrypoint:INFO: Api App '138247@capsule616.example.com' failed to write a heartbeat to the database, sleeping for '45.0' seconds.
Sep 13 17:46:13 capsule616 pulpcore-api[138246]: pulp [None]: pulpcore.app.entrypoint:INFO: Api App '138246@capsule616.example.com' failed to write a heartbeat to the database, sleeping for '45.0' seconds.
Sep 13 17:46:13 capsule616 pulpcore-api[138250]: [2024-09-13 12:16:13 +0000] [138250] [INFO] Booting worker with pid: 138250
Sep 13 17:46:13 capsule616 pulpcore-api[138250]: pulp [None]: pulpcore.app.entrypoint:INFO: Api App '138250@capsule616.example.com' failed to write a heartbeat to the database, sleeping for '45.0' seconds.
--
Sep 13 17:51:15 capsule616 pulpcore-api[144129]: pulp [e853c349-227c-405c-b9d9-181350db3460]: - - [13/Sep/2024:12:21:15 +0000] "GET /pulp/api/v3/distributions/rpm/rpm/?base_path=RedHat%2FStage%2FRHEL9%2Fcontent%2Fdist%2Frhel9%2F9%2Fx86_64%2Fappstream%2Fos HTTP/1.1" 200 764 "-" "OpenAPI-Generator/3.26.1/ruby"
Sep 13 17:51:15 capsule616 pulpcore-api[138247]: pulp [e853c349-227c-405c-b9d9-181350db3460]: pulpcore.tasking.tasks:INFO: Starting task 0191eb54-9941-7c61-98c8-7d8b6dda4bbf
NOTE: For a Capsule with enough available memory, despite the coredumps and SELinux errors, the tasks succeed and no services fail; the sync completes just fine.
For a Capsule with less available memory (which can be mimicked by using an 8 GB RAM Capsule), the worker is additionally terminated with signal 6 and the sync fails, yet when we check the status of the services everything looks fine (see the check after the log excerpt below).
Sep 13 16:57:35 cap616 pulpcore-api[729588]: pulp [8f542e52-4e97-4bcb-8431-68a979b53a24]: - - [13/Sep/2024:11:27:35 +0000] "GET /pulp/api/v3/tasks/0191eb1d-8e55-7061-9987-421bf859be1d/ HTTP/1.1" 200 559 "-" "OpenAPI-Generator/3.49.17/ruby"
Sep 13 16:57:35 cap616 pulpcore-api[725657]: pulp [8f542e52-4e97-4bcb-8431-68a979b53a24]: - - [13/Sep/2024:11:27:35 +0000] "GET /pulp/api/v3/tasks/0191eb1e-4cab-7c93-91c2-d33e712e76ab/ HTTP/1.1" 200 559 "-" "OpenAPI-Generator/3.49.17/ruby"
Sep 13 16:57:35 cap616 systemd[1]: Started Process Core Dump (PID 730078/UID 0).
Sep 13 16:57:35 cap616 pulpcore-api[725657]: pulp [8f542e52-4e97-4bcb-8431-68a979b53a24]: - - [13/Sep/2024:11:27:35 +0000] "GET /pulp/api/v3/tasks/0191eb21-2e06-7f5d-94e3-267fd32c487d/ HTTP/1.1" 200 559 "-" "OpenAPI-Generator/3.49.17/ruby"
Sep 13 16:57:35 cap616 systemd-coredump[730080]: Resource limits disable core dumping for process 724964 (pulpcore-api).
Sep 13 16:57:35 cap616 systemd-coredump[730080]: Process 724964 (pulpcore-api) of user 988 dumped core.
Sep 13 16:57:35 cap616 pulpcore-api[157818]: [2024-09-13 11:27:35 +0000] [157818] [WARNING] Worker with pid 724964 was terminated due to signal 6
Sep 13 16:57:35 cap616 systemd[1]: systemd-coredump@112-730078-0.service: Succeeded.
Sep 13 16:57:35 cap616 pulpcore-api[730083]: [2024-09-13 11:27:35 +0000] [730083] [INFO] Booting worker with pid: 730083
Sep 13 16:57:36 cap616 pulpcore-api[729588]: pulp [8f542e52-4e97-4bcb-8431-68a979b53a24]: - - [13/Sep/2024:11:27:36 +0000] "GET /pulp/api/v3/tasks/0191eb21-6c4a-7c18-8bda-24e98e1c2730/ HTTP/1.1" 200 559 "-" "OpenAPI-Generator/3.49.17/ruby"
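On such a low-memory Capsule, the memory pressure and the worker kill can be confirmed with something like (the time window is an example):

# overall memory headroom while the capsule sync runs
free -m

# workers killed with signal 6 / core dumps around the failure
journalctl --since "1 hour ago" | grep -iE 'signal 6|dumped core'

# the services still report healthy even though the sync task failed
systemctl --no-pager status pulpcore-api pulpcore-content "pulpcore-worker@*"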
Expected behavior:
- No coredumps or killing of workers
- If the problem is caused by runtime memory consumption, that consumption should be reduced
- No SELinux denials either
Business Impact / Additional info:
This can result in sync failures for customers who have a huge amount of repo / CV content being synced concurrently to their target Capsules.
Clones:
SAT-27979 - Concurrent content sync on a specific capsule 6.16 results in pulpcore-api dumping core (Review)