Bug | Resolution: Unresolved | rhel-7.9.z | Important | rhel-sst-conversions
What were you trying to do that didn't work?
Executing a conversion on a CentOS system that has a lot of third-party packages hangs at the "ListThirdPartyPackages" task.
Analysis shows that the thread listing the third-party packages (PID 10917) hangs writing to the pipe because the pipe is full:
10653 10:10:10.883927 execve("/usr/bin/convert2rhel", ["convert2rhel", "--debug"], ...
:
10653 10:13:06.065944 pipe([5<pipe:[4657121]>, 6<pipe:[4657121]>]) = 0 <0.000017>
:
10653 10:13:06.069239 clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f05bf68ba10) = 10910 <0.001226>
:
10910 10:13:06.089383 clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f05bf68ba10) = 10911 <0.002012>
:
10911 10:13:06.110839 execve("/usr/bin/repoquery", ["repoquery", "--quiet", "-q", "0:centreon-plugin-Applications-Biztalk-20221115-095034.el7.centos.noarch", ...
:
10911 10:13:12.183837 exit_group(0) = ?
:
10910 10:13:12.220421 clone(child_stack=0x7f05a8b9afb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f05a8b9b9d0, tls=0x7f05a8b9b700, child_tidptr=0x7f05a8b9b9d0) = 10917 <0.000109>
:
10917 10:13:12.222829 write(6<pipe:[4657121]>, "\200\2X\311\212\1\0Package Vendor/Packager "..., 101075 <unfinished ...>
10910 10:13:12.223042 futex(0x26cd140, FUTEX_WAIT_PRIVATE, 0, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set) <11.902226>
10910 10:13:24.125430 --- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
The pipe is full because nobody reads it: the write() above tries to push 101075 bytes of pickled data through a pipe whose buffer holds only 64 KiB by default on Linux, so it blocks until a reader drains the other end, and no reader ever comes.
We are executing the function print_pkg_info(), which runs in a child process (via the run_as_child_process decorator):
363 @utils.run_as_child_process
364 def print_pkg_info(pkgs):
:
385     header = (
386         "%-*s %-*s %s"
387         % (
388             max_nvra_length,
389             "Package",
390             max_packager_length,
391             "Vendor/Packager",
392             "Repository",
393         )
394         + "\n"
395     )
:
The wrapper does a process.start() and then immediately a process.join(), before anything is read from the queue (lines 241 and 242):
141 def run_as_child_process(func):
:
195     @wraps(func)
196     def wrapper(*args, **kwargs):
197         """
198         Wrapper function to execute and control the function attached to the
199         decorator.
:
232         queue = multiprocessing.Queue()
233         kwargs.update({"func": func, "queue": queue})
234         process = Process(target=inner_wrapper, args=args, kwargs=kwargs)
235
236         # Running the process as a daemon prevents it from hanging if a SIGINT
237         # is raised, as all childs will be terminated with it.
238         # https://docs.python.org/2.7/library/multiprocessing.html#multiprocessing.Process.daemon
239         process.daemon = True
240         try:
241             process.start()
242             process.join()
:
252         if not queue.empty():
253             # We don't need to block the I/O as we are mostly done with
254             # the child process and no exception was raised, so we can
255             # instantly retrieve the item that was in the queue.
256             return queue.get(block=False)
I'm no Python specialist, but it looks to me like there is a deadlock here: process.join() is called before the queue is drained, and the child process cannot exit while its queue feeder thread is still blocked writing the result into the full pipe. The Python documentation warns about exactly this pattern ("Joining processes that use queues").
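For illustration, the same pattern can be reproduced outside convert2rhel with a few lines (a minimal sketch, not project code; the 1 MiB payload is just an arbitrary value larger than the pipe buffer):

import multiprocessing

def child(queue):
    # Anything whose pickled form exceeds the pipe capacity (64 KiB by
    # default on Linux) forces the queue's feeder thread to block in
    # send() until the parent reads from the other end of the pipe.
    queue.put("x" * (1024 * 1024))

if __name__ == "__main__":
    queue = multiprocessing.Queue()
    process = multiprocessing.Process(target=child, args=(queue,))
    process.start()
    process.join()      # deadlock: the child cannot exit while its
                        # feeder thread is blocked on the full pipe
    print(queue.get())  # never reached

Moving the queue.get() before the process.join() makes this script finish immediately.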
For sure, on my reproducer using the customer's RPMDB, we can see that the process with PID 10910 is blocked in a join(), waiting for its QueueFeederThread:
(gdb) py-bt
#4 Waiting for a lock (e.g. GIL)
#5 Waiting for a lock (e.g. GIL)
#7 Frame 0xd99ac0, for file /usr/lib64/python2.7/threading.py, line 339, in wait (self=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7fab2f05ad90>, acquire=<built-in method acquire of thread.lock object at remote 0x7fab2f05ad90>, _Condition__waiters=[<thread.lock at remote 0x7fab2f05abb0>], release=<built-in method release of thread.lock object at remote 0x7fab2f05ad90>) at remote 0x7fab186b8f90>, timeout=None, balancing=True, waiter=<thread.lock at remote 0x7fab2f05abb0>, saved_state=None)
    waiter.acquire()
#11 Frame 0x7fab1881ebc0, for file /usr/lib64/python2.7/threading.py, line 951, in join (self=<Thread(_Thread__ident=140372824950528, _Thread__block=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7fab2f05ad90>, acquire=<built-in method acquire of thread.lock object at remote 0x7fab2f05ad90>, _Condition__waiters=[<thread.lock at remote 0x7fab2f05abb0>], release=<built-in method release of thread.lock object at remote 0x7fab2f05ad90>) at remote 0x7fab186b8f90>, _Thread__name='QueueFeederThread', _Thread__daemonic=True, _Thread__started=<_Event(_Verbose__verbose=False, _Event__flag=True, _Event__cond=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7fab2f05ad10>, acquire=<built-in method acquire of thread.lock object at remote 0x7fab2f05ad10>, _Condition__waiters=[], release=<built-in method release of thread.lock object at remote 0x7fab2f05ad10>) at remote 0x7fab186b8910>) at remote 0x7fab186b8f50>, _Thread__stderr=<file at remote 0x7fab2f0a91e0>, _Threa...(truncated)
    self.__block.wait()
...
The feeder thread itself hangs in the queue code, blocked in send():
(gdb) py-bt
#5 Frame 0x7fab10000b50, for file /usr/lib64/python2.7/multiprocessing/queues.py, line 266, in _feed (buffer=<collections.deque at remote 0x7fab18a55980>, notempty=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7fab2f05abf0>, acquire=<built-in method acquire of thread.lock object at remote 0x7fab2f05abf0>, _Condition__waiters=[], release=<built-in method release of thread.lock object at remote 0x7fab2f05abf0>) at remote 0x7fab187d6850>, send=<built-in method send of _multiprocessing.Connection object at remote 0xdb4220>, writelock=<Lock(release=<built-in method release of _multiprocessing.SemLock object at remote 0x7fab187b5990>, acquire=<built-in method acquire of _multiprocessing.SemLock object at remote 0x7fab187b5990>, _semlock=<_multiprocessing.SemLock at remote 0x7fab187b5990>) at remote 0x7fab187d6250>, close=<built-in method close of _multiprocessing.Connection object at remote 0xdb4220>, is_exiting=<function at remote 0x7fab2543d320>, nacquire=<built-in method acquire of ...(truncated)
send(obj)
...
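One way out (a sketch of a possible fix under the assumptions above, not necessarily the patch convert2rhel will ship) is to drain the queue before joining, which is the workaround the Python documentation recommends for this pattern:

# Inside the wrapper shown above: read the result *before* joining, so
# the feeder thread can flush the pipe and let the child exit.
process.start()
try:
    # Blocks until the child has pushed its result through the pipe.
    # The 300-second timeout is an arbitrary safety net in case the
    # child dies without putting anything on the queue.
    result = queue.get(timeout=300)
except Exception:
    result = None
process.join()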
Please provide the package NVR for which the bug is seen:
convert2rhel-1.4.1-1.el7.noarch
How reproducible:
Always, with the customer's RPMDB.