Bug | Resolution: Unresolved | rhel-7.9.z | Important | rhel-sst-conversions
What were you trying to do that didn't work?
Executing a conversion on a CentOS system that has a lot of third-party packages hangs at the "ListThirdPartyPackages" task.
Analysis shows that the thread listing the third-party packages (PID 10917) hangs writing to the pipe because the pipe is full:
10653 10:10:10.883927 execve("/usr/bin/convert2rhel", ["convert2rhel", "--debug"], ...
:
10653 10:13:06.065944 pipe([5<pipe:[4657121]>, 6<pipe:[4657121]>]) = 0 <0.000017>
:
10653 10:13:06.069239 clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f05bf68ba10) = 10910 <0.001226>
:
10910 10:13:06.089383 clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f05bf68ba10) = 10911 <0.002012>
:
10911 10:13:06.110839 execve("/usr/bin/repoquery", ["repoquery", "--quiet", "-q", "0:centreon-plugin-Applications-Biztalk-20221115-095034.el7.centos.noarch", ...
:
10911 10:13:12.183837 exit_group(0) = ?
:
10910 10:13:12.220421 clone(child_stack=0x7f05a8b9afb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f05a8b9b9d0, tls=0x7f05a8b9b700, child_tidptr=0x7f05a8b9b9d0) = 10917 <0.000109>
:
10917 10:13:12.222829 write(6<pipe:[4657121]>, "\200\2X\311\212\1\0Package Vendor/Packager "..., 101075 <unfinished ...>
10910 10:13:12.223042 futex(0x26cd140, FUTEX_WAIT_PRIVATE, 0, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set) <11.902226>
10910 10:13:24.125430 --- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
The pipe is full because nobody reads it: the write() above tries to push 101075 bytes of pickled data through a pipe whose buffer holds only 64 KiB by default on Linux, so it blocks until a reader drains the other end, and no reader ever comes.
We are executing the function print_pkg_info(), which runs in a child process (via the run_as_child_process decorator):
363 @utils.run_as_child_process
364 def print_pkg_info(pkgs):
:
385     header = (
386         "%-*s %-*s %s"
387         % (
388             max_nvra_length,
389             "Package",
390             max_packager_length,
391             "Vendor/Packager",
392             "Repository",
393         )
394         + "\n"
395     )
:
The wrapper does a process.start() and then immediately a process.join(), before anything is read from the queue (lines 241 and 242):
141 def run_as_child_process(func):
:
195     @wraps(func)
196     def wrapper(*args, **kwargs):
197         """
198         Wrapper function to execute and control the function attached to the
199         decorator.
:
232         queue = multiprocessing.Queue()
233         kwargs.update({"func": func, "queue": queue})
234         process = Process(target=inner_wrapper, args=args, kwargs=kwargs)
235
236         # Running the process as a daemon prevents it from hanging if a SIGINT
237         # is raised, as all childs will be terminated with it.
238         # https://docs.python.org/2.7/library/multiprocessing.html#multiprocessing.Process.daemon
239         process.daemon = True
240         try:
241             process.start()
242             process.join()
:
252         if not queue.empty():
253             # We don't need to block the I/O as we are mostly done with
254             # the child process and no exception was raised, so we can
255             # instantly retrieve the item that was in the queue.
256             return queue.get(block=False)
I'm no Python specialist, but it looks to me like there is a deadlock here: process.join() is called before the queue is drained, and the child process cannot exit while its queue feeder thread is still blocked writing the result into the full pipe. The Python documentation warns about exactly this pattern ("Joining processes that use queues").
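For illustration, the same pattern can be reproduced outside convert2rhel with a few lines (a minimal sketch, not project code; the 1 MiB payload is just an arbitrary value larger than the pipe buffer):

import multiprocessing

def child(queue):
    # Anything whose pickled form exceeds the pipe capacity (64 KiB by
    # default on Linux) forces the queue's feeder thread to block in
    # send() until the parent reads from the other end of the pipe.
    queue.put("x" * (1024 * 1024))

if __name__ == "__main__":
    queue = multiprocessing.Queue()
    process = multiprocessing.Process(target=child, args=(queue,))
    process.start()
    process.join()      # deadlock: the child cannot exit while its
                        # feeder thread is blocked on the full pipe
    print(queue.get())  # never reached

Moving the queue.get() before the process.join() makes this script finish immediately.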
For sure, on my reproducer using the customer's RPMDB, we can see that the process with PID 10910 is blocked in a join(), waiting for its QueueFeederThread:
(gdb) py-bt
#4 Waiting for a lock (e.g. GIL)
#5 Waiting for a lock (e.g. GIL)
#7 Frame 0xd99ac0, for file /usr/lib64/python2.7/threading.py, line 339, in wait (self=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7fab2f05ad90>, acquire=<built-in method acquire of thread.lock object at remote 0x7fab2f05ad90>, _Condition__waiters=[<thread.lock at remote 0x7fab2f05abb0>], release=<built-in method release of thread.lock object at remote 0x7fab2f05ad90>) at remote 0x7fab186b8f90>, timeout=None, balancing=True, waiter=<thread.lock at remote 0x7fab2f05abb0>, saved_state=None)
    waiter.acquire()
#11 Frame 0x7fab1881ebc0, for file /usr/lib64/python2.7/threading.py, line 951, in join (self=<Thread(_Thread__ident=140372824950528, _Thread__block=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7fab2f05ad90>, acquire=<built-in method acquire of thread.lock object at remote 0x7fab2f05ad90>, _Condition__waiters=[<thread.lock at remote 0x7fab2f05abb0>], release=<built-in method release of thread.lock object at remote 0x7fab2f05ad90>) at remote 0x7fab186b8f90>, _Thread__name='QueueFeederThread', _Thread__daemonic=True, _Thread__started=<_Event(_Verbose__verbose=False, _Event__flag=True, _Event__cond=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7fab2f05ad10>, acquire=<built-in method acquire of thread.lock object at remote 0x7fab2f05ad10>, _Condition__waiters=[], release=<built-in method release of thread.lock object at remote 0x7fab2f05ad10>) at remote 0x7fab186b8910>) at remote 0x7fab186b8f50>, _Thread__stderr=<file at remote 0x7fab2f0a91e0>, _Threa...(truncated)
    self.__block.wait()
...
The feeder thread itself hangs in the queue code, blocked in send():
(gdb) py-bt
#5 Frame 0x7fab10000b50, for file /usr/lib64/python2.7/multiprocessing/queues.py, line 266, in _feed (buffer=<collections.deque at remote 0x7fab18a55980>, notempty=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7fab2f05abf0>, acquire=<built-in method acquire of thread.lock object at remote 0x7fab2f05abf0>, _Condition__waiters=[], release=<built-in method release of thread.lock object at remote 0x7fab2f05abf0>) at remote 0x7fab187d6850>, send=<built-in method send of _multiprocessing.Connection object at remote 0xdb4220>, writelock=<Lock(release=<built-in method release of _multiprocessing.SemLock object at remote 0x7fab187b5990>, acquire=<built-in method acquire of _multiprocessing.SemLock object at remote 0x7fab187b5990>, _semlock=<_multiprocessing.SemLock at remote 0x7fab187b5990>) at remote 0x7fab187d6250>, close=<built-in method close of _multiprocessing.Connection object at remote 0xdb4220>, is_exiting=<function at remote 0x7fab2543d320>, nacquire=<built-in method acquire of ...(truncated)
send(obj)
...
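One way out (a sketch of a possible fix under the assumptions above, not necessarily the patch convert2rhel will ship) is to drain the queue before joining, which is the workaround the Python documentation recommends for this pattern:

# Inside the wrapper shown above: read the result *before* joining, so
# the feeder thread can flush the pipe and let the child exit.
process.start()
try:
    # Blocks until the child has pushed its result through the pipe.
    # The 300-second timeout is an arbitrary safety net in case the
    # child dies without putting anything on the queue.
    result = queue.get(timeout=300)
except Exception:
    result = None
process.join()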
Please provide the package NVR for which the bug is seen:
convert2rhel-1.4.1-1.el7.noarch
How reproducible:
Always, with the customer's RPMDB.