Uploaded image for project: 'RHEL'
  1. RHEL
  2. RHEL-1235

dnf unfinished transaction when upgrading via SSH session

    • sst_cs_software_management
    • ssg_core_services
    • 2
    • False
    • Hide

      None

      Show
      None
    • DNF Sprint 4
    • If docs needed, set a value

      +++ This bug was initially created as a clone of Bug #2127943 +++

      Description of problem:

      A dnf global update, with installation and upgrade of different package, is reported in failure status in the database, but the operation of upgrading and installation seems concluded fine.
      The package are installed and upgraded.

      Also, the timestamp of end of the operation, is reported at the 01 Jan 1970

      Version-Release number of selected component (if applicable):
      dnf-4.7.0-8.el8.noarch

      How reproducible:
      I tried to reproduce it but couldn't

      Additional info:

      This is the output of the history info:

      Transaction ID : 18
      Begin time : Wed 06 Jul 2022 07:16:13 PM CEST
      Begin rpmdb : 572:77859c16fafa7db2b341622ef1a47f7485e1c9ff
      End time : Thu 01 Jan 1970 01:00:00 AM CET (-1657127773 seconds)
      End rpmdb :
      User : 20396
      Return-Code : Failure: 1
      Releasever :
      Command Line :
      Comment :

      I try to find a reference for the "Failure Code 1", but i can't find anything about it.

      In the documentation of dnf, there isn't a description of the error codes to understand where is the issue.

      Thanks and best regards

      Giancarlo

      — Additional comment from Jan Kolarik on 2022-09-29 10:29:52 CEST —

      I was inspecting the logs from the customer portal, specifically the SOS report from the sosreport-lprdsapa007-Case03278642-2022-08-02-qktqtyf.tar.xz file and I think I found the cause of the problem.

      The DNF upgrade was done through the SSH session. DNF started the RPM transaction, then packages upgrades were following. Somewhere in the middle of the transaction the SSH server upgrade proceeded which resulted in SSH session shutdown trigger. Therefore the RPM transaction was not properly ended and the data in history DB were left in uninitialized state (end time in trans table, states in trans_item, etc.).

      Also the confusing thing could be that when comparing the rpm log with the transaction list from history DB, it seems that all the individual packages were upgraded successfully, probably because the SSH session shutdown has a timeout set up to wait for child processes to exit and RPM transaction made it in time, although the DNF process itself seems that didn't (maybe the RPM transaction handles the signals differently than DNF in general ?).

      Mixed data from the logs regarding the transaction from Jul 27 2022 23:10:56 GMT-0700, having DNF transaction ID = 19:

      Jul 27 23:10:09 lprdsapa007 systemd[1]: Started User Manager for UID 20456.
      Jul 27 23:10:09 lprdsapa007 systemd[1]: Started User Manager for UID 20456.
      Jul 27 23:10:09 lprdsapa007 systemd[1]: Started Session 1068 of user mk4c.
      Jul 27 23:10:09 lprdsapa007 systemd[1]: Started Session 1068 of user mk4c.
      Jul 27 23:10:10 lprdsapa007 ansible-setup[3998574]: Invoked with filter=ansible_pkg_mgr gather_subset=['!all'] gather_timeout=10 fact_path=/etc/ansible/facts.d
      Jul 27 23:10:10 lprdsapa007 ansible-setup[3998574]: Invoked with filter=ansible_pkg_mgr gather_subset=['!all'] gather_timeout=10 fact_path=/etc/ansible/facts.d
      Jul 27 23:10:11 lprdsapa007 ansible-dnf[3998636]: Invoked with name=['*'] state=latest allow_downgrade=False autoremove=False bugfix=False disable_gpg_check=False disable_plugin=[] disablerepo=[] download_only=False enable_plugin=[] enablerepo=[] exclude=[] installroot=/ install_repoquery=True install_weak_deps=True security=False skip_broken=False update_cache=False update_only=False validate_certs=True lock_timeout=30 conf_file=None disable_excludes=None download_dir=None list=None releasever=None
      Jul 27 23:10:11 lprdsapa007 ansible-dnf[3998636]: Invoked with name=['*'] state=latest allow_downgrade=False autoremove=False bugfix=False disable_gpg_check=False disable_plugin=[] disablerepo=[] download_only=False enable_plugin=[] enablerepo=[] exclude=[] installroot=/ install_repoquery=True install_weak_deps=True security=False skip_broken=False update_cache=False update_only=False validate_certs=True lock_timeout=30 conf_file=None disable_excludes=None download_dir=None list=None releasever=None

      ...

      2022-07-27T23:15:00-0700 SUBDEBUG Upgraded: openssh-server-8.0p1-10.el8.x86_64

      ...

      Jul 27 23:15:00 lprdsapa007 sshd[1240]: Received signal 15; terminating.
      Jul 27 23:15:00 lprdsapa007 sshd[1240]: Received signal 15; terminating.
      Jul 27 23:15:00 lprdsapa007 sshd[4043642]: Server listening on 0.0.0.0 port 22.
      Jul 27 23:15:00 lprdsapa007 sshd[4043642]: Server listening on 0.0.0.0 port 22.
      Jul 27 23:15:07 lprdsapa007 polkitd[4044028]: Loading rules from directory /etc/polkit-1/rules.d
      Jul 27 23:15:07 lprdsapa007 polkitd[4044028]: Loading rules from directory /usr/share/polkit-1/rules.d
      Jul 27 23:15:07 lprdsapa007 polkitd[4044028]: Loading rules from directory /etc/polkit-1/rules.d
      Jul 27 23:15:07 lprdsapa007 polkitd[4044028]: Loading rules from directory /usr/share/polkit-1/rules.d
      Jul 27 23:15:07 lprdsapa007 polkitd[4044028]: Finished loading, compiling and executing 2 rules
      Jul 27 23:15:07 lprdsapa007 polkitd[4044028]: Finished loading, compiling and executing 2 rules
      Jul 27 23:15:07 lprdsapa007 polkitd[4044028]: Acquired the name org.freedesktop.PolicyKit1 on the system bus
      Jul 27 23:15:07 lprdsapa007 polkitd[4044028]: Acquired the name org.freedesktop.PolicyKit1 on the system bus
      Jul 27 23:15:11 lprdsapa007 sshd[3998555]: Timeout, client not responding from user mk4c 10.161.36.64 port 59652
      Jul 27 23:15:11 lprdsapa007 sshd[3998542]: pam_unix(sshd:session): session closed for user mk4c
      Jul 27 23:15:11 lprdsapa007 sudo[3998634]: pam_unix(sudo:session): session closed for user root
      Jul 27 23:15:11 lprdsapa007 systemd-logind[1413]: Session 1068 logged out. Waiting for processes to exit.

      ...

      // Last record related to this transaction from the dnf.rpm.log:

      2022-07-27T23:15:51-0700 SUBDEBUG Upgraded: libgcc-8.5.0-4.el8_5.x86_64

      — Additional comment from Jan Kolarik on 2022-09-29 10:43:35 CEST —

      Ad to last comment: the RPM transaction seems to be complete, but it was terminated somewhere inside or in the wrapper, because the last message in the DNF log is "RPM transaction start." and this is the code of the transaction execution from DNF base.py:

      logger.log(dnf.logging.DDEBUG, 'RPM transaction start.')
      errors = self._ts.run(cb.callback, '')
      logger.log(dnf.logging.DDEBUG, 'RPM transaction over.')

      — Additional comment from Panu Matilainen on 2022-10-05 14:38:33 CEST —

      Rpm blocks deadly signals during the transaction, but when those are unblocked they hit the dnf side which may still have processing left to do.

      Rpm does export the refcounted signal blocking mechanism, so dnf could call rpmsqBlock(1) when it's own transaction processing starts and once the rpm-side transaction has completed and you've done all dnf-side processing, call rpmsqBlock(0) to release it. This is exported to Python too as rpm.blockSignals() in RHEL >= 8 IIRC.
      That way there would be no signal handling conflicts or tiny race windows when signal handling changes.

      — Additional comment from Jan Kolarik on 2022-10-05 17:35:22 CEST —

      Thank you Panu for the tip! Although we agreed to replan implementing this for the DNF 5, I tried to propose a simple fix for the DNF4. I was testing it locally by sending the SIGTERM signal to the DNF process during the RPM transaction processing.

      Following PR was queued to fix this: https://github.com/rpm-software-management/dnf/pull/1854

            jkolarik Jan Kolarik
            jkolarik Jan Kolarik
            packaging-team-maint packaging-team-maint
            Software Management QE Software Management QE
            Votes:
            0 Vote for this issue
            Watchers:
            8 Start watching this issue

              Created:
              Updated:
              Resolved: