RHEL-4320

/boot/grub2/grubenv's timestamp is getting modified continuously due to "boot_success" implementation

    • Bug
    • Resolution: Unresolved
    • Normal
    • None
    • rhel-9.0.0
    • aide
    • None
    • Moderate
    • rhel-sst-security-special-projects
    • ssg_security

      Description of problem:

      A customer has noticed that the /boot/grub2/grubenv file is modified continuously, which raises an alert whenever the aide monitoring tool (https://aide.github.io/) runs.
      This happens because every time a user logs in and keeps a session open for more than 2 minutes, the user's grub-boot-success.timer elapses, causing the grub-boot-success.service unit to write "boot_success=1" into /boot/grub2/grubenv.
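
To observe the behaviour described above, the content and modification time of the file can be inspected. The following is a minimal Python sketch, assuming the usual GRUB environment block layout (a "# GRUB Environment Block" header, key=value lines, '#' padding). The helper names parse_grubenv and describe are hypothetical, and reading the file typically requires root.

```python
import os
import time
from typing import Dict

GRUBENV_HEADER = b"# GRUB Environment Block\n"


def parse_grubenv(data: bytes) -> Dict[str, str]:
    """Return the key=value pairs stored in a GRUB environment block."""
    if not data.startswith(GRUBENV_HEADER):
        raise ValueError("not a GRUB environment block")
    env = {}
    for line in data[len(GRUBENV_HEADER):].split(b"\n"):
        # Lines starting with '#' are comments or the trailing padding.
        if not line or line.startswith(b"#"):
            continue
        key, sep, value = line.partition(b"=")
        if sep:
            env[key.decode()] = value.decode()
    return env


def describe(path: str = "/boot/grub2/grubenv") -> str:
    """Summarize boot_success and the last modification time of the file."""
    with open(path, "rb") as f:
        env = parse_grubenv(f.read())
    mtime = time.strftime(
        "%Y-%m-%d %H:%M:%S", time.localtime(os.path.getmtime(path))
    )
    return f"boot_success={env.get('boot_success', '<unset>')} mtime={mtime}"
```

Calling describe() before and after a user session has lasted more than 2 minutes should show the modification time moving forward, which is the change aide reports.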

      I have several concerns about this feature and the current implementation.

      Regarding the feature itself, I understand the idea is to automatically boot the previous kernel if the newer kernel fails to boot several times in a row.
      On RHEL, this doesn't work for several reasons:
      1. usually the latest kernel fails to boot because the initramfs is broken, which leads to a kernel panic and leaves the system hanging
      2. assuming the kernel and initramfs boot fine, if the system fails after switch root, it will fail regardless of which kernel was booted
      3. the feature is not enabled by default, since its implementation inside grub.cfg relies on a "boot_counter" variable which isn't set
      4. the implementation inside grub.cfg assumes the second menu entry will boot better, which is far from guaranteed: on customer systems it's not rare that kernel updates are installed without any reboot in between, pending a "reboot window", so the second entry may itself never have been booted
      5. it's unclear when a RHEL system should be considered as "booted fine": many systems act purely as servers (e.g. a database server), with no user except root connecting to the system
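
The fallback logic referred to in points 3 and 4 can be modelled roughly as follows. This is a simplified Python model, not the actual grub.cfg script: the function name next_boot is hypothetical, and the details (the counter being decremented on each unsuccessful boot, "boot_success" being reset to 0 on every boot) are assumptions made for illustration.

```python
from typing import Dict, Tuple


def next_boot(env: Dict[str, str]) -> Tuple[int, Dict[str, str]]:
    """Return (index of the menu entry to boot, updated environment)."""
    env = dict(env)
    default = 0
    counter = env.get("boot_counter")
    # Point 3: without boot_counter the fallback logic never activates.
    if counter is not None and env.get("boot_success") == "0":
        if counter in ("0", "-1"):
            # Point 4: the countdown is over, so the second entry is chosen,
            # whether or not that entry was ever booted successfully.
            default = 1
            env["boot_counter"] = "-1"
        else:
            env["boot_counter"] = str(int(counter) - 1)
    # Assumption: the flag is cleared on each boot and must be set again
    # from userspace (here, by grub-boot-success.service).
    env["boot_success"] = "0"
    return default, env
```

With "boot_counter" unset, next_boot always returns entry 0, which matches the observation that the feature is inactive by default while the grubenv file is still rewritten.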

      Regarding the implementation itself (the grub-boot-success.timer/grub-boot-success.service units), it doesn't work on RHEL in many cases:
      1. when there is only a root user accessing the system (because "ConditionUser=!@system" doesn't evaluate to true when uid is 0)
      2. if other users are not using the "systemd --user" service, which is usually what we recommend and will become the default at some point for non-graphical installations
      3. when users use "lingering" (i.e. starting the "systemd --user" service at boot without actually logging into the system): the "boot_success=1" flag will be set automatically after 2 minutes even if users cannot effectively log in
      4. there may be a race if 2 users log in concurrently: after 2 minutes, both grub-boot-success.service units will execute concurrently, which may corrupt the grubenv file content if no proper locking is performed (I didn't check the code to confirm or rule this out)
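
For the race described in point 4, "proper locking" could look like the sketch below: a read-modify-write performed under an exclusive advisory lock. This is an illustration in Python only, not what the current implementation does (which was not checked). The function name set_flag is hypothetical, and the sketch works on a plain key=value file without preserving the fixed-size padding of a real grubenv file.

```python
import fcntl
import os


def set_flag(path: str, key: str, value: str) -> None:
    """Set key=value in a key=value file while holding an exclusive lock."""
    with open(path, "r+", encoding="utf-8") as f:
        # Blocks until any other writer holding the lock has finished.
        fcntl.flock(f, fcntl.LOCK_EX)
        try:
            lines = [
                line for line in f.read().splitlines()
                if not line.startswith(key + "=")
            ]
            lines.append(f"{key}={value}")
            f.seek(0)
            f.write("\n".join(lines) + "\n")
            f.truncate()
            f.flush()
            os.fsync(f.fileno())
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
```

Two sessions calling set_flag(path, "boot_success", "1") at the same moment would then be serialized instead of interleaving their writes. Advisory locks only help if every writer takes them.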

      For all these reasons and many more, I think this functionality is not wanted on RHEL at all; it may cause more trouble than it solves.
      If we do decide to keep it, the implementation needs to be improved.

      Version-Release number of selected component (if applicable):

      grub2-tools-minimal since RHEL8

      How reproducible:

      N/A

              rsroka@redhat.com Radovan Sroka
              rhn-support-rmetrich Renaud Métrich
              Radovan Sroka Radovan Sroka
              SSG Security QE SSG Security QE