  1. RHEL
  2. RHEL-21814

adjust times for iTCO_wdt watchdog driver

    • Bug
    • Resolution: Done-Errata
    • Major
    • rhel-8.10
    • rhel-8.9.0
    • sanlock
    • None
    • sanlock-3.8.4-5.el8
    • None
    • Important
    • rhel-sst-logical-storage
    • ssg_filesystems_storage_and_HA
    • 20
    • 20
    • 2
    • QE ack, Dev ack
    • False
    • None
    • None
    • All
    • None

      What were you trying to do that didn't work?

       

      sanlock uses the system watchdog for data protection in shared storage (SAN) environments.  If the watchdog does not reset the machine within the configured timeout (60 seconds), then in certain scenarios two hosts can access the same storage simultaneously and corrupt data.  One of the simplest ways to trigger this is to kill the sanlock daemon while it is managing leases for an application.

       

      It was recently discovered that the iTCO_wdt watchdog driver does not reset the machine after the specified timeout (60 seconds), but after two timeout periods (120 seconds).  Therefore, sanlock (via its wdmd daemon) must set the watchdog timeout to half of the required value; that is, the iTCO_wdt timeout must be set to 30 seconds for the machine to be reset at 60 seconds.
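
      The arithmetic above can be sketched as follows.  This is only an illustration of the halving rule described in this report, not wdmd's actual code: the helper names and the TWO_PERIOD_DRIVERS set are hypothetical, and the sketch assumes that any driver not listed resets the machine after a single timeout period.

```python
# Illustrative sketch only; names are hypothetical and not taken from wdmd.

# Drivers that, per this report, reset the machine after two timeout periods.
TWO_PERIOD_DRIVERS = {"iTCO_wdt"}


def periods_until_reset(driver: str) -> int:
    """Number of timeout periods that elapse before the machine is reset.

    Assumption: drivers not listed above reset after one period.
    """
    return 2 if driver in TWO_PERIOD_DRIVERS else 1


def programmed_timeout(reset_after_seconds: int, driver: str) -> int:
    """Timeout to program so the reset happens at reset_after_seconds."""
    periods = periods_until_reset(driver)
    if reset_after_seconds % periods != 0:
        raise ValueError("reset time is not a multiple of the period count")
    return reset_after_seconds // periods


def actual_reset_time(programmed_seconds: int, driver: str) -> int:
    """Seconds until reset for a given programmed timeout."""
    return programmed_seconds * periods_until_reset(driver)


if __name__ == "__main__":
    # Programming 60 seconds on iTCO_wdt delays the reset to 120 seconds.
    print(actual_reset_time(60, "iTCO_wdt"))
    # Programming 30 seconds gives the required 60-second reset.
    print(programmed_timeout(60, "iTCO_wdt"))
```

      With a required reset time of 60 seconds, the sketch yields a programmed timeout of 30 seconds for iTCO_wdt, matching the values given above.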

       

      While this appears to be a bug in iTCO_wdt, according to the hardware specifications, it is the intended behavior:

      https://uefi.org/sites/default/files/resources/Watchdog%20Descriptor%20Table.pdf

       

      RHEV and LVM use sanlock for data protection on a SAN, and machines using iTCO_wdt would expose them to potential data corruption.  An Insights query shows nearly 500 RHEL 8 systems running sanlock and using iTCO_wdt.  There are no known cases of corruption caused by this problem.

       


              David Teigland (teigland@redhat.com)
              Cluster QE