- Bug
- Resolution: Done-Errata
- Major
- rhel-8.9.0
- None
- sanlock-3.8.4-5.el8
- None
- Important
- rhel-sst-logical-storage
- ssg_filesystems_storage_and_HA
- 20
- 20
- 2
- QE ack, Dev ack
- False
-
- None
- None
- Pass
- RegressionOnly
-
- All
- None
What were you trying to do that didn't work?
sanlock uses the system watchdog for data protection in shared storage (SAN) environments. If the watchdog does not reset the machine after the configured timeout (60 seconds), then in certain scenarios two hosts can access the same storage simultaneously and corrupt data. One of the simplest ways to trigger this is to kill the sanlock daemon while it is managing leases for an application.
It was recently discovered that the iTCO_wdt watchdog driver does not reset the machine after the specified timeout (60 seconds), but after two timeout periods (120 seconds). Therefore, sanlock (via its wdmd daemon) must set the watchdog timeout to half of the required value: the iTCO_wdt timeout needs to be set to 30 seconds for the machine to be reset at 60 seconds.
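The arithmetic above can be sketched as follows. This is only an illustration of the halving rule described in this report, not the actual wdmd code: the function names, the `TWO_STAGE_DRIVERS` set, and the placeholder driver name `some_other_wdt` are all hypothetical, and rounding down for odd values is an assumption chosen so that the reset never fires later than requested.

```python
# Illustrative sketch only; names here are hypothetical and not taken from sanlock/wdmd.

# Drivers that, per this report, reset the machine only after two timeout periods.
TWO_STAGE_DRIVERS = {"iTCO_wdt"}


def programmed_timeout(driver: str, reset_after_seconds: int) -> int:
    """Return the timeout to program so the reset fires no later than reset_after_seconds."""
    if driver in TWO_STAGE_DRIVERS:
        if reset_after_seconds < 2:
            raise ValueError("a two-stage watchdog needs a reset delay of at least 2 seconds")
        # Assumption: round down, so an odd request resets slightly early rather than late.
        return reset_after_seconds // 2
    if reset_after_seconds < 1:
        raise ValueError("reset delay must be at least 1 second")
    return reset_after_seconds


def effective_reset_delay(driver: str, programmed_seconds: int) -> int:
    """Return how long after the last keepalive the machine is actually reset."""
    periods = 2 if driver in TWO_STAGE_DRIVERS else 1
    return programmed_seconds * periods


if __name__ == "__main__":
    # Unadjusted: programming 60 seconds on iTCO_wdt resets at 120 seconds.
    print(effective_reset_delay("iTCO_wdt", 60))
    # Adjusted: program 30 seconds to get the required 60-second reset.
    print(programmed_timeout("iTCO_wdt", 60))
```

With the 60-second requirement from this report, the sketch yields a programmed value of 30 seconds for iTCO_wdt and leaves the value unchanged for a driver that resets after a single period.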
While this appears to be a bug in iTCO_wdt, it is the intended behavior according to the hardware specifications:
https://uefi.org/sites/default/files/resources/Watchdog%20Descriptor%20Table.pdf
RHEV and LVM use sanlock for data protection on a SAN and would be exposed to potential data corruption on machines using iTCO_wdt. An Insights query reveals nearly 500 RHEL 8 systems running sanlock and using iTCO_wdt. There are no known cases of corruption caused by this problem.
Please provide the package NVR for which bug is seen:
How reproducible:
Steps to reproduce
Expected results
Actual results
- links to
  - RHBA-2024:128251 sanlock update