- Bug
- Resolution: Done-Errata
- Critical
- None
- kernel-4.18.0-477.57.1.el8_8
- None
- Low
- ZStream
- rhel-sst-logical-storage
- ssg_filesystems_storage_and_HA
- 3
- False
-
- None
- None
- None
This is a clone of issue RHEL-22698 to use for version rhel-8.8.0.z
–
Original description:
A customer's system has experienced I/O hangs when writing to a raid5 md device. The hang and deadlock occur when a sync check is also running against the md device.
At first, the issue looks similar to RHEL-12284 due to the hung sync check being stuck under raid5_get_active_stripe.
crash> ps -m 1076372
[2 05:08:34.428] [UN] PID: 1076372 TASK: ffffa031c545c000 CPU: 4 COMMAND: "md0_resync"
crash> bt 1076372
PID: 1076372 TASK: ffffa031c545c000 CPU: 4 COMMAND: "md0_resync"
#0 [ffffc1f2e2f67be0] __schedule at ffffffff85c0e191
#1 [ffffc1f2e2f67c38] schedule at ffffffff85c0e785
#2 [ffffc1f2e2f67c50] raid5_get_active_stripe at ffffffffc06ba90e [raid456]
#3 [ffffc1f2e2f67cd0] raid5_sync_request at ffffffffc06bf88d [raid456]
#4 [ffffc1f2e2f67d38] md_do_sync.cold.94 at ffffffff85996b8f
#5 [ffffc1f2e2f67eb8] md_thread at ffffffff859887d4
#6 [ffffc1f2e2f67f10] kthread at ffffffff8531e974
#7 [ffffc1f2e2f67f50] ret_from_fork at ffffffff85e0028f
However, this call to raid5_get_active_stripe does not have a stripe_request_ctx, and so there is no batch_last to cause the deadlock described in RHEL-12284.
Instead, the main md thread for the device was deadlocked in raid5d.
crash> ps -m 3181
[0 15:43:48.096] [UN] PID: 3181 TASK: ffffa02e2a87c000 CPU: 11 COMMAND: "md0_raid5"
crash> bt 3181
PID: 3181 TASK: ffffa02e2a87c000 CPU: 11 COMMAND: "md0_raid5"
#0 [ffffc1f2ce3e7d48] __schedule at ffffffff85c0e191
#1 [ffffc1f2ce3e7da0] schedule at ffffffff85c0e785
#2 [ffffc1f2ce3e7db8] raid5d at ffffffffc06be7f7 [raid456]
#3 [ffffc1f2ce3e7eb8] md_thread at ffffffff859887d4
#4 [ffffc1f2ce3e7f10] kthread at ffffffff8531e974
#5 [ffffc1f2ce3e7f50] ret_from_fork at ffffffff85e0028f
raid5d was waiting on the MD_SB_CHANGE_PENDING bit, which was set along with the MD_SB_CHANGE_CLEAN bit.
crash> struct mddev.ro,sb_flags ffff9fbf8a6e9000
ro = 0,
sb_flags = 6,
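For reference, sb_flags = 6 decodes to the MD_SB_CHANGE_CLEAN and MD_SB_CHANGE_PENDING bits, assuming the mddev_sb_flags enum layout matches upstream drivers/md/md.h:
enum mddev_sb_flags {
	MD_SB_CHANGE_DEVS,	/* bit 0: some device status has changed */
	MD_SB_CHANGE_CLEAN,	/* bit 1: transition to or from 'clean' */
	MD_SB_CHANGE_PENDING,	/* bit 2: switch from 'clean' to 'active' in progress */
	MD_SB_NEED_REWRITE,	/* bit 3: metadata write needs to be repeated */
};
/* sb_flags == 6 == (1 << MD_SB_CHANGE_CLEAN) | (1 << MD_SB_CHANGE_PENDING) */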
These bits were set because several tasks had called md_write_start and were waiting there for the device to transition from clean to active. One example:
crash> bt 3261
PID: 3261 TASK: ffffa00050914000 CPU: 7 COMMAND: "xfsaild/dm-9"
#0 [ffffc1f2cdf879e8] __schedule at ffffffff85c0e191
#1 [ffffc1f2cdf87a40] schedule at ffffffff85c0e785
#2 [ffffc1f2cdf87a58] md_write_start at ffffffff8598c72d
#3 [ffffc1f2cdf87ab0] raid5_make_request at ffffffffc06bac63 [raid456]
#4 [ffffc1f2cdf87be0] md_handle_request at ffffffff8598c0ae
#5 [ffffc1f2cdf87c48] md_make_request at ffffffff8598c25b
#6 [ffffc1f2cdf87c68] generic_make_request_no_check at ffffffff856a16e2
#7 [ffffc1f2cdf87cc8] submit_bio at ffffffff856a187c
#8 [ffffc1f2cdf87d08] _xfs_buf_ioapply at ffffffffc096e97f [xfs]
#9 [ffffc1f2cdf87db0] __xfs_buf_submit at ffffffffc09702a3 [xfs]
#10 [ffffc1f2cdf87dd8] xfs_buf_delwri_submit_buffers at ffffffffc097108c [xfs]
#11 [ffffc1f2cdf87e68] xfsaild at ffffffffc09a843f [xfs]
#12 [ffffc1f2cdf87f10] kthread at ffffffff8531e974
#13 [ffffc1f2cdf87f50] ret_from_fork at ffffffff85e0028f
I suspect these md functions deadlocked due to a race condition introduced by upstream commit 5e2cf333b7b "md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d", which itself tried to fix a different race condition.
md_write_start will try to shift a clean array to active, setting the MD_SB_CHANGE_PENDING and MD_SB_CHANGE_CLEAN flags seen set on the mddev struct above. It then wakes up the main thread for the md device, expecting it to run md_check_recovery to transition the device and its array to active.
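For context, a minimal sketch of the relevant md_write_start logic, abbreviated from upstream drivers/md/md.c (the exact RHEL 8 backport may differ slightly):
	/* md_write_start(): clean -> active transition (abbreviated) */
	if (mddev->in_sync || mddev->sync_checkers) {
		spin_lock(&mddev->lock);
		if (mddev->in_sync) {
			mddev->in_sync = 0;
			set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
			set_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags);
			/* expects the main md thread to run md_check_recovery() */
			md_wakeup_thread(mddev->thread);
			did_change = 1;
		}
		spin_unlock(&mddev->lock);
	}
	...
	/* the writer then blocks until MD_SB_CHANGE_PENDING is cleared */
	wait_event(mddev->sb_wait,
		   !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags) ||
		   mddev->suspended);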
But this has a race condition with the fix added by commit 5e2cf333b7b.
static void raid5d(struct md_thread *thread)
{
...
		if (mddev->sb_flags & ~(1 << MD_SB_CHANGE_PENDING)) {
			spin_unlock_irq(&conf->device_lock);
			md_check_recovery(mddev);
			spin_lock_irq(&conf->device_lock);
			continue;
		}

		wait_event_lock_irq(mddev->sb_wait,
			!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags),
			conf->device_lock);
...
The commit intends raid5d to wait only when MD_SB_CHANGE_PENDING alone is set; if any other sb_flags bit is set, it calls md_check_recovery as md_write_start expects. However, the if statement and wait_event_lock_irq read sb_flags separately, and the two checks can race: md_write_start can set MD_SB_CHANGE_CLEAN and MD_SB_CHANGE_PENDING after the if statement sees sb_flags as 0 but before wait_event_lock_irq tests MD_SB_CHANGE_PENDING. When this race occurs, raid5d becomes stuck in a state where it can never run md_check_recovery to clear MD_SB_CHANGE_PENDING. The raid5 code uses MD_SB_CHANGE_PENDING for multiple purposes and is unable to safely distinguish between the different states.
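A schematic interleaving of the two threads (illustrative ordering reconstructed from the vmcore, not an exact trace):
    md0_raid5 (raid5d)                              writer (md_write_start, e.g. xfsaild)
    ------------------                              -------------------------------------
    reads mddev->sb_flags == 0, so the "other
    bits set" branch and its md_check_recovery()
    call are skipped
                                                    set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags)
                                                    set_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)
                                                    md_wakeup_thread(mddev->thread)
                                                    wait_event(mddev->sb_wait, ...) -> sleeps
    wait_event_lock_irq(mddev->sb_wait,
        !test_bit(MD_SB_CHANGE_PENDING, ...)) -> sleeps
    Neither thread can make progress: raid5d is the thread that must run
    md_check_recovery() to clear MD_SB_CHANGE_PENDING, but it is now waiting
    on that same bit.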
A vmcore of this deadlock is on galvatron-x86.cee.redhat.com at /cores/retrace/tasks/755512355/crash/vmcore
What were you trying to do that didn't work?
Writes to a raid5 md device may hang and deadlock while a sync check is running.
Please provide the package NVR for which the bug is seen:
kernel 4.18.0-513.11.1.el8_9.x86_64
How reproducible:
Occurs randomly for one customer when the weekly raid check cron job runs.