RHEL-27235: md raid5 deadlock during sync check [rhel-8.8.0.z]

    • kernel-4.18.0-477.57.1.el8_8
    • ZStream
    • sst_logical_storage
    • ssg_platform_storage

      This is a clone of issue RHEL-22698 to use for version rhel-8.8.0.z

      Original description:
      A customer's system has experienced I/O hangs when writing to a raid5 md device. The hang and deadlock occur when a sync check is also being run against the md device.

      At first, the issue looks similar to RHEL-12284, since the hung sync check thread is stuck under raid5_get_active_stripe.

      crash> ps -m 1076372
      [2 05:08:34.428] [UN] PID: 1076372 TASK: ffffa031c545c000 CPU: 4 COMMAND: "md0_resync"
      crash> bt 1076372
      PID: 1076372 TASK: ffffa031c545c000 CPU: 4 COMMAND: "md0_resync"
      #0 [ffffc1f2e2f67be0] __schedule at ffffffff85c0e191
      #1 [ffffc1f2e2f67c38] schedule at ffffffff85c0e785
      #2 [ffffc1f2e2f67c50] raid5_get_active_stripe at ffffffffc06ba90e [raid456]
      #3 [ffffc1f2e2f67cd0] raid5_sync_request at ffffffffc06bf88d [raid456]
      #4 [ffffc1f2e2f67d38] md_do_sync.cold.94 at ffffffff85996b8f
      #5 [ffffc1f2e2f67eb8] md_thread at ffffffff859887d4
      #6 [ffffc1f2e2f67f10] kthread at ffffffff8531e974
      #7 [ffffc1f2e2f67f50] ret_from_fork at ffffffff85e0028f

      However, this call to raid5_get_active_stripe has no stripe_request_ctx and therefore no batch_last, so it cannot cause the deadlock described in RHEL-12284.

      Instead, the main md thread for the device was deadlocked in raid5d.

      crash> ps -m 3181
      [0 15:43:48.096] [UN] PID: 3181 TASK: ffffa02e2a87c000 CPU: 11 COMMAND: "md0_raid5"
      crash> bt 3181
      PID: 3181 TASK: ffffa02e2a87c000 CPU: 11 COMMAND: "md0_raid5"
      #0 [ffffc1f2ce3e7d48] __schedule at ffffffff85c0e191
      #1 [ffffc1f2ce3e7da0] schedule at ffffffff85c0e785
      #2 [ffffc1f2ce3e7db8] raid5d at ffffffffc06be7f7 [raid456]
      #3 [ffffc1f2ce3e7eb8] md_thread at ffffffff859887d4
      #4 [ffffc1f2ce3e7f10] kthread at ffffffff8531e974
      #5 [ffffc1f2ce3e7f50] ret_from_fork at ffffffff85e0028f

      raid5d was waiting on the MD_SB_CHANGE_PENDING bit, which was set along with bit MD_SB_CHANGE_CLEAN.

      crash> struct mddev.ro,sb_flags ffff9fbf8a6e9000
      ro = 0,
      sb_flags = 6,
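
      The sb_flags value of 6 decodes to MD_SB_CHANGE_CLEAN and MD_SB_CHANGE_PENDING. A minimal sketch of the decode, assuming the mddev_sb_flags bit positions from drivers/md/md.h (MD_SB_CHANGE_DEVS = 0, MD_SB_CHANGE_CLEAN = 1, MD_SB_CHANGE_PENDING = 2); this is a hypothetical userspace helper, not kernel code:

      #include <stdio.h>

      enum mddev_sb_flags {           /* bit positions as in drivers/md/md.h */
              MD_SB_CHANGE_DEVS,      /* 0 */
              MD_SB_CHANGE_CLEAN,     /* 1 */
              MD_SB_CHANGE_PENDING,   /* 2 */
      };

      int main(void)
      {
              unsigned long sb_flags = 6;     /* value from the crash output */

              printf("CHANGE_CLEAN=%lu CHANGE_PENDING=%lu\n",
                     (sb_flags >> MD_SB_CHANGE_CLEAN) & 1,
                     (sb_flags >> MD_SB_CHANGE_PENDING) & 1);
              return 0;       /* prints CHANGE_CLEAN=1 CHANGE_PENDING=1 */
      }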

      These bits were set because several tasks had called md_write_start to transition the device from clean to active and were waiting there for the transition to complete. One example:

      crash> bt 3261
      PID: 3261 TASK: ffffa00050914000 CPU: 7 COMMAND: "xfsaild/dm-9"
      #0 [ffffc1f2cdf879e8] __schedule at ffffffff85c0e191
      #1 [ffffc1f2cdf87a40] schedule at ffffffff85c0e785
      #2 [ffffc1f2cdf87a58] md_write_start at ffffffff8598c72d
      #3 [ffffc1f2cdf87ab0] raid5_make_request at ffffffffc06bac63 [raid456]
      #4 [ffffc1f2cdf87be0] md_handle_request at ffffffff8598c0ae
      #5 [ffffc1f2cdf87c48] md_make_request at ffffffff8598c25b
      #6 [ffffc1f2cdf87c68] generic_make_request_no_check at ffffffff856a16e2
      #7 [ffffc1f2cdf87cc8] submit_bio at ffffffff856a187c
      #8 [ffffc1f2cdf87d08] _xfs_buf_ioapply at ffffffffc096e97f [xfs]
      #9 [ffffc1f2cdf87db0] __xfs_buf_submit at ffffffffc09702a3 [xfs]
      #10 [ffffc1f2cdf87dd8] xfs_buf_delwri_submit_buffers at ffffffffc097108c [xfs]
      #11 [ffffc1f2cdf87e68] xfsaild at ffffffffc09a843f [xfs]
      #12 [ffffc1f2cdf87f10] kthread at ffffffff8531e974
      #13 [ffffc1f2cdf87f50] ret_from_fork at ffffffff85e0028f

      I suspect this deadlock is the result of a race condition introduced by upstream commit 5e2cf333b7b "md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d", which attempted to fix a different race condition.

      md_write_start tries to shift a clean array to active, setting the MD_SB_CHANGE_PENDING and MD_SB_CHANGE_CLEAN flags seen on the mddev struct above. It then wakes the main thread for the md device, expecting it to run md_check_recovery to complete the transition to active.
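
      For context, here is a condensed sketch of the relevant md_write_start logic. It is abridged and paraphrased (comments and elisions are mine, not verbatim from the RHEL 8 source), but it matches the behavior described above and the xfsaild backtrace:

      /* In md_write_start() (abridged sketch, not the exact RHEL 8 source): */
      ...
              if (mddev->in_sync) {
                      mddev->in_sync = 0;
                      set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
                      set_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags);
                      md_wakeup_thread(mddev->thread);  /* expects raid5d to run
                                                           md_check_recovery()  */
              }
      ...
              /* The writers seen above (e.g. xfsaild) are parked here, waiting
               * for md_check_recovery() to update the superblock and clear
               * MD_SB_CHANGE_PENDING.                                          */
              wait_event(mddev->sb_wait,
                         !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags));
      ...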

      But this has a race condition with the fix added by commit 5e2cf333b7b.

      static void raid5d(struct md_thread *thread)
      {
      ...
              if (mddev->sb_flags & ~(1 << MD_SB_CHANGE_PENDING)) {
                      spin_unlock_irq(&conf->device_lock);
                      md_check_recovery(mddev);
                      spin_lock_irq(&conf->device_lock);

                      /*
                       * Waiting on MD_SB_CHANGE_PENDING below may deadlock
                       * seeing md_check_recovery() is needed to clear
                       * the flag when using mdmon.
                       */
                      continue;
              }

              wait_event_lock_irq(mddev->sb_wait,
                      !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags),
                      conf->device_lock);
      ...

      The commit waits only if MD_SB_CHANGE_PENDING is the sole bit set; if other bits are set, it calls md_check_recovery as md_write_start expects. However, the if statement and wait_event_lock_irq read sb_flags separately, so they can race: md_write_start can set MD_SB_CHANGE_PENDING and MD_SB_CHANGE_CLEAN after the if statement sees sb_flags as 0 but before wait_event_lock_irq tests MD_SB_CHANGE_PENDING. When this race occurs, raid5d becomes stuck waiting on a bit that only its own call to md_check_recovery would clear, and it never gets the chance to make that call. The raid5 code is using MD_SB_CHANGE_PENDING for multiple purposes and cannot safely distinguish between the different states.
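
      To make the window concrete, here is a minimal, hypothetical userspace sketch of the interleaving. Nothing in it is kernel code: sb_flags, raid5d_like and md_write_start_like are stand-ins, and sleeps are used only to force the bad ordering (build with gcc -pthread).

      #include <pthread.h>
      #include <stdio.h>
      #include <unistd.h>

      #define MD_SB_CHANGE_CLEAN   (1u << 1)
      #define MD_SB_CHANGE_PENDING (1u << 2)

      static unsigned int sb_flags;                    /* mddev->sb_flags stand-in */
      static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
      static pthread_cond_t sb_wait = PTHREAD_COND_INITIALIZER;

      static void *raid5d_like(void *unused)
      {
              pthread_mutex_lock(&lock);
              /* First read: nothing besides CHANGE_PENDING is set, so the
               * md_check_recovery() branch is skipped.                        */
              int other_bits_set = sb_flags & ~MD_SB_CHANGE_PENDING;
              pthread_mutex_unlock(&lock);

              if (other_bits_set) {
                      /* md_check_recovery() would clear the flags here.       */
              }

              sleep(1);                 /* the writer sets both bits here       */

              /* Second read: CHANGE_PENDING is now set, and only the branch
               * we just skipped would ever clear it -- wait forever.          */
              pthread_mutex_lock(&lock);
              while (sb_flags & MD_SB_CHANGE_PENDING) {
                      puts("raid5d: stuck waiting for MD_SB_CHANGE_PENDING");
                      pthread_cond_wait(&sb_wait, &lock);
              }
              pthread_mutex_unlock(&lock);
              return NULL;
      }

      static void *md_write_start_like(void *unused)
      {
              usleep(100 * 1000);       /* land between the two reads above     */
              pthread_mutex_lock(&lock);
              sb_flags |= MD_SB_CHANGE_CLEAN | MD_SB_CHANGE_PENDING;
              pthread_cond_broadcast(&sb_wait); /* md_wakeup_thread() stand-in:
                                                   in the kernel this wakes
                                                   raid5d, but its wait
                                                   condition stays false       */
              while (sb_flags & MD_SB_CHANGE_PENDING) {
                      puts("writer: stuck in md_write_start");
                      pthread_cond_wait(&sb_wait, &lock);
              }
              pthread_mutex_unlock(&lock);
              return NULL;
      }

      int main(void)
      {
              pthread_t a, b;
              pthread_create(&a, NULL, raid5d_like, NULL);
              pthread_create(&b, NULL, md_write_start_like, NULL);
              pthread_join(a, NULL);    /* never returns: both threads hang     */
              pthread_join(b, NULL);
              return 0;
      }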

      A vmcore of this deadlock is on galvatron-x86.cee.redhat.com at /cores/retrace/tasks/755512355/crash/vmcore

      What were you trying to do that didn't work?

      Writes to a raid5 md device may hang and deadlock while a sync check is running.

      Please provide the package NVR for which bug is seen:

      kernel 4.18.0-513.11.1.el8_9.x86_64

      How reproducible:

      Occurs randomly for one customer when the weekly raid check cron job runs.
