  1. OpenShift Virtualization
  2. CNV-55651

Windows Server Failover Cluster (WSFC) validation is not working with multipath LUNs


    • Bug
    • Resolution: Done-Errata
    • Critical
    • CNV v4.17.8
    • CNV v4.17.3
    • Storage Ecosystem
    • None
    • Quality / Stability / Reliability
    • 13
    • False
    • False
    • None
    • CNV Storage 268, CNV Storage 269
    • Critical
    • Customer Reported
    • None

      Description of problem:

      I configured two Windows Server 2019 VMs for WSFC. The cluster VMs run on two separate nodes. The shared disk is an iSCSI LUN, and multipath is enabled on the OCP nodes.

      The disks are passed with "reservation: true". During the cluster validation run from Windows, the "list disks" test passes, but "Validate SCSI-3 Persistent Reservation" fails with the following error:

      Failure issuing call to Persistent Reservation RESERVE on Test Disk 0 from node WIN-5E7SHBGUBJP.mywincluster.com when that node has successfully registered. It is expected to succeed. The request could not be performed because of an I/O device error.
      .
      Test Disk 0 does not provide Persistent Reservations support for the mechanisms used by failover clusters. Some storage devices require specific firmware versions or settings to function properly with failover clusters. Please contact your storage administrator or storage vendor to check the configuration of the storage to allow it to function properly with failover clusters.
      

      I can add the disk to the cluster, and it comes online on one of the nodes. However, if I drain the owner node, the disk goes offline with the following error:

      Cluster resource 'Cluster Disk 1' of type 'Physical Disk' in clustered role 'Available Storage' failed. The error code was '0xaa' ('The requested resource is in use.').

      I ran strace on the qemu-pr process on the node and can see the following error while it makes the new reservation:

      132910 08:37:03.867058 ioctl(15</dev/sdb<block 8:16>>, SG_IO, {interface_id='S', dxfer_direction=SG_DXFER_TO_DEV, cmd_len=10, cmdp="\x5f\x01\x05\x00\x00\x00\x00\x00\x18\x00", mx_sb_len=160, iovec_count=0, dxfer_len=24, timeout=2000, flags=0, dxferp="\x6f\x2b\x80\x34\x74\x66\x73\x4d\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", status=0x18, masked_status=0xc, msg_status=0, sb_len_wr=0, sbp="", host_status=0, driver_status=0, resid=0, duration=19, info=SG_INFO_CHECK}) = 0
      
      3684797 08:37:03.938222 write(2<pipe:[483278317]>, "mpathb: configured reservation key doesn't match: 0x0\n", 54) = 54
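
      To make the SG_IO call above easier to read, here is a minimal sketch that decodes the CDB, the parameter data, and the SCSI status from the strace line. It assumes the PERSISTENT RESERVE OUT layout from SPC-3 (opcode 0x5F, service action in byte 1, scope and type in byte 2, parameter list length in bytes 5 to 8). The function and table names are illustrative and do not come from QEMU or multipath-tools.

```python
# Sketch: decode the PERSISTENT RESERVE OUT request captured by strace.
# Field layout is assumed from SPC-3; all names here are illustrative.

SERVICE_ACTIONS = {
    0x00: "REGISTER",
    0x01: "RESERVE",
    0x02: "RELEASE",
    0x03: "CLEAR",
    0x04: "PREEMPT",
    0x05: "PREEMPT AND ABORT",
    0x06: "REGISTER AND IGNORE EXISTING KEY",
    0x07: "REGISTER AND MOVE",
}

PR_TYPES = {
    0x1: "Write Exclusive",
    0x3: "Exclusive Access",
    0x5: "Write Exclusive, Registrants Only",
    0x6: "Exclusive Access, Registrants Only",
    0x7: "Write Exclusive, All Registrants",
    0x8: "Exclusive Access, All Registrants",
}

SCSI_STATUS = {
    0x00: "GOOD",
    0x02: "CHECK CONDITION",
    0x08: "BUSY",
    0x18: "RESERVATION CONFLICT",
}


def decode_pr_out(cdb: bytes, params: bytes, status: int) -> dict:
    """Decode a 10-byte PERSISTENT RESERVE OUT CDB and its parameter list."""
    if len(cdb) != 10 or cdb[0] != 0x5F:
        raise ValueError("not a 10-byte PERSISTENT RESERVE OUT CDB")
    return {
        "service_action": SERVICE_ACTIONS.get(cdb[1] & 0x1F, "unknown"),
        "scope": cdb[2] >> 4,
        "type": PR_TYPES.get(cdb[2] & 0x0F, "unknown"),
        "param_list_len": int.from_bytes(cdb[5:9], "big"),
        # First 8 bytes of the parameter list: the initiator's reservation key.
        "reservation_key": hex(int.from_bytes(params[0:8], "big")),
        # Next 8 bytes: the service action reservation key.
        "service_action_key": hex(int.from_bytes(params[8:16], "big")),
        "status": SCSI_STATUS.get(status, hex(status)),
    }


# Values copied from the strace line (cmdp, dxferp, status).
cdb = bytes.fromhex("5f010500000000001800")
params = bytes.fromhex("6f2b80347466734d") + bytes(16)
result = decode_pr_out(cdb, params, 0x18)

for field, value in result.items():
    print(f"{field}: {value}")
```

      Read this way, the request is a RESERVE of type "Write Exclusive, Registrants Only" with key 0x6f2b80347466734d, and the device answered with status 0x18 (RESERVATION CONFLICT).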

      The error comes from the mpathpersist API, and it reports the key as "0x0". Also note that the qemu-pr connection to multipathd fails:

      3684797 08:37:08.176570 connect(15<UNIX-STREAM:[489686834]>, {sa_family=AF_UNIX, sun_path=@"/org/kernel/linux/storage/multipathd"}, 39) = -1 ECONNREFUSED (Connection refused)

      Since qemu-pr runs in the virt-handler pod, it currently has no way to communicate with multipathd on the node. It therefore cannot send the key to multipathd, and multipathd cannot save it. Is this why the error shows the key as "0x0"? I am not sure mpathpersist will work without the multipathd daemon.
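
      The failed connect() above can be reproduced from any process with a small probe like the sketch below. The socket address is taken from the strace output; the leading "@" that strace prints denotes a Linux abstract-namespace socket, which Python addresses with a leading NUL byte. Abstract UNIX sockets are scoped to the network namespace, so a process in a different network namespace than multipathd cannot reach it. Whether that is what happens inside the virt-handler pod is an assumption this probe can help check, not a confirmed diagnosis. The function name is hypothetical.

```python
import socket

# Abstract-namespace address seen in the strace output. The leading NUL byte
# marks an abstract (non-filesystem) UNIX socket on Linux.
MULTIPATHD_SOCKET = "\0/org/kernel/linux/storage/multipathd"


def probe_unix_socket(address: str = MULTIPATHD_SOCKET, timeout: float = 2.0):
    """Try to connect to a UNIX stream socket; return (reachable, error_text)."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect(address)
        return True, ""
    except (OSError, ValueError) as exc:
        # ECONNREFUSED here means nothing is listening on that address in
        # the namespace this process runs in.
        return False, str(exc)
    finally:
        sock.close()


reachable, error = probe_unix_socket()
if reachable:
    print("multipathd socket is reachable")
else:
    print(f"multipathd socket is not reachable: {error}")
```

      Running the probe once on the node itself and once inside the pod where qemu-pr runs would show whether the daemon is simply unreachable from the pod.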

      If I remove the disk from multipath, all the validation tests pass and I can move disk ownership between the nodes without any problem.

      Version-Release number of selected component (if applicable):

      OpenShift Virtualization            4.17.3

      How reproducible:

      100%

      Steps to Reproduce:

      1. Create an iSCSI PV and PVC.

      2. Enable multipath on the OCP nodes where the VMs are running:

      # mpathconf --enable
      # systemctl restart multipathd
      

      3. Since I only have a single path, I also had to set "find_multipaths no". Before the test, confirm that the iSCSI device has been added to multipath.

      4. Create an AD server and connect two Windows Server 2019 VMs to it.

      5. Pass the disk to both the VMs:

                - lun:
                    bus: scsi
                    reservation: true
                  name: disk-chocolate-crane-84
                  shareable: true

      6. Create the WSFC cluster and run the validation test from it. It fails at the "Validate SCSI-3 Persistent Reservation" test.

      Actual results:

      Windows Server Failover Cluster (WSFC) validation is not working with multipath LUNs

      Expected results:

      Most production environments have multiple paths to their SAN LUNs, so WSFC should work with multipath.

      Additional info:

       

        1. strace.txt (67 kB)
        2. Screenshot from 2025-03-16 21-31-34.png (322 kB)
        3. Screenshot from 2025-03-16 20-41-43.png (395 kB)
        4. report.pdf (353 kB)
        5. qemu-pr-helper_strace.tar (1.64 MB)
        6. qemu-pr-helper_strace (1.43 MB)

              afrosirh Alice Frosi
              rhn-support-nashok Nijin Ashok
              Alice Frosi
              Kevin Alon Goldblatt Kevin Alon Goldblatt