RHEL-17811

qla2xxx: NVMe-FC path is not restored after multiple failover/failbacks on Storage array

    • None
    • Low
    • rhel-sst-storage-io
    • ssg_filesystems_storage_and_HA
    • 3
    • False
    • None
    • None
    • None
    • None
    • x86_64
    • None

      While running pNATE for RHEL-8.10, the test failed during path validation.  

       After a failover and giveback, pnate-client-03 reports only 3 paths instead of 4 to the NVMe-FC namespaces:

      # nvme list-subsys /dev/nvme2n1
        nvme-subsys2 - NQN=nqn.1992-08.com.netapp:sn.dd2bb30cfa2a11ed8f2400a098cbcac6:subsystem.nvme_1
        \
         +- nvme0 fc traddr=nn-0x211600a098cbcac6:pn-0x213b00a098cbcac6 host_traddr=nn-0x2000f4c7aa065db5:pn-0x2100f4c7aa065db5 live optimized
         +- nvme1 fc traddr=nn-0x211600a098cbcac6:pn-0x207d00a098cbcac6 host_traddr=nn-0x2000f4c7aa065db5:pn-0x2100f4c7aa065db5 live non-optimized
         +- nvme4 fc traddr=nn-0x211600a098cbcac6:pn-0x200900a098cbcac6 host_traddr=nn-0x2000f4c7aa065db4:pn-0x2100f4c7aa065db4 live optimized
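
The path check described above can be approximated with a short script. This is only a minimal sketch, assuming the `nvme list-subsys` text layout shown in this report; the function names and the expected count of 4 are illustrative and are not part of pNATE or nvme-cli.

```python
import re
from collections import Counter

# Sample copied from the `nvme list-subsys /dev/nvme2n1` output in this report.
SAMPLE = "\n".join([
    "nvme-subsys2 - NQN=nqn.1992-08.com.netapp:sn.dd2bb30cfa2a11ed8f2400a098cbcac6:subsystem.nvme_1",
    "\\",
    " +- nvme0 fc traddr=nn-0x211600a098cbcac6:pn-0x213b00a098cbcac6 host_traddr=nn-0x2000f4c7aa065db5:pn-0x2100f4c7aa065db5 live optimized",
    " +- nvme1 fc traddr=nn-0x211600a098cbcac6:pn-0x207d00a098cbcac6 host_traddr=nn-0x2000f4c7aa065db5:pn-0x2100f4c7aa065db5 live non-optimized",
    " +- nvme4 fc traddr=nn-0x211600a098cbcac6:pn-0x200900a098cbcac6 host_traddr=nn-0x2000f4c7aa065db4:pn-0x2100f4c7aa065db4 live optimized",
])

# Assumption: each path line has the form
#   +- <ctrl> <transport> traddr=<...> host_traddr=<...> <state> [<ana-state>]
PATH_RE = re.compile(
    r"^\s*\+-\s+(?P<ctrl>nvme\d+)\s+(?P<transport>\S+)\s+"
    r"traddr=(?P<traddr>\S+)\s+host_traddr=(?P<host_traddr>\S+)\s+"
    r"(?P<state>\S+)(?:\s+(?P<ana>\S+))?\s*$"
)


def parse_paths(text):
    """Return one dict per controller path line found in the text."""
    paths = []
    for line in text.splitlines():
        match = PATH_RE.match(line)
        if match:
            paths.append(match.groupdict())
    return paths


def check_paths(text, expected=4):
    """Return (live_path_count, count_matches_expected)."""
    live = [p for p in parse_paths(text) if p["state"] == "live"]
    return len(live), len(live) == expected


def paths_per_host_port(text):
    """Count live paths per initiator port (host_traddr)."""
    return Counter(
        p["host_traddr"] for p in parse_paths(text) if p["state"] == "live"
    )


if __name__ == "__main__":
    count, ok = check_paths(SAMPLE)
    print(f"live paths: {count}, as expected: {ok}")
    for host, n in paths_per_host_port(SAMPLE).items():
        print(f"{host}: {n} path(s)")
```

Grouping by `host_traddr` shows which initiator port is short a path. In the sample above that is the port ending in `5db4`, the same host port named in the failed nvmf-connect unit in the logs.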

      No issues with the FC LUNs using the same initiator ports:

      # multipath -ll
      3600a098038304267573f4d3778506432 dm-19 NETAPP,LUN C-Mode
      size=80G features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 alua' wp=rw
      |-+- policy='service-time 0' prio=50 status=active
      | |- 11:0:3:5  sdbs 68:96  active ready running
      | `- 9:0:3:5   sdac 65:192 active ready running
      `-+- policy='service-time 0' prio=10 status=enabled
        |- 9:0:2:5   sdi  8:128  active ready running
        `- 11:0:1:5  sdax 67:16  active ready running

      Below are the logs for pnate-03:

      http://people.redhat.com/mpatalan/.netapp/pnate-client-03_fc_fcnvme.4.18.0-526.el8.txt.gz

      Ewan looked at the issue and provided the following info:

      This looks like the problem:

      Nov 29 22:56:34 pnate-client-03.sqe.lab.eng.bos.redhat.com sh[100597]:
      Get discovery log page failed: -11
      Nov 29 22:56:34 pnate-client-03.sqe.lab.eng.bos.redhat.com systemd[1]:
      nvmf-connect@-device\x3dnone\ttransport\x3dfc\ttraddr\x3dnn-0x211600a098cbcac6:pn-0x210f00a098cbcac6\ttrsvcid\x3dnone\t-host-traddr\
      \x3dnn-0x2000f4c7aa065db4:pn-0x2100f4c7aa065db4.service: Main process
      exited, code=exited, status=11/n/a
      Nov 29 22:56:34 pnate-client-03.sqe.lab.eng.bos.redhat.com systemd[1]:
      nvmf-connect@-device\x3dnone\ttransport\x3dfc\ttraddr\x3dnn-0x211600a098cbcac6:pn-0x210f00a098cbcac6\ttrsvcid\x3dnone\t-host-traddr\
      \x3dnn-0x2000f4c7aa065db4:pn-0x2100f4c7aa065db4.service: Failed with
      result 'exit-code'.

      Elsewhere in the log, we see two discovery controllers instantiated,
      followed by the nvme controllers that access the subsystem.

      Around this time, though, the second controller is never instantiated.
      It looks like the nvme-cli command failed.  The exit status=11
      appears to be the -11 from the earlier "Get discovery log page failed"
      error.  11 is EAGAIN, which is not generated by the NVMe/FC code.  However...

      commit 3e8721c6f1216aeb6fcd64cd61a86a8176308d3d
      Author: Nilesh Javali <njavali@redhat.com>
      Date:   Mon Sep 18 10:51:12 2023 +0000

          scsi: qla2xxx: Fix error code in qla2x00_start_sp()

          JIRA: https://issues.redhat.com/browse/RHEL-9859

          Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git

          commit e579b007eff3ff8d29d59d16214cd85fb9e573f7
          Author: Dan Carpenter <dan.carpenter@linaro.org>
          Date:   Mon Jun 26 13:58:47 2023 +0300

              scsi: qla2xxx: Fix error code in qla2x00_start_sp()

              This should be negative -EAGAIN instead of positive.  The callers treat
              non-zero error codes the same so it doesn't really impact runtime beyond
              some trivial differences to debug output.

              Fixes: 80676d054e5a ("scsi: qla2xxx: Fix session cleanup hang")
              Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
              Link: https://lore.kernel.org/r/49866d28-4cfe-47b0-842b-78f110e61aab@moroto.mountain
              Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

          Signed-off-by: Nilesh Javali <njavali@redhat.com>

      diff --git a/drivers/scsi/qla2xxx/qla_iocb.c b/drivers/scsi/qla2xxx/qla_iocb.c
      index 18409ada0fff..d8d27c1e182c 100644
      --- a/drivers/scsi/qla2xxx/qla_iocb.c
      +++ b/drivers/scsi/qla2xxx/qla_iocb.c
      @@ -3913,7 +3913,7 @@ qla2x00_start_sp(srb_t *sp)

              pkt = __qla2x00_alloc_iocbs(sp->qpair, sp);
              if (!pkt) {
      -               rval = EAGAIN;
      +               rval = -EAGAIN;
                      ql_log(ql_log_warn, vha, 0x700c,
                          "qla2x00_alloc_iocbs failed.\n");

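To illustrate why the sign of the return value can matter, here is a small Python model of the two conventions in the diff above. It is a hedged sketch only: the function names are hypothetical stand-ins, not driver code, and this report does not establish which kind of check any caller in the NVMe-FC path actually performs. The commit message itself states that callers treat non-zero codes the same.

```python
import errno
import os


def start_request_positive(resources_available):
    """Hypothetical stand-in for the pre-fix convention: positive EAGAIN."""
    return 0 if resources_available else errno.EAGAIN


def start_request_negative(resources_available):
    """Hypothetical stand-in for the post-fix convention: negative -EAGAIN."""
    return 0 if resources_available else -errno.EAGAIN


def caller_nonzero_check(rval):
    """A caller that treats any non-zero value as failure."""
    return rval != 0


def caller_negative_check(rval):
    """A caller that treats only negative values as failure."""
    return rval < 0


def describe(rval):
    """Render a return value as a symbolic errno name plus message."""
    if rval == 0:
        return "success"
    code = abs(rval)
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name} ({os.strerror(code)})"


if __name__ == "__main__":
    for start in (start_request_positive, start_request_negative):
        rval = start(resources_available=False)
        print(
            f"{start.__name__}: rval={rval} -> {describe(rval)}, "
            f"nonzero check fails={caller_nonzero_check(rval)}, "
            f"negative check fails={caller_negative_check(rval)}"
        )
```

Both conventions look identical to a non-zero check, but only the negative value is seen as an error by a `< 0` check. On Linux, `errno.EAGAIN` is 11, which matches the `-11` and `status=11` values in the logs.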
      How reproducible: Once

      Please provide the package NVR for which bug is seen:

      RHEL-8.10.0-20231121.1
      kernel-4.18.0-526.el8

      Steps to reproduce

      1. run pNATE for NVMe-FC/FC

       

              njavali Nilesh Javali
              mpatalan Marco Patalano
              Nilesh Javali Nilesh Javali
              storage-qe storage-qe