RHEL / RHEL-18181

RHEL 9.3 GA experiences NVMe/FC paths occasionally entering an unrecoverable state of "NVME-FC{1}: resetting controller" followed by "NVME-FC{1}: Couldn't schedule reset" during path failure/recovery

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Normal
    • Component: nvme-cli
    • Severity: Moderate
    • sst_storage_io
    • ssg_platform_storage
    • Architecture: x86_64

      What were you trying to do that didn't work?

      While testing RHEL 9.3 using NVMe/FC paths to our E-Series Storage Array, we have encountered scenarios where NVMe/FC paths fail to return properly. This most commonly occurs during a port "bounce," in which a port on the array is rapidly shut down and re-enabled. Most of the time the paths are quickly marked failed and then recovered, but occasionally an nvme path reports that it failed to reset and never comes back. Here are the logs from /var/log/messages when this occurs:

      Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: nvme nvme1: NVME-FC{1}: io failed due to lldd error 6
      Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: qla2xxx [0000:24:00.1]-2112:10: qla_nvme_unregister_remote_port: unregister remoteport on 00000000898531fc 2042d039ea44c86d
      Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: nvme nvme1: NVME-FC{1}: transport association event: transport detected io error
      Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: nvme nvme1: NVME-FC{1}: resetting controller
      Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: nvme nvme1: NVME-FC{1}: controller connectivity lost. Awaiting Reconnect
      Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: nvme nvme1: NVME-FC{1}: Couldn't schedule reset.
      Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: nvme nvme1: Removing ctrl: NQN "nqn.1992-08.com.netapp:3000.6d039ea00044c85d00000000627b7b10"
      Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: qla2xxx [0000:24:00.1]-2110:10: remoteport_delete of 00000000898531fc 2042d039ea44c86d completed.
      Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: nvme nvme1: NVME-FC{1}: error_recovery: Couldn't change state to CONNECTING
      Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: qla2xxx [0000:24:00.1]-3002:10: nvme: Sched: Set ZIO exchange threshold to 0.
      Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: qla2xxx [0000:24:00.1]-11a2:10: FEC=enabled (data rate).
      Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: qla2xxx [0000:24:00.1]-ffffff:10: SET ZIO Activity exchange threshold to 5.
      Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: qla2xxx [0000:24:00.1]-2102:10: qla_nvme_register_remote: traddr=nn-0x2002d039ea44c86d:pn-0x2042d039ea44c86d PortID:000002
      Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com systemd[1]: Started NVMf auto-connect scan upon nvme discovery controller Events.
      Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: nvme nvme0: NVME-FC{0}: create association : host wwpn 0x21000024ff7d3d7d rport wwpn 0x2042d039ea44c86d: NQN "nqn.2014-08.org.nvmexpress.discovery"
      Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: qla2xxx [0000:24:00.1]-2104:10: qla_nvme_alloc_queue: handle 00000000fbd0c857, idx =0, qsize 32
      Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: qla2xxx [0000:24:00.1]-2121:10: Returning existing qpair of 000000004afacf3e for idx=0
      Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: nvme nvme0: queue_size 128 > ctrl maxcmd 32, reducing to maxcmd
      Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: nvme nvme0: NVME-FC{0}: controller connect complete
      Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: nvme nvme0: NVME-FC{0}: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
      Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: nvme_fc: nvme_fc_create_ctrl: nn-0x2002d039ea44c86d:pn-0x2012d039ea44c86d - nn-0x20000024ff7d3d7d:pn-0x21000024ff7d3d7d combination not found
      Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: nvme_fc: nvme_fc_create_ctrl: nn-0x2002d039ea44c86d:pn-0x2022d039ea44c86d - nn-0x20000024ff7d3d7d:pn-0x21000024ff7d3d7d combination not found
      Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: nvme_fc: nvme_fc_create_ctrl: nn-0x2002d039ea44c86d:pn-0x2032d039ea44c86d - nn-0x20000024ff7d3d7d:pn-0x21000024ff7d3d7d combination not found
      Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: nvme_fc: nvme_fc_create_ctrl: nn-0x2002d039ea44c86d:pn-0x2013d039ea44c86d - nn-0x20000024ff7d3d7d:pn-0x21000024ff7d3d7d combination not found
      Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: nvme_fc: nvme_fc_create_ctrl: nn-0x2002d039ea44c86d:pn-0x2023d039ea44c86d - nn-0x20000024ff7d3d7d:pn-0x21000024ff7d3d7d combination not found
      Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: nvme_fc: nvme_fc_create_ctrl: nn-0x2002d039ea44c86d:pn-0x2033d039ea44c86d - nn-0x20000024ff7d3d7d:pn-0x21000024ff7d3d7d combination not found
      Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: nvme_fc: nvme_fc_create_ctrl: nn-0x2002d039ea44c86d:pn-0x2043d039ea44c86d - nn-0x20000024ff7d3d7d:pn-0x21000024ff7d3d7d combination not found
      Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: nvme nvme0: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
      Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com systemd[1]: nvmf-connect@-device\x3dnone\ttransport\x3dfc\ttraddr\x3dnn-0x2002d039ea44c86d:pn-0x2042d039ea44c86d\ttrsvcid\x3dnone\t-host-traddr\x3dnn-0x20000024ff7d3d7d:pn-0x21000024ff7d3d7d.service: Deactivated successfully.

      We actually have 4x hosts in this configuration: Host1 is fabric-attached Broadcom cards, Host2 is fabric-attached QLogic cards, Host3 is direct-attached Broadcom cards, and Host4 is direct-attached QLogic cards. We have only been able to hit this on Host4, though we have hit it on both of its paths. This has us curious whether the issue is specific to a direct-attached QLogic environment.

      Here are the specifics of the QLogic cards currently in use on this host:
      1x port QLE2742 w/ 9.10.11 fw
      1x port QLE2772 w/ 9.10.11 fw

      Again, both ports have experienced this issue; the logs above are from a single reproduction.

      Can Red Hat assist us in properly root-causing this issue, possibly by collecting additional logging or tracing?
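      One low-cost option we could run on the reproduction host is the kernel's dynamic debug facility to get verbose printks from the drivers involved. A sketch, assuming CONFIG_DYNAMIC_DEBUG is enabled and with the module list taken from the log above; please advise if different tracing is preferred:

```shell
#!/bin/sh
# Sketch: emit dynamic-debug rules enabling verbose printks ("+p") for
# the modules seen in the failure log. Applying them requires root and
# a kernel built with CONFIG_DYNAMIC_DEBUG.
DDCTL=/sys/kernel/debug/dynamic_debug/control

debug_cmds() {
    # one "module <name> +p" rule per module of interest
    for m in nvme_fc nvme_core qla2xxx; do
        printf 'module %s +p\n' "$m"
    done
}

# On the test host (as root), apply with:
#   debug_cmds | while read -r c; do echo "$c" > "$DDCTL"; done
debug_cmds
```

      The extra messages land in the kernel log alongside the NVME-FC lines above, so one journal capture covers both.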

      Please provide the package NVR for which the bug is seen:

      This was encountered with RHEL 9.3 GA:
      kernel-5.14.0-316.el9.x86_64
      nvme-cli-2.4-10.el9.x86_64

      How reproducible:

      This is not seen on every port bounce but is still relatively reproducible; it can easily be hit after a handful of attempts.

      Steps to reproduce

      1. Install RHEL 9.3 GA
      2. Set up NVMe/FC connections to a NetApp target
      3. Run I/O to the NetApp target
      4. Bounce ports on the NetApp target until a path fails to return
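      After step 4, the host-side check boils down to verifying that every fabrics controller returned to the live state. A minimal sketch against captured `nvme list-subsys` output (the state names are the ones nvme-cli prints for fabrics controllers; exact output formatting may vary by nvme-cli version):

```shell
#!/bin/sh
# Sketch: decide from saved `nvme list-subsys` output whether all NVMe/FC
# controllers came back after a bounce. States such as "live",
# "connecting", "deleting", and "resetting" are what nvme-cli reports
# for fabrics controllers.
check_paths() {
    # $1: file holding `nvme list-subsys` output
    if grep -Eq 'connecting|deleting|resetting' "$1"; then
        echo "path failed to recover"
    elif grep -q 'live' "$1"; then
        echo "all paths live"
    else
        echo "no controllers found"
    fi
}
```

      In our runs, a reproduction is the case where this keeps reporting a failed path indefinitely, matching the "Couldn't schedule reset" behavior above.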

      Expected results

      All paths return on every port bounce

      Actual results

      Eventually a path fails to recover properly after a port bounce.

            njavali Nilesh Javali
            cxskaggs Clayton Skaggs
            NetApp Confidential Group
            Ewan Milne
            Yi Zhang
            Votes: 0
            Watchers: 6