Bug
Resolution: Unresolved
Normal
Moderate
rhel-sst-storage-io
ssg_filesystems_storage_and_HA
x86_64
What were you trying to do that didn't work?
While testing RHEL 9.3 using NVMe/FC paths to our E-Series Storage Array, we have encountered scenarios where NVMe/FC paths fail to return properly. This most commonly occurs during a port "bounce", in which a port on the array is rapidly shut down and re-enabled. Most of the time the path is quickly marked failed and then recovered, but occasionally the NVMe path reports that it has "failed to reset" and never comes back. Here are the logs from /var/log/messages when this occurs:
Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: nvme nvme1: NVME-FC{1}: io failed due to lldd error 6
Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: qla2xxx [0000:24:00.1]-2112:10: qla_nvme_unregister_remote_port: unregister remoteport on 00000000898531fc 2042d039ea44c86d
Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: nvme nvme1: NVME-FC{1}: transport association event: transport detected io error
Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: nvme nvme1: NVME-FC{1}: resetting controller
Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: nvme nvme1: NVME-FC{1}: controller connectivity lost. Awaiting Reconnect
Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: nvme nvme1: NVME-FC: Couldn't schedule reset.
Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: nvme nvme1: Removing ctrl: NQN "nqn.1992-08.com.netapp:3000.6d039ea00044c85d00000000627b7b10"
Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: qla2xxx [0000:24:00.1]-2110:10: remoteport_delete of 00000000898531fc 2042d039ea44c86d completed.
Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: nvme nvme1: NVME-FC{1}: error_recovery: Couldn't change state to CONNECTING
Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: qla2xxx [0000:24:00.1]-3002:10: nvme: Sched: Set ZIO exchange threshold to 0.
Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: qla2xxx [0000:24:00.1]-11a2:10: FEC=enabled (data rate).
Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: qla2xxx [0000:24:00.1]-ffffff:10: SET ZIO Activity exchange threshold to 5.
Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: qla2xxx [0000:24:00.1]-2102:10: qla_nvme_register_remote: traddr=nn-0x2002d039ea44c86d:pn-0x2042d039ea44c86d PortID:000002
Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com systemd[1]: Started NVMf auto-connect scan upon nvme discovery controller Events.
Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: nvme nvme0: NVME-FC
Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: qla2xxx [0000:24:00.1]-2104:10: qla_nvme_alloc_queue: handle 00000000fbd0c857, idx =0, qsize 32
Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: qla2xxx [0000:24:00.1]-2121:10: Returning existing qpair of 000000004afacf3e for idx=0
Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: nvme nvme0: queue_size 128 > ctrl maxcmd 32, reducing to maxcmd
Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: nvme nvme0: NVME-FC{0}: controller connect complete
Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: nvme nvme0: NVME-FC{0}: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: nvme_fc: nvme_fc_create_ctrl: nn-0x2002d039ea44c86d:pn-0x2012d039ea44c86d - nn-0x20000024ff7d3d7d:pn-0x21000024ff7d3d7d combination not found
Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: nvme_fc: nvme_fc_create_ctrl: nn-0x2002d039ea44c86d:pn-0x2022d039ea44c86d - nn-0x20000024ff7d3d7d:pn-0x21000024ff7d3d7d combination not found
Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: nvme_fc: nvme_fc_create_ctrl: nn-0x2002d039ea44c86d:pn-0x2032d039ea44c86d - nn-0x20000024ff7d3d7d:pn-0x21000024ff7d3d7d combination not found
Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: nvme_fc: nvme_fc_create_ctrl: nn-0x2002d039ea44c86d:pn-0x2013d039ea44c86d - nn-0x20000024ff7d3d7d:pn-0x21000024ff7d3d7d combination not found
Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: nvme_fc: nvme_fc_create_ctrl: nn-0x2002d039ea44c86d:pn-0x2023d039ea44c86d - nn-0x20000024ff7d3d7d:pn-0x21000024ff7d3d7d combination not found
Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: nvme_fc: nvme_fc_create_ctrl: nn-0x2002d039ea44c86d:pn-0x2033d039ea44c86d - nn-0x20000024ff7d3d7d:pn-0x21000024ff7d3d7d combination not found
Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: nvme_fc: nvme_fc_create_ctrl: nn-0x2002d039ea44c86d:pn-0x2043d039ea44c86d - nn-0x20000024ff7d3d7d:pn-0x21000024ff7d3d7d combination not found
Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: nvme nvme0: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com systemd[1]: nvmf-connect@-device\x3dnone\ttransport\x3dfc\ttraddr\x3dnn-0x2002d039ea44c86d:pn-0x2042d039ea44c86d\ttrsvcid\x3dnone\t-host-traddr\x3dnn-0x20000024ff7d3d7d:pn-0x21000024ff7d3d7d.service: Deactivated successful
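For anyone triaging this, the stuck controller can be spotted from sysfs. A minimal sketch, assuming the standard nvme-fabrics sysfs layout (verify the path and state strings on your kernel):

```shell
#!/bin/sh
# Print the state of every NVMe-oF controller on the host.
# On a machine without nvme-fabrics loaded the glob matches nothing
# and the loop simply prints nothing.
for ctrl in /sys/class/nvme-fabrics/ctl/nvme*; do
    [ -e "$ctrl/state" ] || continue
    printf '%s: %s\n' "$(basename "$ctrl")" "$(cat "$ctrl/state")"
done
```

A healthy path reports "live"; in the failure above, nvme1 never re-enters the connecting state and the controller is simply removed.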
We actually have 4x hosts in this configuration: Host1 is fabric-attached Broadcom cards, Host2 is fabric-attached QLogic cards, Host3 is direct-attached Broadcom cards, and Host4 is direct-attached QLogic cards. We have only been able to hit this on Host4, though we have hit it on both paths. This makes us suspect the issue may be specific to a direct-attached QLogic environment.
Here are the specifics of the QLogic cards currently used on this host:
1x port QLE2742 w/ 9.10.11 fw
1x port QLE2772 w/ 9.10.11 fw
Again, both ports have experienced this issue; the logs above are from a single reproduction.
Can Red Hat assist us in properly root-causing this issue, possibly by collecting additional logging or tracing?
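On our side, one option for gathering more data before the next reproduction would be dynamic debug on the NVMe transport modules plus the qla2xxx extended error logging module parameter. A hedged sketch (the 0x7fffffff mask is an illustrative assumption, not a recommended value; the authoritative masks are in the qla2xxx driver documentation):

```shell
#!/bin/sh
# Enable pr_debug output for the NVMe fabrics/FC transport modules.
# Requires root, CONFIG_DYNAMIC_DEBUG, and debugfs mounted; the guards
# make this a no-op elsewhere.
ctl=/sys/kernel/debug/dynamic_debug/control
if [ -w "$ctl" ]; then
    echo 'module nvme_fc +p' > "$ctl"
    echo 'module nvme_fabrics +p' > "$ctl"
fi
# qla2xxx logs through its own modparam rather than pr_debug;
# 0x7fffffff is an example "log everything" mask (assumption).
p=/sys/module/qla2xxx/parameters/ql2xextended_error_logging
if [ -w "$p" ]; then
    echo 0x7fffffff > "$p"
fi
```

With both enabled, the extra kernel messages land in /var/log/messages alongside the transport errors shown above.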
Please provide the package NVR for which the bug is seen:
This was encountered with RHEL 9.3 GA:
kernel-5.14.0-316.el9.x86_64
nvme-cli-2.4-10.el9.x86_64
How reproducible:
This is not seen on every port bounce, but it is relatively reproducible; it can typically be triggered within a handful of attempts.
Steps to reproduce
- Install RHEL 9.3 GA
- Set up NVMe/FC connections to a NetApp target
- Run IO to NetApp Target
- Bounce ports on the NetApp Target until a path fails to return
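The check after each bounce can be scripted on the host side. A sketch, assuming nvme-cli is installed and that EXPECTED_PATHS is set to the path count of your own topology (both are assumptions, not part of the original report):

```shell
#!/bin/sh
# After each target port bounce, count paths reported "live" by nvme-cli
# and flag a shortfall. EXPECTED_PATHS is a placeholder for your topology.
EXPECTED_PATHS=${EXPECTED_PATHS:-2}
live=$(nvme list-subsys 2>/dev/null | grep -c ' live' || true)
if [ "$live" -lt "$EXPECTED_PATHS" ]; then
    echo "only $live of $EXPECTED_PATHS paths live after bounce" >&2
fi
```

Repeating the bounce until this reports a shortfall reproduces the failure described under "Actual results".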
Expected results
All paths return on every port bounce
Actual results
Eventually a path fails to recover properly after a port bounce.