RHEL / RHEL-18181

RHEL 9.3 GA experiences NVMe/FC paths occasionally entering an unrecoverable state of "NVME-FC{1}: resetting controller" followed by "NVME-FC{1}: Couldn't schedule reset" during path failure/recovery

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Normal
    • Component: nvme-cli
    • Severity: Moderate
    • sst_storage_io
    • ssg_platform_storage
    • Architecture: x86_64

      What were you trying to do that didn't work?

      While testing RHEL 9.3 using NVMe/FC paths to our E-Series Storage Array, we have encountered scenarios where NVMe/FC paths fail to return properly. This most commonly occurs during a port "bounce," in which a port on the array is rapidly shut down and re-enabled. Most of the time the paths are quickly marked failed and then recovered, but occasionally an nvme path reports that it failed to reset and never comes back. Here are the logs from /var/log/messages when this occurs:

      Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: nvme nvme1: NVME-FC{1}: io failed due to lldd error 6
      Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: qla2xxx [0000:24:00.1]-2112:10: qla_nvme_unregister_remote_port: unregister remoteport on 00000000898531fc 2042d039ea44c86d
      Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: nvme nvme1: NVME-FC{1}: transport association event: transport detected io error
      Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: nvme nvme1: NVME-FC{1}: resetting controller
      Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: nvme nvme1: NVME-FC{1}: controller connectivity lost. Awaiting Reconnect
      Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: nvme nvme1: NVME-FC{1}: Couldn't schedule reset.
      Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: nvme nvme1: Removing ctrl: NQN "nqn.1992-08.com.netapp:3000.6d039ea00044c85d00000000627b7b10"
      Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: qla2xxx [0000:24:00.1]-2110:10: remoteport_delete of 00000000898531fc 2042d039ea44c86d completed.
      Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: nvme nvme1: NVME-FC{1}: error_recovery: Couldn't change state to CONNECTING
      Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: qla2xxx [0000:24:00.1]-3002:10: nvme: Sched: Set ZIO exchange threshold to 0.
      Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: qla2xxx [0000:24:00.1]-11a2:10: FEC=enabled (data rate).
      Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: qla2xxx [0000:24:00.1]-ffffff:10: SET ZIO Activity exchange threshold to 5.
      Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: qla2xxx [0000:24:00.1]-2102:10: qla_nvme_register_remote: traddr=nn-0x2002d039ea44c86d:pn-0x2042d039ea44c86d PortID:000002
      Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com systemd[1]: Started NVMf auto-connect scan upon nvme discovery controller Events.
      Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: nvme nvme0: NVME-FC{0}: create association : host wwpn 0x21000024ff7d3d7d rport wwpn 0x2042d039ea44c86d: NQN "nqn.2014-08.org.nvmexpress.discovery"
      Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: qla2xxx [0000:24:00.1]-2104:10: qla_nvme_alloc_queue: handle 00000000fbd0c857, idx =0, qsize 32
      Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: qla2xxx [0000:24:00.1]-2121:10: Returning existing qpair of 000000004afacf3e for idx=0
      Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: nvme nvme0: queue_size 128 > ctrl maxcmd 32, reducing to maxcmd
      Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: nvme nvme0: NVME-FC{0}: controller connect complete
      Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: nvme nvme0: NVME-FC{0}: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
      Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: nvme_fc: nvme_fc_create_ctrl: nn-0x2002d039ea44c86d:pn-0x2012d039ea44c86d - nn-0x20000024ff7d3d7d:pn-0x21000024ff7d3d7d combination not found
      Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: nvme_fc: nvme_fc_create_ctrl: nn-0x2002d039ea44c86d:pn-0x2022d039ea44c86d - nn-0x20000024ff7d3d7d:pn-0x21000024ff7d3d7d combination not found
      Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: nvme_fc: nvme_fc_create_ctrl: nn-0x2002d039ea44c86d:pn-0x2032d039ea44c86d - nn-0x20000024ff7d3d7d:pn-0x21000024ff7d3d7d combination not found
      Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: nvme_fc: nvme_fc_create_ctrl: nn-0x2002d039ea44c86d:pn-0x2013d039ea44c86d - nn-0x20000024ff7d3d7d:pn-0x21000024ff7d3d7d combination not found
      Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: nvme_fc: nvme_fc_create_ctrl: nn-0x2002d039ea44c86d:pn-0x2023d039ea44c86d - nn-0x20000024ff7d3d7d:pn-0x21000024ff7d3d7d combination not found
      Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: nvme_fc: nvme_fc_create_ctrl: nn-0x2002d039ea44c86d:pn-0x2033d039ea44c86d - nn-0x20000024ff7d3d7d:pn-0x21000024ff7d3d7d combination not found
      Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: nvme_fc: nvme_fc_create_ctrl: nn-0x2002d039ea44c86d:pn-0x2043d039ea44c86d - nn-0x20000024ff7d3d7d:pn-0x21000024ff7d3d7d combination not found
      Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com kernel: nvme nvme0: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
      Nov 30 05:40:14 ictam08s01h04.ict.englab.netapp.com systemd[1]: nvmf-connect@-device\x3dnone\ttransport\x3dfc\ttraddr\x3dnn-0x2002d039ea44c86d:pn-0x2042d039ea44c86d\ttrsvcid\x3dnone\t-host-traddr\x3dnn-0x20000024ff7d3d7d:pn-0x21000024ff7d3d7d.service: Deactivated successfully.

      We actually have 4x hosts in this configuration: Host1 is fabric-attached Broadcom cards, Host2 is fabric-attached QLogic cards, Host3 is direct-attached Broadcom cards, and Host4 is direct-attached QLogic cards. We have only been able to hit this on Host4, though we have hit it on both of its paths. This has us curious whether the issue is specific to a direct-attached QLogic environment.

      Here are the specifics of the QLogic cards currently in use on this host:
      1x port QLE2742 w/ 9.10.11 fw
      1x port QLE2772 w/ 9.10.11 fw

      Again, both ports have experienced this issue; the logs above are from a single reproduction.

      Can Red Hat assist us in properly root-causing this issue, possibly by collecting additional logging or tracing?
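      One low-cost option we could run on the reproduction host is the kernel's dynamic debug facility to get verbose printks from the drivers involved. A sketch, assuming CONFIG_DYNAMIC_DEBUG is enabled and with the module list taken from the log above; please advise if different tracing is preferred:

```shell
#!/bin/sh
# Sketch: emit dynamic-debug rules enabling verbose printks ("+p") for
# the modules seen in the failure log. Applying them requires root and
# a kernel built with CONFIG_DYNAMIC_DEBUG.
DDCTL=/sys/kernel/debug/dynamic_debug/control

debug_cmds() {
    # one "module <name> +p" rule per module of interest
    for m in nvme_fc nvme_core qla2xxx; do
        printf 'module %s +p\n' "$m"
    done
}

# On the test host (as root), apply with:
#   debug_cmds | while read -r c; do echo "$c" > "$DDCTL"; done
debug_cmds
```

      The extra messages land in the kernel log alongside the NVME-FC lines above, so one journal capture covers both.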

      Please provide the package NVR for which the bug is seen:

      This was encountered with RHEL 9.3 GA:
      kernel-5.14.0-316.el9.x86_64
      nvme-cli-2.4-10.el9.x86_64

      How reproducible:

      This is not seen on every port bounce but is still relatively reproducible; it can easily be hit after a handful of attempts.

      Steps to reproduce

      1. Install RHEL 9.3 GA
      2. Set up NVMe/FC connections to a NetApp target
      3. Run I/O to the NetApp target
      4. Bounce ports on the NetApp target until a path fails to return
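      After step 4, the host-side check boils down to verifying that every fabrics controller returned to the live state. A minimal sketch against captured `nvme list-subsys` output (the state names are the ones nvme-cli prints for fabrics controllers; exact output formatting may vary by nvme-cli version):

```shell
#!/bin/sh
# Sketch: decide from saved `nvme list-subsys` output whether all NVMe/FC
# controllers came back after a bounce. States such as "live",
# "connecting", "deleting", and "resetting" are what nvme-cli reports
# for fabrics controllers.
check_paths() {
    # $1: file holding `nvme list-subsys` output
    if grep -Eq 'connecting|deleting|resetting' "$1"; then
        echo "path failed to recover"
    elif grep -q 'live' "$1"; then
        echo "all paths live"
    else
        echo "no controllers found"
    fi
}
```

      In our runs, a reproduction is the case where this keeps reporting a failed path indefinitely, matching the "Couldn't schedule reset" behavior above.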

      Expected results

      All paths return on every port bounce

      Actual results

      Eventually a path fails to recover properly after a port bounce.

            njavali Nilesh Javali
            cxskaggs Clayton Skaggs
            NetApp Confidential Group
            Ewan Milne
            Yi Zhang
            Votes: 0
            Watchers: 6