What were you trying to do that didn't work?
Update nvme-cli to the latest version available on RHEL 9.0 and create an NVMe over Fabrics connection
Please provide the package NVR for which bug is seen:
nvme-cli-2.2.1-4.el9_2.x86_64
libnvme-1.2-3.el9_2.x86_64
How reproducible:
Consistently
Steps to reproduce
- Use RHEL 9.0
- Update the nvme-cli package (dnf update nvme-cli)
- Try to create an NVMe over Fabrics connection (nvme discover, nvme connect, etc.)
Expected results
Connection should be successfully created, as it is with the version of nvme-cli originally installed (1.16):
[root@init48-13 ~]# nvme discover --transport tcp --traddr 192.168.5.165
Discovery Log Number of Records 8, Generation counter 505
=====Discovery Log Entry 0======
...
Actual results
The kernel rejects the connect options nvme-cli/libnvme provides and refuses to connect the NVMe controller. This is the output from nvme-cli:
[root@init48-13 ~]# nvme discover --transport tcp --traddr 192.168.5.165
failed to add controller, error Unknown error -1
The dmesg logs show the kernel is complaining about a missing transport parameter:
[root@init48-13 ~]# dmesg | grep nvme
[311921.660879] nvme_fabrics: missing parameter 'transport=%s'
Tracing the system calls shows that nvme-cli fails to read from /dev/nvme-fabrics (to get the supported NVMe connect options) and then it writes a connect string missing the transport and traddr parameters:
[root@init48-13 ~]# strace -s 1000 nvme discover --transport tcp --traddr 192.168.5.165
...
openat(AT_FDCWD, "/dev/nvme-fabrics", O_RDONLY) = 803
read(803, 0x7fffca3278a0, 4095) = -1 EINVAL (Invalid argument)
close(803) = 0
openat(AT_FDCWD, "/dev/nvme-fabrics", O_RDWR) = 803
write(803, "nqn=nqn.2014-08.org.nvmexpress.discovery", 40) = -1 EINVAL (Invalid argument)
close(803)
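The failing read in the strace above can be checked in isolation, without going through nvme-cli. The following is a minimal sketch, not part of nvme-cli or libnvme: the helper name probe_fabrics_options is hypothetical, and the 4095-byte read size is simply copied from the trace. It opens the device read-only, attempts one read, and reports either the option template or the errno.

```python
import errno
import os


def probe_fabrics_options(path="/dev/nvme-fabrics"):
    """Return (options, err): the option template string and 0,
    or None and the errno if the open or read fails.

    Hypothetical diagnostic helper, not a libnvme API.
    """
    try:
        fd = os.open(path, os.O_RDONLY)
    except OSError as exc:
        return None, exc.errno
    try:
        # Same read size as the read() call seen in the strace.
        data = os.read(fd, 4095)
    except OSError as exc:
        return None, exc.errno
    finally:
        os.close(fd)
    return data.decode("ascii", errors="replace").strip(), 0


if __name__ == "__main__":
    options, err = probe_fabrics_options()
    if options is None:
        # On the affected kernel the trace suggests this would be EINVAL.
        print(f"read failed: {errno.errorcode.get(err, err)}")
    else:
        print(options)
```

Running it as root on both machines should show whether the kernel itself refuses the read, independent of the nvme-cli/libnvme version installed.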
Here is the kernel version:
[root@init48-13 ~]# uname -a
Linux init48-13 5.14.0-70.22.1.el9_0.x86_64 #1 SMP PREEMPT Tue Aug 2 10:02:12 EDT 2022 x86_64 x86_64 x86_64 GNU/Linux
[root@init48-13 ~]# cat /etc/redhat-release
Red Hat Enterprise Linux release 9.0 (Plow)
Notably, the same versions of the nvme-cli and libnvme packages work fine on a different machine running RHEL 9.2:
[root@init48-18 ~]# nvme version
nvme version 2.2.1 (git 2.2.1)
libnvme version 1.2 (git 1.2)
[root@init48-18 ~]# uname -a
Linux init48-18 5.14.0-284.25.1.el9_2.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Jul 20 09:11:28 EDT 2023 x86_64 x86_64 x86_64 GNU/Linux
[root@init48-18 ~]# cat /etc/redhat-release
Red Hat Enterprise Linux release 9.2 (Plow)
[root@init48-18 ~]# nvme discover --transport tcp --traddr 192.168.5.165
Discovery Log Number of Records 8, Generation counter 17
=====Discovery Log Entry 0======
...
The critical difference in the strace on that machine appears to be that reading the connection options succeeds. The connection string then has all the expected parameters:
openat(AT_FDCWD, "/dev/nvme-fabrics", O_RDONLY) = 1603
read(1603, "instance=-1,cntlid=-1,transport=%s,traddr=%s,trsvcid=%s,nqn=%s,queue_size=%d,nr_io_queues=%d,reconnect_delay=%d,ctrl_loss_tmo=%d,keep_alive_tmo=%d,hostnqn=%s,host_traddr=%s,host_iface=%s,hostid=%s,duplicate_connect,disable_sqflow,hdr_digest,data_digest,nr_write_queues=%d,nr_poll_queues=%d,tos=%d,fast_io_fail_tmo=%d,discovery,dhchap_secret=%s,dhchap_ctrl_secret=%s\n", 4095) = 366
close(1603) = 0
openat(AT_FDCWD, "/dev/nvme-fabrics", O_RDWR) = 1603
write(1603, "nqn=nqn.2014-08.org.nvmexpress.discovery,transport=tcp,traddr=192.168.5.165,trsvcid=8009,hostnqn=nqn.2014-08.org.nvmexpress:uuid:4c4c4544-0047-3610-8048-c4c04f535731,hostid=dbb3a987-e7a6-4d39-9af3-f834b4e7c784,ctrl_loss_tmo=600", 227) = 227
read(1603, "instance=8,cntlid=0\n", 4095) = 20
close(1603) = 0
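To illustrate how a failed options read could lead to the truncated connect string, here is a rough model of the suspected behavior. It is not libnvme's code: the function names are hypothetical, and the assumption that nqn is always emitted while every other option is filtered against the kernel's template is made only because it reproduces the 40-byte string seen in the failing trace.

```python
def parse_supported(template):
    """Extract option names from a template such as
    'transport=%s,traddr=%s,discovery' (format as read from the kernel)."""
    names = set()
    for field in template.strip().split(","):
        if field:
            # Flags like 'discovery' have no '=', so split() returns them whole.
            names.add(field.split("=", 1)[0])
    return names


def build_connect_string(nqn, requested, supported):
    """Build a connect string, keeping only options the kernel advertised.

    Assumption (not confirmed from libnvme source): nqn is always emitted.
    """
    parts = [f"nqn={nqn}"]
    for key, value in requested.items():
        if key in supported:
            parts.append(f"{key}={value}")
    return ",".join(parts)


DISCOVERY_NQN = "nqn.2014-08.org.nvmexpress.discovery"
requested = {"transport": "tcp", "traddr": "192.168.5.165", "trsvcid": "8009"}

# Options read successfully (abbreviated template from the working machine).
advertised = parse_supported(
    "instance=-1,cntlid=-1,transport=%s,traddr=%s,trsvcid=%s,nqn=%s\n"
)
print(build_connect_string(DISCOVERY_NQN, requested, advertised))

# Read failed, so nothing is considered supported and every option is dropped.
print(build_connect_string(DISCOVERY_NQN, requested, set()))
```

Under these assumptions, the second call yields only the nqn parameter, which is what the kernel then rejects with the missing 'transport=%s' message.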
I can't find any code in upstream libnvme version 1.2 that tries to read the connect options from /dev/nvme-fabrics. My only explanation is that Red Hat has cherry-picked the change in https://github.com/linux-nvme/libnvme/pull/618, which breaks on kernels that don't report NVMe connect options (the RHEL 9.2 kernel apparently reports them, but the 9.0 kernel doesn't). I fixed this bug upstream in https://github.com/linux-nvme/libnvme/pull/643, but it looks like that fix hasn't been applied to this build.
Can you clarify whether this version of nvme-cli/libnvme is considered supported on RHEL 9.0? (I would assume so, since it was installed automatically when we requested to update the package.) If not, we can tell our customers to avoid updating to it, or to avoid RHEL 9 releases earlier than 9.2 entirely. If it is supposed to be supported, then it sounds like it may be necessary to cherry-pick https://github.com/linux-nvme/libnvme/pull/643 into your build as well.