RHEL-7529

Updated nvme-cli incompatible with RHEL 9.0 kernel

    • Bug
    • Resolution: Not a Bug
    • rhel-9.0.0
    • nvme-cli
    • Regression
    • rhel-sst-storage-io
    • ssg_filesystems_storage_and_HA
    • Red Hat Enterprise Linux
    • x86_64

      What were you trying to do that didn't work?

      Update nvme-cli to the latest version available on RHEL 9.0 and create an NVMe over Fabrics connection.

      Please provide the package NVR for which bug is seen:

      nvme-cli-2.2.1-4.el9_2.x86_64

      libnvme-1.2-3.el9_2.x86_64

      How reproducible:

      Consistently

      Steps to reproduce

      1. Use RHEL 9.0
      2. Update the nvme-cli package (dnf update nvme-cli)
      3. Try to create an NVMe over Fabrics connection (nvme discover, nvme connect, etc.)

      Expected results

      Connection should be successfully created, as it is with the version of nvme-cli originally installed (1.16):

      [root@init48-13 ~]# nvme discover --transport tcp --traddr 192.168.5.165
      Discovery Log Number of Records 8, Generation counter 505
      =====Discovery Log Entry 0======
      ...
      

      Actual results

      The kernel rejects the connect options nvme-cli/libnvme provides and refuses to connect the NVMe controller. This is the output from nvme-cli:

      [root@init48-13 ~]# nvme discover --transport tcp --traddr 192.168.5.165
      failed to add controller, error Unknown error -1

      The dmesg logs show the kernel is complaining about a missing transport parameter:

      [root@init48-13 ~]# dmesg | grep nvme
      [311921.660879] nvme_fabrics: missing parameter 'transport=%s'

      Tracing the system calls shows that nvme-cli fails to read from /dev/nvme-fabrics (to get the supported NVMe connect options) and then it writes a connect string missing the transport and traddr parameters:

      [root@init48-13 ~]# strace -s 1000 nvme discover --transport tcp --traddr 192.168.5.165
      ...
      openat(AT_FDCWD, "/dev/nvme-fabrics", O_RDONLY) = 803
      read(803, 0x7fffca3278a0, 4095)         = -1 EINVAL (Invalid argument)
      close(803)                              = 0
      openat(AT_FDCWD, "/dev/nvme-fabrics", O_RDWR) = 803
      write(803, "nqn=nqn.2014-08.org.nvmexpress.discovery", 40) = -1 EINVAL (Invalid argument)
      close(803)
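
      The failing read shown above can be checked for directly before attempting a connection. The following is a minimal Python sketch, not part of nvme-cli or libnvme; the function name probe_fabrics_options and its path parameter are hypothetical. It assumes that a kernel which cannot report its connect options fails the read with EINVAL, as in the strace output.

```python
import errno
import os


def probe_fabrics_options(path="/dev/nvme-fabrics"):
    """Return the option template the kernel reports, or None if the
    read fails with EINVAL (the behaviour seen in the failing strace)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        try:
            data = os.read(fd, 4095)
        except OSError as exc:
            if exc.errno == errno.EINVAL:
                return None
            raise
    finally:
        os.close(fd)
    return data.decode("ascii", errors="replace").strip()


if __name__ == "__main__":
    try:
        template = probe_fabrics_options()
    except OSError as exc:
        # For example: device node missing or insufficient privileges.
        print(f"cannot open /dev/nvme-fabrics: {exc}")
    else:
        if template is None:
            print("kernel does not report connect options")
        else:
            print(template)
```

      Run as root on the affected host, a None result would correspond to the EINVAL seen on the RHEL 9.0 kernel.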

      Here is the kernel version:

      [root@init48-13 ~]# uname -a
      Linux init48-13 5.14.0-70.22.1.el9_0.x86_64 #1 SMP PREEMPT Tue Aug 2 10:02:12 EDT 2022 x86_64 x86_64 x86_64 GNU/Linux
      [root@init48-13 ~]# cat /etc/redhat-release
      Red Hat Enterprise Linux release 9.0 (Plow)

      Notably the same version of the nvme-cli/libnvme package works fine on a different machine running RHEL 9.2:

      [root@init48-18 ~]# nvme version
      nvme version 2.2.1 (git 2.2.1)
      libnvme version 1.2 (git 1.2)
      [root@init48-18 ~]# uname -a
      Linux init48-18 5.14.0-284.25.1.el9_2.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Jul 20 09:11:28 EDT 2023 x86_64 x86_64 x86_64 GNU/Linux
      [root@init48-18 ~]# cat /etc/redhat-release
      Red Hat Enterprise Linux release 9.2 (Plow)
      [root@init48-18 ~]# nvme discover --transport tcp --traddr 192.168.5.165
      Discovery Log Number of Records 8, Generation counter 17
      =====Discovery Log Entry 0======
      ...
      

      The critical difference in the strace on that machine appears to be that reading the connection options succeeds. The connection string then has all the expected parameters:

      openat(AT_FDCWD, "/dev/nvme-fabrics", O_RDONLY) = 1603
      read(1603, "instance=-1,cntlid=-1,transport=%s,traddr=%s,trsvcid=%s,nqn=%s,queue_size=%d,nr_io_queues=%d,reconnect_delay=%d,ctrl_loss_tmo=%d,keep_alive_tmo=%d,hostnqn=%s,host_traddr=%s,host_iface=%s,hostid=%s,duplicate_connect,disable_sqflow,hdr_digest,data_digest,nr_write_queues=%d,nr_poll_queues=%d,tos=%d,fast_io_fail_tmo=%d,discovery,dhchap_secret=%s,dhchap_ctrl_secret=%s\n", 4095) = 366
      close(1603)                             = 0
      openat(AT_FDCWD, "/dev/nvme-fabrics", O_RDWR) = 1603
      write(1603, "nqn=nqn.2014-08.org.nvmexpress.discovery,transport=tcp,traddr=192.168.5.165,trsvcid=8009,hostnqn=nqn.2014-08.org.nvmexpress:uuid:4c4c4544-0047-3610-8048-c4c04f535731,hostid=dbb3a987-e7a6-4d39-9af3-f834b4e7c784,ctrl_loss_tmo=600", 227) = 227
      read(1603, "instance=8,cntlid=0\n", 4095) = 20
      close(1603)                             = 0
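
      To make the relationship between the two strings concrete, here is a small Python sketch that splits the option template read from /dev/nvme-fabrics into option names and checks that every key in a connect string is advertised. The helper names are hypothetical, and the template and connect string are abbreviated from the RHEL 9.2 strace output; this is an illustration, not how libnvme parses the template.

```python
def option_names(template):
    """Split 'transport=%s,discovery,...' into bare option names."""
    return {tok.split("=", 1)[0] for tok in template.strip().split(",") if tok}


def unsupported_keys(connect_string, template):
    """Return, in order, the connect-string keys absent from the template."""
    supported = option_names(template)
    keys = [tok.split("=", 1)[0] for tok in connect_string.split(",") if tok]
    return [key for key in keys if key not in supported]


# Abbreviated from the strace output on the RHEL 9.2 machine.
template = (
    "instance=-1,cntlid=-1,transport=%s,traddr=%s,trsvcid=%s,nqn=%s,"
    "hostnqn=%s,hostid=%s,ctrl_loss_tmo=%d,discovery\n"
)
connect = (
    "nqn=nqn.2014-08.org.nvmexpress.discovery,transport=tcp,"
    "traddr=192.168.5.165,trsvcid=8009,ctrl_loss_tmo=600"
)

print(sorted(option_names(template)))
print(unsupported_keys(connect, template))  # prints []
```

      An empty list means every parameter written to the device was one the kernel advertised.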

       

      I can't find any code in upstream libnvme version 1.2 that tries to read the connect options from /dev/nvme-fabrics. My only explanation is that Red Hat has cherry-picked the change in https://github.com/linux-nvme/libnvme/pull/618, which breaks on kernels that don't report NVMe connect options (the RHEL 9.2 kernel apparently does, but the 9.0 kernel doesn't). I fixed this bug upstream in https://github.com/linux-nvme/libnvme/pull/643, but it looks like that fix hasn't been applied.
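
      The suspected failure mode can be modelled in a few lines of Python. This is a simplified sketch, not the libnvme code or the change in either pull request: build_connect_string and DEFAULT_OPTIONS are hypothetical, the contents of the default set are an assumption, and treating nqn as always written is inferred only from the failing strace output.

```python
# Hypothetical default set: options assumed to be understood by kernels
# that cannot report their supported options. Not taken from libnvme.
DEFAULT_OPTIONS = frozenset(
    {"transport", "traddr", "trsvcid", "nqn", "hostnqn", "hostid", "ctrl_loss_tmo"}
)


def build_connect_string(args, supported, fallback=None):
    """Join key=value pairs, keeping only keys found in `supported`.

    `supported` is None when the kernel did not report its options.
    Without a fallback, nothing counts as supported; 'nqn' is always
    written, as the failing strace output suggests.
    """
    if supported is None:
        supported = fallback if fallback is not None else frozenset()
    return ",".join(
        f"{key}={value}"
        for key, value in args
        if key == "nqn" or key in supported
    )


args = [
    ("nqn", "nqn.2014-08.org.nvmexpress.discovery"),
    ("transport", "tcp"),
    ("traddr", "192.168.5.165"),
    ("trsvcid", "8009"),
]

# No fallback: transport and traddr are dropped, as in the failing strace.
print(build_connect_string(args, supported=None))
# With a fallback set: the full connect string is produced.
print(build_connect_string(args, supported=None, fallback=DEFAULT_OPTIONS))
```

      In this model, a failed read without a fallback yields a connect string containing only nqn, which matches the string the kernel rejected with "missing parameter 'transport=%s'".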

      Can you clarify whether this version of nvme-cli/libnvme is considered supported on RHEL 9.0? (I would assume so, since it was installed automatically when we requested to update the package.) If not, we can tell our customers to avoid updating to it, or to avoid RHEL 9 releases earlier than 9.2 entirely. If it is supposed to be supported, then it sounds like it may be necessary to cherry-pick https://github.com/linux-nvme/libnvme/pull/643 into your build as well.

              mlombard@redhat.com Maurizio Lombardi
              csander650 Caleb Sander (Inactive)
              Pure Storage Confidential Group
              Maurizio Lombardi
              Yi Zhang