RHEL / RHEL-8124

Request timeout issue with NVMe/TCP under high IO load

    • Type: Bug
    • Resolution: Obsolete
    • Priority: Normal
    • Fix Version: rhel-9.0.0
    • Component: libnvme
    • Severity: Moderate
    • Sub-system team: sst_storage_io
    • Pool team: ssg_platform_storage

      Description of problem:

      Request timeout issue with NVMe/TCP under high IO load

      Version-Release number of selected component (if applicable):

      RHEL 9.0 / 5.14.0-70.17.1.el9_0.x86_64

      How reproducible:

      Always

      Steps to Reproduce:
      1. Connect a RHEL 9 server to the storage product over the NVMe/TCP protocol.
      2. Use an SPDK-based NVMe target (as in the customer environment).
      3. Start a high IO load against the connected namespaces.
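      A hedged sketch of the reproduction, assuming nvme-cli and fio are installed and an NVMe/TCP target is already listening at the discovery address shown in the log below; the device name and fio parameters are illustrative, not from the report:

```shell
# Load the NVMe/TCP transport and connect to every subsystem advertised
# by the discovery controller (address taken from the dmesg excerpt).
modprobe nvme-tcp
nvme connect-all -t tcp -a 172.20.53.135 -s 4420

# Generate sustained high IO load against a connected namespace.
# /dev/nvme1n1 is illustrative; check `nvme list` for the actual device.
fio --name=highload --filename=/dev/nvme1n1 --direct=1 \
    --ioengine=libaio --rw=randrw --bs=4k --iodepth=64 \
    --numjobs=8 --time_based --runtime=3600 --group_reporting
```

      The workload parameters only need to keep all 24 I/O queues busy; any sustained mixed random workload at high queue depth should do.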

      Actual results:
      NVMe/TCP request timeouts occur; the host repeatedly enters error recovery and reconnects to the controller (see the log in Additional info).

      Expected results:
      IO should complete normally under high IO load, without request timeouts or reconnect cycles.

      Additional info:

      ~~~
      [ 121.459789] nvme nvme0: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 172.20.53.135:4420
      [ 121.460298] nvme nvme0: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
      [ 121.469848] nvme nvme0: creating 24 I/O queues.
      [ 121.708457] nvme nvme0: mapped 24/0/0 default/read/poll queues.
      [ 121.715034] nvme nvme0: new ctrl: NQN "nqn.2020-01.com.infinidat:20015-subsystem-19988", addr 172.XX.XX.XXX:4420
      [ 121.730466] nvme nvme1: creating 24 I/O queues.
      [ 121.968453] nvme nvme1: mapped 24/0/0 default/read/poll queues.
      [ 121.974964] nvme nvme1: new ctrl: NQN "nqn.2020-01.com.infinidat:20015-subsystem-19988", addr 172.XX.XX.XX:4420

      [ 2609.736192] nvme nvme0: queue 22: timeout request 0x38 type 4
      [ 2609.736201] nvme nvme0: starting error recovery
      [ 2609.737241] nvme nvme0: Reconnecting in 10 seconds...
      [ 2619.987225] nvme nvme0: creating 24 I/O queues.
      [ 2620.258490] nvme nvme0: mapped 24/0/0 default/read/poll queues.
      [ 2620.263160] nvme nvme0: Successfully reconnected (1 attempt)
      [ 2994.751516] nvme nvme1: queue 8: timeout request 0x6 type 4
      [ 2994.751525] nvme nvme1: starting error recovery
      [ 2994.752560] nvme nvme1: Reconnecting in 10 seconds...
      [ 3005.004597] nvme nvme1: creating 24 I/O queues.
      [ 3005.274160] nvme nvme1: mapped 24/0/0 default/read/poll queues.
      [ 3005.274290] debugfs: Directory 'hctx0' with parent '/' already present!
      [ 3005.274298] debugfs: Directory 'hctx1' with parent '/' already present!
      [ 3005.274302] debugfs: Directory 'hctx2' with parent '/' already present!
      [ 3005.274304] debugfs: Directory 'hctx3' with parent '/' already present!
      [ 3005.274307] debugfs: Directory 'hctx4' with parent '/' already present!
      [ 3005.274310] debugfs: Directory 'hctx5' with parent '/' already present!
      [ 3005.274312] debugfs: Directory 'hctx6' with parent '/' already present!
      [ 3005.274316] debugfs: Directory 'hctx7' with parent '/' already present!
      [ 3005.274319] debugfs: Directory 'hctx8' with parent '/' already present!
      [ 3005.274321] debugfs: Directory 'hctx9' with parent '/' already present!
      [ 3005.274324] debugfs: Directory 'hctx10' with parent '/' already present!
      [ 3005.274327] debugfs: Directory 'hctx11' with parent '/' already present!
      [ 3005.274330] debugfs: Directory 'hctx12' with parent '/' already present!
      [ 3005.274333] debugfs: Directory 'hctx13' with parent '/' already present!
      [ 3005.274336] debugfs: Directory 'hctx14' with parent '/' already present!
      [ 3005.274339] debugfs: Directory 'hctx15' with parent '/' already present!
      [ 3005.274342] debugfs: Directory 'hctx16' with parent '/' already present!
      [ 3005.274344] debugfs: Directory 'hctx17' with parent '/' already present!
      [ 3005.274348] debugfs: Directory 'hctx18' with parent '/' already present!
      [ 3005.274351] debugfs: Directory 'hctx19' with parent '/' already present!
      [ 3005.274354] debugfs: Directory 'hctx20' with parent '/' already present!
      [ 3005.274357] debugfs: Directory 'hctx21' with parent '/' already present!
      [ 3005.274360] debugfs: Directory 'hctx22' with parent '/' already present!
      [ 3005.274362] debugfs: Directory 'hctx23' with parent '/' already present!
      [ 3005.277796] nvme nvme1: Successfully reconnected (1 attempt)
      [ 3064.893442] nvme nvme0: queue 18: timeout request 0x37 type 4
      [ 3064.893451] nvme nvme0: starting error recovery
      [ 3064.894448] nvme nvme0: Reconnecting in 10 seconds...
      [ 3075.144327] nvme nvme0: creating 24 I/O queues.
      [ 3075.402443] nvme nvme0: mapped 24/0/0 default/read/poll queues.
      [ 3075.407243] nvme nvme0: Successfully reconnected (1 attempt)
      ..
      [16392.473564] nvme nvme0: starting error recovery
      [16392.474660] nvme nvme0: Reconnecting in 10 seconds...
      [16402.722441] nvme nvme0: creating 24 I/O queues.
      [16402.987520] nvme nvme0: mapped 24/0/0 default/read/poll queues.
      [16402.993473] nvme nvme0: Successfully reconnected (1 attempt)
      [17737.466878] nvme nvme1: queue 23: timeout request 0x8 type 4
      [17737.466886] nvme nvme1: starting error recovery
      [17737.467932] nvme nvme1: Reconnecting in 10 seconds...
      [17747.714022] nvme nvme1: creating 24 I/O queues.
      [17747.981289] nvme nvme1: mapped 24/0/0 default/read/poll queues.
      ..
      [44207.566992] debugfs: Directory 'hctx22' with parent '/' already present!
      [44207.566995] debugfs: Directory 'hctx23' with parent '/' already present!
      [44207.571479] nvme nvme1: Successfully reconnected (1 attempt)
      [44910.247638] nvme nvme0: queue 13: timeout request 0x5b type 4
      [44910.247646] nvme nvme0: starting error recovery
      [44910.248687] nvme nvme0: Reconnecting in 10 seconds...
      [44920.496424] nvme nvme0: creating 24 I/O queues.
      [44920.754444] nvme nvme0: mapped 24/0/0 default/read/poll queues.
      [44920.760197] nvme nvme0: Successfully reconnected (1 attempt)
      [49347.144379] nvme nvme1: queue 15: timeout request 0x52 type 4
      [49347.144386] nvme nvme1: starting error recovery
      [49347.145416] nvme nvme1: Reconnecting in 10 seconds...
      [49357.402221] nvme nvme1: creating 24 I/O queues.
      [49357.671423] nvme nvme1: mapped 24/0/0 default/read/poll queues.
      [49357.671527] debugfs: Directory 'hctx0' with parent '/' already present!
      ~~~
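      To gauge how often the host cycles through error recovery, the timeout and reconnect lines in a captured kernel log can be tallied with a small helper (a sketch; the function name and the "dmesg.log" path are mine, not from the report):

```shell
# count_nvme_events: tally NVMe request-timeout and successful-reconnect
# lines in a saved kernel log (e.g. the excerpt above written to a file).
count_nvme_events() {
    local log="$1"
    local timeouts reconnects
    timeouts=$(grep -c 'timeout request' "$log")
    reconnects=$(grep -c 'Successfully reconnected' "$log")
    echo "timeouts=$timeouts reconnects=$reconnects"
}

# Usage: count_nvme_events dmesg.log
```

      In the excerpt above the two counts track each other closely, i.e. every timeout is recovered by a reconnect rather than escalating to a dead controller.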
      This upstream patch set looks relevant, as the customer confirmed the issue occurs under high IO load:

      https://lore.kernel.org/all/20220519062617.39715-1-hare@suse.de/
      https://lore.kernel.org/all/20220519062617.39715-2-hare@suse.de/
      https://lore.kernel.org/all/20220519062617.39715-3-hare@suse.de/
      https://lore.kernel.org/all/20220519062617.39715-4-hare@suse.de/
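      Independent of the patch set, the host-side NVMe I/O timeout can be inspected via the `nvme_core` module parameter; raising it is at most a stopgap while the root cause is investigated (a mitigation suggestion of mine, not from the report):

```shell
# Current NVMe I/O timeout in seconds (the kernel default is 30).
cat /sys/module/nvme_core/parameters/io_timeout

# To change it persistently, set nvme_core.io_timeout=<seconds> on the
# kernel command line, or via an options line in /etc/modprobe.d/.
```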

            Assignee: John Meneghini (jmeneghi@redhat.com)
            Reporter: Ranjith ML (rhn-support-rmadhuso)
            QE contact: storage-qe
            Votes: 0
            Watchers: 8