  RHEL-61197

Slow multipath connection with Dell storage backend

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Normal
    • Affects Version: rhel-9.2.0
    • Severity: Moderate
    • SST: rhel-sst-logical-storage
    • Sub-System Group: ssg_filesystems_storage_and_HA
    • Product: Red Hat Enterprise Linux
    • Architecture: x86_64

      What were you trying to do that didn't work?

      The customer environment is OpenStack 17.1.3 running on RHEL 9.2 (it cannot be upgraded beyond that version).
      The storage unit is a Dell XT 480 connected over FC.
      Three servers are involved, each with two dual-port FC cards:
      Fibre Channel [0c04]: Emulex Corporation LPe31000/LPe32000 Series 16Gb/32Gb Fibre Channel Adapter [10df:e300] (rev 01)
      Subsystem: Emulex Corporation LPe31002-M6-D 2-Port 16Gb Fibre Channel Adapter [10df:e322]
      When one device is attached to a server we see 16 paths in multipath -ll.

      In this use case, an OpenStack service (Glance) pulls an image from the storage unit, downloads it locally onto the server, performs some operations on it, and then pushes it back to the storage unit through another service (Cinder).
      The setup works fine when images are processed one at a time.

      When doing multiple at once, for example 5, only 1 or maybe 2 out of the 5 complete; the rest fail on the OpenStack side because of a 2-minute timeout. The image we use is less than 2 GB, so it should not take 2 minutes to connect to the storage unit, download the image, process it, and upload it back.

      What ultimately helped was disabling one of the two FC cards on all 3 servers.
      This reduced the path count per device to 4.
      With that change, creating up to 10 volumes simultaneously works fine.

      Problem 1:
      From our analysis, it seems to take a long time to establish the connection from the server to the storage unit.

      Questions:
      Could it be that 16 paths per device is the problem here? When 10 volumes are created at once, that is 160 paths to handle in a very short time.
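
      As a side note (not part of the original analysis): with 2 dual-port HBAs per server there are 4 initiator ports, so 16 paths per LUN would imply 4 target ports reachable from each initiator port; that mapping is an assumption inferred from the numbers above. A sketch of how the counts can be cross-checked on a server:

      # FC initiator ports (one SCSI host per FC port)
      ls /sys/class/fc_host/
      # Remote target ports visible to the HBAs
      ls /sys/class/fc_remote_ports/ | wc -l
      # Total paths currently known to multipathd
      multipathd show paths | wc -l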

      Problem 2:
      After rebooting a server and creating a few volumes, we quickly see leftover paths in multipath -ll.
      The output of that command takes 30-60 seconds to appear: it probes many dead paths and waits for them to time out before printing anything.

      This is an example of the output after the server has been up for 22h:

      80823.742317 | sdk: prio = const (setting: emergency fallback - alua failed)
      80824.356142 | sde: prio = const (setting: emergency fallback - alua failed)
      80825.097672 | sdf: prio = const (setting: emergency fallback - alua failed)
      80825.652831 | sdj: prio = const (setting: emergency fallback - alua failed)
      350060160cce0201850060160cce02018 dm-24 DGC,VRAID
      size=2.0G features='0' hwhandler='0' wp=rw
      `-+- policy='round-robin 0' prio=0 status=enabled
        |- 8:0:6:52  sdk  8:160  failed faulty running
        |- 8:0:5:52  sde  8:64   failed faulty running
        |- 8:0:3:52  sdf  8:80   failed faulty running
        `- 8:0:4:52  sdj  8:144  failed faulty running
      360060160a5115f005508f066c5b646a8 dm-14 DGC,VRAID
      size=10G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
      |-+- policy='round-robin 0' prio=50 status=active
      | |- 8:0:3:0   sdt  65:48  active ready running
      | `- 8:0:4:0   sdx  65:112 active ready running
      `-+- policy='round-robin 0' prio=10 status=enabled
        |- 8:0:6:0   sdaf 65:240 active ready running
        `- 8:0:5:0   sdab 65:176 active ready running
      360060160a5115f008fbfe966a1fb3fee dm-15 DGC,VRAID
      size=5.0G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
      |-+- policy='round-robin 0' prio=50 status=active
      | |- 8:0:6:1   sdaj 66:48  active ready running
      | `- 8:0:5:1   sdad 65:208 active ready running
      `-+- policy='round-robin 0' prio=10 status=enabled
        |- 8:0:4:1   sdz  65:144 active ready running
        `- 8:0:3:1   sdv  65:80  active ready running
      360060160c8115f00fbc7e9662e435154 dm-16 DGC,VRAID
      size=10G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
      |-+- policy='round-robin 0' prio=50 status=active
      | |- 8:0:1:1   sdb  8:16   active ready running
      | `- 8:0:2:1   sdp  8:240  active ready running
      `-+- policy='round-robin 0' prio=10 status=enabled
        |- 8:0:0:1   sdi  8:128  active ready running
        `- 8:0:7:1   sdan 66:112 active ready running
      

      In theory, no volumes should remain attached after the pull/push, because the server does not use them afterwards.
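
      (The following is not from the troubleshooting session, just a hedged sketch of the manual cleanup that removes stale maps/paths like the failed 2.0G map shown above; the device names sdk/sde/sdf/sdj are taken from that example output and must be adapted.)

      # Flush all multipath maps that currently have no users (stale maps only)
      multipath -F
      # Or flush a single stale map by WWID, e.g. the failed map shown above
      multipath -f 350060160cce0201850060160cce02018
      # Then delete the underlying dead SCSI devices so they are no longer probed
      for dev in sdk sde sdf sdj; do
          echo 1 > /sys/block/$dev/device/delete
      done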

      In our logs we see multipath errors like these:
      2024-09-27 19:41:20.435 10 DEBUG os_brick.initiator.linuxfc [-] Could not get HBA channel and SCSI target ID, path: /sys/class/fc_transport/target11:*, reason: Unexpected error while running command.
      2024-09-27 19:41:20.476 10 DEBUG os_brick.initiator.linuxfc [-] Could not get HBA channel and SCSI target ID, path: /sys/class/fc_transport/target14:*, reason: Unexpected error while running command.
      2024-09-27 19:41:20.492 10 DEBUG os_brick.initiator.linuxfc [-] Could not get HBA channel and SCSI target ID, path: /sys/class/fc_transport/target14:*, reason: Unexpected error while running command.
      ...
      2024-09-27 19:41:33.618 10 DEBUG os_brick.initiator.linuxfc [-] Could not get HBA channel and SCSI target ID, path: /sys/class/fc_transport/target13:*, reason: Unexpected error while running command.
      2024-09-27 19:41:33.625 10 DEBUG os_brick.initiator.linuxfc [-] Could not get HBA channel and SCSI target ID, path: /sys/class/fc_transport/target11:*, reason: Unexpected error while running command.
      ...
      2024-09-27 19:41:37.546 10 ERROR os_brick.initiator.connectors.fibre_channel [-] Fibre Channel volume device not found.
      2024-09-27 19:41:37.546 10 ERROR oslo.service.loopingcall [-] Fixed interval looping call 'os_brick.initiator.connectors.fibre_channel.FibreChannelConnector.connect_volume.<locals>._wait_for_device_discovery' failed: os_brick.exception.NoFibreChannelVolumeDeviceFound: Unable to find a Fibre Channel volume device.
      2024-09-27 19:41:37.546 10 ERROR oslo.service.loopingcall Traceback (most recent call last):
      2024-09-27 19:41:37.546 10 ERROR oslo.service.loopingcall File "/usr/lib/python3.9/site-packages/oslo_service/loopingcall.py", line 150, in _run_loop
      2024-09-27 19:41:37.546 10 ERROR oslo.service.loopingcall result = func(*self.args, **self.kw)
      2024-09-27 19:41:37.546 10 ERROR oslo.service.loopingcall File "/usr/lib/python3.9/site-packages/os_brick/initiator/connectors/fibre_channel.py", line 233, in _wait_for_device_discovery
      2024-09-27 19:41:37.546 10 ERROR oslo.service.loopingcall raise exception.NoFibreChannelVolumeDeviceFound()
      2024-09-27 19:41:37.546 10 ERROR oslo.service.loopingcall os_brick.exception.NoFibreChannelVolumeDeviceFound: Unable to find a Fibre Channel volume device.
      2024-09-27 19:41:37.546 10 ERROR oslo.service.loopingcall
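
      These os-brick messages come from globbing /sys/class/fc_transport/targetN:*. As a hedged sketch (not part of the original data collection), the corresponding sysfs entries can be listed directly while the errors occur, to see which targets actually exist:

      # fc_transport targets that currently exist (what os-brick is globbing)
      ls -d /sys/class/fc_transport/target*/
      # Remote port WWPNs behind those targets
      cat /sys/class/fc_transport/target*/port_name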

      The multipath configuration looks correct for this storage backend.
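
      For reference, a hedged suggestion (not something already collected in this report): the effective settings multipathd applies to the DGC/VRAID devices can be dumped and compared against the vendor recommendations:

      # Effective merged configuration from the running daemon
      multipathd show config | grep -A 20 '"DGC"'
      # Or from the tools, without a running daemon
      multipath -t | grep -A 20 '"DGC"'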

      The request here is to help us identify what could be causing this slowness.

      This is an example of how much time it took to create a connection and download the image:

      Here we see the image took 87 seconds to download
      2024-09-25 23:42:28.547 30 DEBUG cinder.image.image_utils [req-09296d53-947c-4e92-93cd-29f3218e16b6 b70338f23547435eb290c9b0998ab0b6 76721a71bb484b2fb1753bb2a7bc290f - - -] Image fetch details: dest /var/lib/cinder/conversion/image_fetch
      _f5227146-7d23-485e-8a58-104745c8557a_q17zusf_hostgroup@unity_01_ssd, size 1872.56 MB, duration 87.11 sec fetch /usr/lib/python3.9/site-packages/cinder/image/image_utils.py:566

      That works out to a speed of about 21 MB/s:
      2024-09-25 23:42:28.548 30 INFO cinder.image.image_utils [req-09296d53-947c-4e92-93cd-29f3218e16b6 b70338f23547435eb290c9b0998ab0b6 76721a71bb484b2fb1753bb2a7bc290f - - -] Image download 1872.56 MB at 21.50 MB/s
      From what we understand, this duration includes the connection/attach time; the actual file transfer itself is fast.
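
      To separate attach/connection time from raw transfer speed, a hedged sketch (not run during the session) is to time a direct read from one of the attached multipath devices; dm-14 is taken from the example output above and should be replaced with a test volume:

      # Read 2 GiB directly from the multipath device, bypassing the page cache
      time dd if=/dev/dm-14 of=/dev/null bs=1M count=2048 iflag=direct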

      We need your help to see where the problem could be.

      Please reach out to me for questions.
      I tried to summarize many hours of troubleshooting in a remote session.

      Thank you.

      What is the impact of this issue to you?

      Prevents creating multiple volumes at the same time.

      Please provide the package NVR for which the bug is seen:

      Multipath runs in a container on those servers.
      Package version inside the container image:
      device-mapper-multipath-0.8.7-20.el9.x86_64
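
      A hedged example of how the version can be confirmed on a node (the container name multipathd is an assumption and may differ in this deployment):

      podman ps --filter name=multipath
      podman exec multipathd rpm -q device-mapper-multipath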

      How reproducible is this bug?:

      Try to create multiple volumes at the same time.

      Steps to reproduce

      1. for i in {1..10}; do openstack volume create --image $uuid --size $size name_$i & done
         (a variant that waits for completion is sketched below)
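
      As referenced above, a hedged variant of the same reproducer that waits for all background creations and then shows the resulting volume states:

      for i in {1..10}; do
          openstack volume create --image "$uuid" --size "$size" "name_$i" &
      done
      wait
      openstack volume list --long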

      Expected results

      Connection to the storage unit is faster.
      Multipath paths are removed properly when volumes are detached.

      Actual results

      Sometimes a volume fails to be created.
      Dead paths are left behind and slow down multipath commands.

        Benjamin Marzinski (rhn-engineering-bmarzins)
        Gregoire Grimaux (rhn-support-ggrimaux)
        Benjamin Marzinski
        Lin Li