- Bug
- Resolution: Done-Errata
- Critical
- rhos-18.0.0
- 8
- False
- False
- ?
- python-os-brick-6.2.5-18.0.20250526195219.7ad0ed6.el9ost
- None
- Enhancement
- Done
- Critical
The goal is to speed up detaches as proposed in this upstream patch: FC: Avoid calling ``lsscsi`` during disconnect | https://review.opendev.org/c/openstack/os-brick/+/943123
The task is to get the patch merged and then backport it to 18 -> 17.1 -> 16.2.
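For orientation, below is a minimal sketch of the general idea behind that change: resolving a SCSI H:C:T:L address to its block device by reading sysfs directly instead of spawning the external ``lsscsi`` command for every device during disconnect. The helper name and exact sysfs path are illustrative assumptions, not the actual patch.
--------------------
# Illustrative sketch only; see the upstream review linked above for the
# real change. Assumes the standard Linux sysfs layout
# /sys/bus/scsi/devices/<H:C:T:L>/block/<device>.
import glob
import os


def block_devices_for_hctl(hctl):
    """Return block device names (e.g. ['sdx']) for an 'H:C:T:L' string,
    without shelling out to ``lsscsi``."""
    pattern = '/sys/bus/scsi/devices/%s/block/*' % hctl
    return [os.path.basename(p) for p in glob.glob(pattern)]


# Example (device names will differ per host):
#   block_devices_for_hctl('3:0:1:42')  ->  ['sdaq']
--------------------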
Original description follows:
Issue:
-------------------------------------------
The issue occurs in partner Fujitsu's end customer's environment.
When the customer rebooted a node, it took about 2 hours to complete ("complete" meaning all 11 instances on the node are running fine again). (*1)
Each instance has many volumes attached (at most 137 volumes on a single instance).
When the issue occurred, there were 11 instances on this node with approximately 500 attached volumes in total.
Moreover, they stopped/started one instance once on May 22 (*2); it took 20 minutes to complete.
More recently, on Dec 28 (*1), they rebooted the node because of a HW issue, and it took about 2 hours until all 11 instances were running again.
$ nova instance-action-list YGC6WINTRA12X
+---------------+------------------------------------------+---------+----------------------------+----------------------------+
| Action        | Request_ID                               | Message | Start_Time                 | Updated_At                 |
+---------------+------------------------------------------+---------+----------------------------+----------------------------+
| create        | req-26af493c-bb3b-4472-9bde-aecb52b9885b | -       | 2024-02-22T04:13:14.000000 | 2024-02-22T04:13:53.000000 |
| attach_volume | req-d57a65ba-aa6e-4a52-993d-f2c2f685f95b | -       | 2024-02-22T08:08:58.000000 | 2024-02-22T08:09:02.000000 |
| attach_volume | req-8a07fb56-7cb5-4c73-9896-c7fa4b56bf39 | -       | 2024-02-22T08:09:01.000000 | 2024-02-22T08:09:05.000000 |
... there are about 130 volumes attached.
| attach_volume | req-2430dec1-7ac0-42e0-b5d2-0d847bcddacc | -       | 2024-03-13T08:42:22.000000 | 2024-03-13T08:42:27.000000 |
| stop          | req-4528b893-3578-418a-819e-268c98c804d9 | -       | 2024-03-15T05:00:31.000000 | 2024-03-15T05:00:32.000000 |
| start         | req-574cb633-7c3a-4f6b-9aab-cd2a0beedbed | -       | 2024-03-15T05:07:24.000000 | 2024-03-15T05:21:40.000000 |
| stop          | req-1286a969-e005-469e-8415-9007598d0a1d | -       | 2024-05-22T07:07:06.000000 | 2024-05-22T07:07:07.000000 |
| start         | req-7f85510a-31f6-44ec-827d-e020e5d52bed | -       | 2024-05-22T07:10:23.000000 | 2024-05-22T07:32:30.000000 |
| reboot        | req-256c4ea4-55ae-4b02-bb6e-2674dfc3b59d | -       | 2024-12-28T02:37:12.000000 | 2024-12-28T03:49:10.000000 | (*2)
| reboot        | req-29065ae4-fc80-432a-a29d-fbe931e07466 | -       | 2024-12-28T13:38:43.000000 | 2024-12-28T15:26:57.000000 | (*1)
+---------------+------------------------------------------+---------+----------------------------+----------------------------+
From Fujitsu's review of nova-compute.log, it appears that the disconnection and reconnection of each volume were serialized by locks, which extended the overall reboot time.
$ less nova-compute.log
...
2024-12-29 00:00:05.439 7 INFO nova.compute.manager [req-54654cd0-c87a-40ec-8bcd-8ef39baa47f0 - - - - -] Running instance usage audit for host wdcb120e.ygcloud-area6.osp from 2024-12-28 14:00:00 to 2024-12-28 15:00:00. 14 instances.
2024-12-29 00:00:09.514 7 INFO os_brick.initiator.linuxscsi [req-29065ae4-fc80-432a-a29d-fbe931e07466 22836b00260444f6b8f9bc3689853007 dbff8a709f8b43ffb769046e967f07a8 - 628295187cdf47f188de3ca859626380 628295187cdf47f188de3ca859626380] Find Multipath device file for volume WWN 3600000e00d32000000324e2800420000
2024-12-29 00:00:17.359 7 INFO os_brick.initiator.linuxscsi [req-301afaed-3305-41a3-a78d-739eb611e54f 22836b00260444f6b8f9bc3689853007 6475c86cb73f411fa8b52b68488f1ad6 - 628295187cdf47f188de3ca859626380 628295187cdf47f188de3ca859626380] Find Multipath device file for volume WWN 3600000e00d3000000030091007a70000
2024-12-29 00:00:18.590 7 INFO os_brick.initiator.linuxscsi [req-b039d858-1e3f-4b4f-bf3c-4285b0ef245a 22836b00260444f6b8f9bc3689853007 0e938a1fe2cb46298aff036c28fc15c4 - 628295187cdf47f188de3ca859626380 628295187cdf47f188de3ca859626380] Find Multipath device file for volume WWN 3600000e00d3000000030091005910000
2024-12-29 00:00:26.685 7 INFO os_brick.initiator.linuxscsi [req-29065ae4-fc80-432a-a29d-fbe931e07466 22836b00260444f6b8f9bc3689853007 dbff8a709f8b43ffb769046e967f07a8 - 628295187cdf47f188de3ca859626380 628295187cdf47f188de3ca859626380] Find Multipath device file for volume WWN 3600000e00d32000000324e2800430000
The same WWN is logged twice within a single reboot: once for the disconnection and once for the reconnection.
According to them, each attach/detach operation took about 8 seconds, so the entire reboot process took more than 1 hour (500 connections x 8 secs x 2 for detach and re-attach; see the quick check below).
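A back-of-the-envelope check of that estimate, using only the figures quoted above (purely illustrative):
--------------------
# Rough check of the serialized reboot time using the numbers from the
# description above; this is illustrative, not a measurement.
connections = 500        # volume connections on the compute node
seconds_per_op = 8       # observed time per attach or detach
ops_per_connection = 2   # one detach plus one re-attach per reboot

total_seconds = connections * seconds_per_op * ops_per_connection
print(total_seconds, total_seconds / 3600.0)   # 8000 seconds, ~2.2 hours
--------------------
That lines up with the roughly two-hour reboot times seen in the action list above.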
Regarding the disconnection and reconnection, it appears that the following code paths both acquire the same lock:
/usr/lib/python3.6/site-packages/os_brick/initiator/connectors/fibre_channel.py
--------------------
@utils.trace
@synchronized('connect_volume', external=True)
def connect_volume(self, connection_properties):
    ...
    # The /dev/disk/by-path/... node is not always present immediately
    # We only need to find the first device. Once we see the first device
    # multipath will have any others.
    def _wait_for_device_discovery(host_devices):
        ...
    timer = loopingcall.FixedIntervalLoopingCall(
        _wait_for_device_discovery, host_devices)
    timer.start(interval=2).wait()

    LOG.debug("Found Fibre Channel volume %(name)s "
              "(after %(tries)s rescans.)",
              {'name': self.device_name, 'tries': self.tries})
    ...
--------------------
--------------------
@utils.trace
@synchronized('connect_volume', external=True)
def disconnect_volume(self, connection_properties, device_info,
                      force=False, ignore_errors=False):
    ...
    LOG.debug("devices to remove = %s", devices)
    self._remove_devices(connection_properties, devices, device_info,
                         force, exc)
    ...
--------------------
It seems that serialization due to lock processing is the cause.
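To illustrate why this serializes everything, here is a minimal standalone sketch (not the os-brick code itself) of how an oslo.concurrency external lock with a shared name forces such calls to run one at a time; the lock_path, sleep durations, and function bodies are assumptions for this example only.
--------------------
# Minimal standalone sketch, assuming oslo.concurrency is installed.
# Both functions take the same external lock name 'connect_volume',
# mirroring the decorators shown above, so concurrent callers are
# serialized even across processes.
import time

from oslo_concurrency import lockutils


@lockutils.synchronized('connect_volume', external=True, lock_path='/tmp')
def connect_volume(volume_id):
    time.sleep(8)   # stand-in for rescans / multipath lookups
    print('connected', volume_id)


@lockutils.synchronized('connect_volume', external=True, lock_path='/tmp')
def disconnect_volume(volume_id):
    time.sleep(8)   # stand-in for flushes / device removal
    print('disconnected', volume_id)

# Even if hundreds of these calls are issued concurrently (e.g. while many
# instances reboot at once), the shared external file lock lets only one
# proceed at a time, so total time grows linearly with the number of volumes.
--------------------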
According to Fujitsu, the long reboot time has a significant impact on them.
They therefore ask us to parallelize the attach and detach procedures to shorten the overall reboot time.
If that is hard to fix, they request a workaround instead.
Version-Release number of selected component:
Red Hat OpenStack Platform Version Number: 17
Release Number: 17.1.3
Related Package Version: none
Related Middleware / Application: none
Underlying RHEL Release Number: 9.4
Underlying RHEL Architecture: x86_64
Drivers or hardware or architecture dependency:
- Volume backends with Fibre-channel
- 500 volume connections on a compute node
How reproducible:
Always.
The problem is more pronounced when multiple instances are hard-rebooted simultaneously.
Steps to Reproduce:
Use the following command to reboot multiple instances on the same Compute node:
$ openstack server reboot --hard <instance id>
Actual Results:
Instance boot time exceeded one hour.
Expected Results:
All instances are rebooted in a few minutes.
---------------------------------------
Business Impact:
The customer is a flagship company in the Japanese transport service industry.
The system will be the platform that provides services for 1.3 million companies in Japan.
The customer's workload is blocked during reboots, so reboot time directly translates into downtime; longer reboots mean longer downtime.
- links to: RHBA-2025:152056 Release of components for RHOSO 18.0
- mentioned on