- Bug
- Resolution: Done-Errata
- Critical
- rhos-18.0.0
- 8
- False
- False
- ?
- python-os-brick-6.2.5-18.0.20250526195219.7ad0ed6.el9ost
- None
- Enhancement
- Done
- Critical
The goal is to speed up detaches as proposed in this upstream patch: FC: Avoid calling ``lsscsi`` during disconnect | https://review.opendev.org/c/openstack/os-brick/+/943123
The task is to get the patch merged and then backport it to 18 -> 17.1 -> 16.2.
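For orientation, below is a minimal sketch of the general idea behind that change: resolving a SCSI H:C:T:L address to its block device by reading sysfs directly instead of spawning the external ``lsscsi`` command for every device during disconnect. The helper name and exact sysfs path are illustrative assumptions, not the actual patch.
--------------------
# Illustrative sketch only; see the upstream review linked above for the
# real change. Assumes the standard Linux sysfs layout
# /sys/bus/scsi/devices/<H:C:T:L>/block/<device>.
import glob
import os


def block_devices_for_hctl(hctl):
    """Return block device names (e.g. ['sdx']) for an 'H:C:T:L' string,
    without shelling out to ``lsscsi``."""
    pattern = '/sys/bus/scsi/devices/%s/block/*' % hctl
    return [os.path.basename(p) for p in glob.glob(pattern)]


# Example (device names will differ per host):
#   block_devices_for_hctl('3:0:1:42')  ->  ['sdaq']
--------------------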
Original description follows:
Issue:
-------------------------------------------
The issue occurs in partner Fujitsu's end customer's environment.
When the customer rebooted a node, it took about 2 hours to complete ("complete" meaning all 11 instances on the node are running fine again). (*1)
Each instance has many volumes attached (at most 137 volumes on a single instance).
When the issue occurred, there were 11 instances on this node with approximately 500 attached volumes in total.
Moreover, they stopped/started one instance once on May 22 (*2); it took 20 minutes to complete.
More recently, on Dec 28 (*1), they rebooted the node because of a HW issue, and it took about 2 hours until all 11 instances were running again.
$ nova instance-action-list YGC6WINTRA12X
+---------------+------------------------------------------+---------+----------------------------+----------------------------+
| Action        | Request_ID                               | Message | Start_Time                 | Updated_At                 |
+---------------+------------------------------------------+---------+----------------------------+----------------------------+
| create        | req-26af493c-bb3b-4472-9bde-aecb52b9885b | -       | 2024-02-22T04:13:14.000000 | 2024-02-22T04:13:53.000000 |
| attach_volume | req-d57a65ba-aa6e-4a52-993d-f2c2f685f95b | -       | 2024-02-22T08:08:58.000000 | 2024-02-22T08:09:02.000000 |
| attach_volume | req-8a07fb56-7cb5-4c73-9896-c7fa4b56bf39 | -       | 2024-02-22T08:09:01.000000 | 2024-02-22T08:09:05.000000 |
... there are about 130 volumes attached.
| attach_volume | req-2430dec1-7ac0-42e0-b5d2-0d847bcddacc | -       | 2024-03-13T08:42:22.000000 | 2024-03-13T08:42:27.000000 |
| stop          | req-4528b893-3578-418a-819e-268c98c804d9 | -       | 2024-03-15T05:00:31.000000 | 2024-03-15T05:00:32.000000 |
| start         | req-574cb633-7c3a-4f6b-9aab-cd2a0beedbed | -       | 2024-03-15T05:07:24.000000 | 2024-03-15T05:21:40.000000 |
| stop          | req-1286a969-e005-469e-8415-9007598d0a1d | -       | 2024-05-22T07:07:06.000000 | 2024-05-22T07:07:07.000000 |
| start         | req-7f85510a-31f6-44ec-827d-e020e5d52bed | -       | 2024-05-22T07:10:23.000000 | 2024-05-22T07:32:30.000000 |
| reboot        | req-256c4ea4-55ae-4b02-bb6e-2674dfc3b59d | -       | 2024-12-28T02:37:12.000000 | 2024-12-28T03:49:10.000000 | (*2)
| reboot        | req-29065ae4-fc80-432a-a29d-fbe931e07466 | -       | 2024-12-28T13:38:43.000000 | 2024-12-28T15:26:57.000000 | (*1)
+---------------+------------------------------------------+---------+----------------------------+----------------------------+
From Fujitsu's review of nova-compute.log, it appears that the disconnection and reconnection of each volume were serialized by locks, which extended the overall reboot time.
$ less nova-compute.log
...
2024-12-29 00:00:05.439 7 INFO nova.compute.manager [req-54654cd0-c87a-40ec-8bcd-8ef39baa47f0 - - - - -] Running instance usage audit for host wdcb120e.ygcloud-area6.osp from 2024-12-28 14:00:00 to 2024-12-28 15:00:00. 14 instances.
2024-12-29 00:00:09.514 7 INFO os_brick.initiator.linuxscsi [req-29065ae4-fc80-432a-a29d-fbe931e07466 22836b00260444f6b8f9bc3689853007 dbff8a709f8b43ffb769046e967f07a8 - 628295187cdf47f188de3ca859626380 628295187cdf47f188de3ca859626380] Find Multipath device file for volume WWN 3600000e00d32000000324e2800420000
2024-12-29 00:00:17.359 7 INFO os_brick.initiator.linuxscsi [req-301afaed-3305-41a3-a78d-739eb611e54f 22836b00260444f6b8f9bc3689853007 6475c86cb73f411fa8b52b68488f1ad6 - 628295187cdf47f188de3ca859626380 628295187cdf47f188de3ca859626380] Find Multipath device file for volume WWN 3600000e00d3000000030091007a70000
2024-12-29 00:00:18.590 7 INFO os_brick.initiator.linuxscsi [req-b039d858-1e3f-4b4f-bf3c-4285b0ef245a 22836b00260444f6b8f9bc3689853007 0e938a1fe2cb46298aff036c28fc15c4 - 628295187cdf47f188de3ca859626380 628295187cdf47f188de3ca859626380] Find Multipath device file for volume WWN 3600000e00d3000000030091005910000
2024-12-29 00:00:26.685 7 INFO os_brick.initiator.linuxscsi [req-29065ae4-fc80-432a-a29d-fbe931e07466 22836b00260444f6b8f9bc3689853007 dbff8a709f8b43ffb769046e967f07a8 - 628295187cdf47f188de3ca859626380 628295187cdf47f188de3ca859626380] Find Multipath device file for volume WWN 3600000e00d32000000324e2800430000
The same WWN is logged twice within a single reboot: once for the disconnection and once for the reconnection.
According to them, each attach/detach operation took about 8 seconds, so the entire reboot process took more than 1 hour (500 connections x 8 secs x 2 for detach and re-attach; see the quick check below).
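A back-of-the-envelope check of that estimate, using only the figures quoted above (purely illustrative):
--------------------
# Rough check of the serialized reboot time using the numbers from the
# description above; this is illustrative, not a measurement.
connections = 500        # volume connections on the compute node
seconds_per_op = 8       # observed time per attach or detach
ops_per_connection = 2   # one detach plus one re-attach per reboot

total_seconds = connections * seconds_per_op * ops_per_connection
print(total_seconds, total_seconds / 3600.0)   # 8000 seconds, ~2.2 hours
--------------------
That lines up with the roughly two-hour reboot times seen in the action list above.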
Regarding the disconnection and reconnection, it appears that the following code paths both acquire the same lock:
/usr/lib/python3.6/site-packages/os_brick/initiator/connectors/fibre_channel.py
--------------------
@utils.trace
@synchronized('connect_volume', external=True)
def connect_volume(self, connection_properties):
    ...
    # The /dev/disk/by-path/... node is not always present immediately
    # We only need to find the first device. Once we see the first device
    # multipath will have any others.
    def _wait_for_device_discovery(host_devices):
        ...
    timer = loopingcall.FixedIntervalLoopingCall(
        _wait_for_device_discovery, host_devices)
    timer.start(interval=2).wait()

    LOG.debug("Found Fibre Channel volume %(name)s "
              "(after %(tries)s rescans.)",
              {'name': self.device_name, 'tries': self.tries})
    ...
--------------------
--------------------
@utils.trace
@synchronized('connect_volume', external=True)
def disconnect_volume(self, connection_properties, device_info,
                      force=False, ignore_errors=False):
    ...
    LOG.debug("devices to remove = %s", devices)
    self._remove_devices(connection_properties, devices, device_info,
                         force, exc)
    ...
--------------------
It seems that serialization due to lock processing is the cause.
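To illustrate why this serializes everything, here is a minimal standalone sketch (not the os-brick code itself) of how an oslo.concurrency external lock with a shared name forces such calls to run one at a time; the lock_path, sleep durations, and function bodies are assumptions for this example only.
--------------------
# Minimal standalone sketch, assuming oslo.concurrency is installed.
# Both functions take the same external lock name 'connect_volume',
# mirroring the decorators shown above, so concurrent callers are
# serialized even across processes.
import time

from oslo_concurrency import lockutils


@lockutils.synchronized('connect_volume', external=True, lock_path='/tmp')
def connect_volume(volume_id):
    time.sleep(8)   # stand-in for rescans / multipath lookups
    print('connected', volume_id)


@lockutils.synchronized('connect_volume', external=True, lock_path='/tmp')
def disconnect_volume(volume_id):
    time.sleep(8)   # stand-in for flushes / device removal
    print('disconnected', volume_id)

# Even if hundreds of these calls are issued concurrently (e.g. while many
# instances reboot at once), the shared external file lock lets only one
# proceed at a time, so total time grows linearly with the number of volumes.
--------------------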
According to Fujitsu, the long reboot time has a significant impact on them.
They therefore ask us to parallelize the attach and detach procedures to shorten the overall reboot time.
If that is hard to fix, they request a workaround instead.
Version-Release number of selected component:
Red Hat OpenStack Platform Version Number: 17
Release Number: 17.1.3
Related Package Version: none
Related Middleware / Application: none
Underlying RHEL Release Number: 9.4
Underlying RHEL Architecture: x86_64
Drivers or hardware or architecture dependency:
- Volume backends with Fibre-channel
- 500 volume connections on a compute node
How reproducible:
Always.
The problem is more pronounced when multiple instances are hard-rebooted simultaneously.
Steps to Reproduce:
Use the following command to reboot multiple instances on the same Compute node:
$ openstack server reboot --hard <instance id>
Actual Results:
Instance boot time exceeded one hour.
Expected Results:
All instances are rebooted in a few minutes.
---------------------------------------
Business Impact:
The customer is a flagship company in the Japanese transport service industry.
The system will be the platform that provides services for 1.3 million companies in Japan.
The customer's workload is blocked during reboots, so reboot time directly translates into downtime; longer reboots mean longer downtime.
- links to: RHBA-2025:152056 Release of components for RHOSO 18.0
- mentioned on