Red Hat OpenStack Services on OpenShift / OSPRH-13275

BZ#2274196 Cinder volume attachment failing randomly with dbdeadlock error


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Normal
    • Component: openstack-cinder

      Description of problem:
      Heat stack creation is failing seemingly at random during server creation. Specifically, volume creation succeeds, but nova fails while waiting for a reply from cinder for the volume attachment.

      We have tried several different combinations of nova/cinder/httpd/mysql timeouts and retries, which have resulted in slight variations of the issue, but each time it seems to boil down to a DBDeadlock timeout from cinder.
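
      For context, oslo.db ships a deadlock-retry decorator that Cinder-style DB APIs can wrap their writes in. A minimal sketch is below (the function name and retry values are illustrative, not taken from this deployment); note that a retry which hits the same lock still blocks for up to innodb_lock_wait_timeout, so retries alone do not prevent the HTTP-level timeout seen further down.

      from oslo_db import api as oslo_db_api

      # Illustrative sketch: wrap_db_retry re-runs the wrapped call whenever
      # oslo.db raises DBDeadlock -- which is also how pymysql error 1205
      # ("Lock wait timeout exceeded") surfaces once oslo.db translates it.
      @oslo_db_api.wrap_db_retry(max_retries=5, retry_on_deadlock=True)
      def attachment_insert(context, values):
          # Placeholder body: the real Cinder DB API issues the
          # INSERT INTO volume_attachment seen in the log below.
          pass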

      Most recently it looks to be hitting the 300s innodb_lock_wait_timeout that we set in galera.cnf (the setting itself is shown after the excerpt below):

      2024-04-08 00:03:14.558 40 INFO cinder.api.openstack.wsgi [req-42b35494-5611-4ae6-8cfb-3f0990e8dd2b ca0aa87bb5d247ae8a122230c4883414 2c4a0d38096d4aa7a1de13bd0faa319f - default default] POST http://<ip>:8776/v3/2c4a0d38096d4aa7a1de13bd0faa319f/attachments

      2024-04-08 00:08:15.615 40 ERROR cinder.api.v3.attachments [req-42b35494-5611-4ae6-8cfb-3f0990e8dd2b ca0aa87bb5d247ae8a122230c4883414 2c4a0d38096d4aa7a1de13bd0faa319f - default default] Unable to create attachment for volume.: oslo_db.exception.DBDeadlock: (pymysql.err.OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction')
      [SQL: INSERT INTO volume_attachment (created_at, updated_at, deleted_at, deleted, id, volume_id, instance_uuid, attached_host, mountpoint, attach_time, detach_time, attach_status, attach_mode, connection_info, connector) VALUES (%(created_at)s, %(updated_at)s, %(deleted_at)s, %(deleted)s, %(id)s, %(volume_id)s, %(instance_uuid)s, %(attached_host)s, %(mountpoint)s, %(attach_time)s, %(detach_time)s, %(attach_status)s, %(attach_mode)s, %(connection_info)s, %(connector)s)]
      [parameters: {'created_at': datetime.datetime(2024, 4, 8, 0, 3, 14, 598156), 'updated_at': None, 'deleted_at': None, 'deleted': 0, 'id': 'adb19bec-f6e7-4d45-8c1a-c0804887eac7', 'volume_id': '0ddeb9f5-f1e9-4ee8-afe8-ba5ffabf28f3', 'instance_uuid': '100c8675-12c9-4d40-b3d5-29f73d191bf9', 'attached_host': None, 'mountpoint': None, 'attach_time': None, 'detach_time': None, 'attach_status': 'reserved', 'attach_mode': None, 'connection_info': None, 'connector': None}]
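
      The roughly five-minute gap between the POST at 00:03:14 and the error at 00:08:15 above is consistent with that 300s lock wait timeout. For reference, the corresponding galera.cnf entry would look something like the following; the [mysqld] section name is the usual location for InnoDB settings, and the exact layout in this deployment may differ:

      [mysqld]
      innodb_lock_wait_timeout = 300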

      This is returning a 504 to nova:

      2024-04-08 00:08:14.655 20 ERROR nova.volume.cinder [req-c1ec1a1e-be35-4c42-a489-99f72bae785c ca0aa87bb5d247ae8a122230c4883414 2c4a0d38096d4aa7a1de13bd0faa319f - default default] [instance: 100c8675-12c9-4d40-b3d5-29f73d191bf9] Create attachment failed for volume 0ddeb9f5-f1e9-4ee8-afe8-ba5ffabf28f3. Error: Gateway Timeout (HTTP 504) Code: 504: cinderclient.exceptions.ClientException: Gateway Timeout (HTTP 504)
      2024-04-08 00:08:14.656 20 INFO nova.compute.api [req-c1ec1a1e-be35-4c42-a489-99f72bae785c ca0aa87bb5d247ae8a122230c4883414 2c4a0d38096d4aa7a1de13bd0faa319f - default default] Failed validating volume 0ddeb9f5-f1e9-4ee8-afe8-ba5ffabf28f3. Error: Gateway Timeout (HTTP 504)
      2024-04-08 00:08:14.656 20 INFO nova.api.openstack.wsgi [req-c1ec1a1e-be35-4c42-a489-99f72bae785c ca0aa87bb5d247ae8a122230c4883414 2c4a0d38096d4aa7a1de13bd0faa319f - default default] HTTP exception thrown: Block Device Mapping is Invalid: failed to get volume 0ddeb9f5-f1e9-4ee8-afe8-ba5ffabf28f3.
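
      To narrow down which transaction is holding the lock while one of these attachment requests is stuck, it may help to query information_schema.INNODB_TRX on the active Galera node during the failure window. A rough sketch is below; the connection details are placeholders, not values from this environment:

      #!/usr/bin/env python3
      # Hypothetical diagnostic helper: list open InnoDB transactions so the
      # lock holder blocking the volume_attachment INSERT can be identified.
      import pymysql

      conn = pymysql.connect(host='127.0.0.1', user='root', password='***',
                             database='information_schema')
      try:
          with conn.cursor() as cur:
              cur.execute(
                  "SELECT trx_id, trx_state, trx_started, trx_rows_locked, "
                  "trx_query FROM INNODB_TRX ORDER BY trx_started")
              for row in cur.fetchall():
                  print(row)
      finally:
          conn.close()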

      Version-Release number of selected component (if applicable):

      How reproducible:
      Randomly on weekdays, but it seems to be most frequent during a one-hour window between 00:00 and 01:00 UTC, when test stacks are created.

      Steps to Reproduce:
      1. Create heat stacks with boot-from-volume instances.

      Actual results:
      Most stacks are created successfully, but some fail, even though all use the same template.

      Expected results:
      All stacks are created successfully.

      Additional info:
      Previously, we were seeing a DBDeadlock timeout during volume creation, but that does not seem to be the case this time. I'm not sure whether this is due to the modified timeout values or whether the deadlock issue can affect multiple different operations at random.

      2024-03-19 00:34:36.120 76 ERROR oslo_messaging.rpc.server oslo_db.exception.DBDeadlock: (pymysql.err.OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction')
      2024-03-19 00:34:36.120 76 ERROR oslo_messaging.rpc.server [SQL: INSERT INTO volume_glance_metadata (created_at, updated_at, deleted_at, deleted, volume_id, snapshot_id, `key`, value) VALUES (%(created_at)s, %(updated_at)s, %(deleted_at)s, %(deleted)s, %(volume_id)s, %(snapshot_id)s, %(key)s, %(value)s)]
      2024-03-19 00:34:36.120 76 ERROR oslo_messaging.rpc.server [parameters: {'created_at': datetime.datetime(2024, 3, 19, 0, 33, 42, 644791), 'updated_at': None, 'deleted_at': None, 'deleted': 0, 'volume_id': '0c91f47f-7958-4f70-98cc-20f7d70dda16', 'snapshot_id': None, 'key': 'hw_pointer_model', 'value': 'usbtablet'}]
