Red Hat OpenStack Services on OpenShift / OSPRH-13275

BZ#2274196 Cinder volume attachment failing randomly with dbdeadlock error


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Normal
    • Component: openstack-cinder

      Description of problem:
      Heat stack creation is failing seemingly at random during server creation. Specifically, volume creation succeeds, but nova fails while waiting for a reply from cinder for the volume attachment.

      We have tried several different combinations of nova/cinder/httpd/mysql timeouts and retries, which have resulted in slight variations of the issue, but each time it seems to boil down to a DBDeadlock timeout from cinder.
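
      For context, oslo.db ships a deadlock-retry decorator that Cinder-style DB APIs can wrap their writes in. A minimal sketch is below (the function name and retry values are illustrative, not taken from this deployment); note that a retry which hits the same lock still blocks for up to innodb_lock_wait_timeout, so retries alone do not prevent the HTTP-level timeout seen further down.

      from oslo_db import api as oslo_db_api

      # Illustrative sketch: wrap_db_retry re-runs the wrapped call whenever
      # oslo.db raises DBDeadlock -- which is also how pymysql error 1205
      # ("Lock wait timeout exceeded") surfaces once oslo.db translates it.
      @oslo_db_api.wrap_db_retry(max_retries=5, retry_on_deadlock=True)
      def attachment_insert(context, values):
          # Placeholder body: the real Cinder DB API issues the
          # INSERT INTO volume_attachment seen in the log below.
          pass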

      Most recently it looks to be hitting the 300s innodb_lock_wait_timeout that we set in galera.cnf (the setting itself is shown after the excerpt below):

      2024-04-08 00:03:14.558 40 INFO cinder.api.openstack.wsgi [req-42b35494-5611-4ae6-8cfb-3f0990e8dd2b ca0aa87bb5d247ae8a122230c4883414 2c4a0d38096d4aa7a1de13bd0faa319f - default default] POST http://<ip>:8776/v3/2c4a0d38096d4aa7a1de13bd0faa319f/attachments

      2024-04-08 00:08:15.615 40 ERROR cinder.api.v3.attachments [req-42b35494-5611-4ae6-8cfb-3f0990e8dd2b ca0aa87bb5d247ae8a122230c4883414 2c4a0d38096d4aa7a1de13bd0faa319f - default default] Unable to create attachment for volume.: oslo_db.exception.DBDeadlock: (pymysql.err.OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction')
      [SQL: INSERT INTO volume_attachment (created_at, updated_at, deleted_at, deleted, id, volume_id, instance_uuid, attached_host, mountpoint, attach_time, detach_time, attach_status, attach_mode, connection_info, connector) VALUES (%(created_at)s, %(updated_at)s, %(deleted_at)s, %(deleted)s, %(id)s, %(volume_id)s, %(instance_uuid)s, %(attached_host)s, %(mountpoint)s, %(attach_time)s, %(detach_time)s, %(attach_status)s, %(attach_mode)s, %(connection_info)s, %(connector)s)]
      [parameters: {'created_at': datetime.datetime(2024, 4, 8, 0, 3, 14, 598156), 'updated_at': None, 'deleted_at': None, 'deleted': 0, 'id': 'adb19bec-f6e7-4d45-8c1a-c0804887eac7', 'volume_id': '0ddeb9f5-f1e9-4ee8-afe8-ba5ffabf28f3', 'instance_uuid': '100c8675-12c9-4d40-b3d5-29f73d191bf9', 'attached_host': None, 'mountpoint': None, 'attach_time': None, 'detach_time': None, 'attach_status': 'reserved', 'attach_mode': None, 'connection_info': None, 'connector': None}]
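
      The roughly five-minute gap between the POST at 00:03:14 and the error at 00:08:15 above is consistent with that 300s lock wait timeout. For reference, the corresponding galera.cnf entry would look something like the following; the [mysqld] section name is the usual location for InnoDB settings, and the exact layout in this deployment may differ:

      [mysqld]
      innodb_lock_wait_timeout = 300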

      This is returning a 504 to nova:

      2024-04-08 00:08:14.655 20 ERROR nova.volume.cinder [req-c1ec1a1e-be35-4c42-a489-99f72bae785c ca0aa87bb5d247ae8a122230c4883414 2c4a0d38096d4aa7a1de13bd0faa319f - default default] [instance: 100c8675-12c9-4d40-b3d5-29f73d191bf9] Create attachment failed for volume 0ddeb9f5-f1e9-4ee8-afe8-ba5ffabf28f3. Error: Gateway Timeout (HTTP 504) Code: 504: cinderclient.exceptions.ClientException: Gateway Timeout (HTTP 504)
      2024-04-08 00:08:14.656 20 INFO nova.compute.api [req-c1ec1a1e-be35-4c42-a489-99f72bae785c ca0aa87bb5d247ae8a122230c4883414 2c4a0d38096d4aa7a1de13bd0faa319f - default default] Failed validating volume 0ddeb9f5-f1e9-4ee8-afe8-ba5ffabf28f3. Error: Gateway Timeout (HTTP 504)
      2024-04-08 00:08:14.656 20 INFO nova.api.openstack.wsgi [req-c1ec1a1e-be35-4c42-a489-99f72bae785c ca0aa87bb5d247ae8a122230c4883414 2c4a0d38096d4aa7a1de13bd0faa319f - default default] HTTP exception thrown: Block Device Mapping is Invalid: failed to get volume 0ddeb9f5-f1e9-4ee8-afe8-ba5ffabf28f3.
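
      To narrow down which transaction is holding the lock while one of these attachment requests is stuck, it may help to query information_schema.INNODB_TRX on the active Galera node during the failure window. A rough sketch is below; the connection details are placeholders, not values from this environment:

      #!/usr/bin/env python3
      # Hypothetical diagnostic helper: list open InnoDB transactions so the
      # lock holder blocking the volume_attachment INSERT can be identified.
      import pymysql

      conn = pymysql.connect(host='127.0.0.1', user='root', password='***',
                             database='information_schema')
      try:
          with conn.cursor() as cur:
              cur.execute(
                  "SELECT trx_id, trx_state, trx_started, trx_rows_locked, "
                  "trx_query FROM INNODB_TRX ORDER BY trx_started")
              for row in cur.fetchall():
                  print(row)
      finally:
          conn.close()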

      Version-Release number of selected component (if applicable):

      How reproducible:
      Randomly on weekdays, but it seems to be most frequent during a one-hour window between 00:00 and 01:00 UTC, when test stacks are created.

      Steps to Reproduce:
      1. Create heat stacks with boot-from-volume instances.

      Actual results:
      Most stacks are created successfully, but some fail, even though all use the same template.

      Expected results:
      All stacks are created successfully.

      Additional info:
      Previously, we were seeing a DBDeadlock timeout during volume creation, but that does not seem to be the case this time. I'm not sure whether this is due to the modified timeout values or whether the deadlock issue can affect multiple different operations at random.

      2024-03-19 00:34:36.120 76 ERROR oslo_messaging.rpc.server oslo_db.exception.DBDeadlock: (pymysql.err.OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction')
      2024-03-19 00:34:36.120 76 ERROR oslo_messaging.rpc.server [SQL: INSERT INTO volume_glance_metadata (created_at, updated_at, deleted_at, deleted, volume_id, snapshot_id, `key`, value) VALUES (%(created_at)s, %(updated_at)s, %(deleted_at)s, %(deleted)s, %(volume_id)s, %(snapshot_id)s, %(key)s, %(value)s)]
      2024-03-19 00:34:36.120 76 ERROR oslo_messaging.rpc.server [parameters: {'created_at': datetime.datetime(2024, 3, 19, 0, 33, 42, 644791), 'updated_at': None, 'deleted_at': None, 'deleted': 0, 'volume_id': '0c91f47f-7958-4f70-98cc-20f7d70dda16', 'snapshot_id': None, 'key': 'hw_pointer_model', 'value': 'usbtablet'}]
