Red Hat OpenStack Services on OpenShift
OSPRH-20070

Cinder replicated volume stuck in error state after failover due to permission denied in DCN environment


    • Severity: Important

      During cluster failover in a DCN environment, Cinder gets a permission denied error when trying to connect to the primary cluster (expected, since that cluster is down). However, this causes the replicated volume to get stuck in the error state permanently, even after a successful failback.

      Steps to Reproduce:
      1. Create a replicated volume - it is created successfully and replicates to both clusters
      2. Perform failover: cinder failover-host cinder-088b2-volume-az0-0@ceph
      3. Cinder tries to demote volume on primary cluster but gets permission denied (cluster is down)
      4. Volume goes to error state
      5. Perform failback: cinder failover-host cinder-088b2-volume-az0-0@ceph --backend_id default
      6. Primary cluster is accessible again, new volumes work fine
      7. Original volume remains in error state

      Expected: Original volume should be recoverable after failback
      Actual: Volume permanently stuck in error state

      Impact: volumes become unrecoverable even though the data exists on both clusters
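The failure sequence above can be modeled with a minimal, hypothetical Python sketch (illustrative only, not the actual Cinder RBD driver code; all names below are made up):

```python
# Hypothetical model of the observed state transitions. The real logic lives
# in cinder/volume/drivers/rbd.py; nothing below is taken from that file.

class ClusterDown(Exception):
    """Stand-in for rados.PermissionDeniedError while the primary is down."""

class Volume:
    def __init__(self, name):
        self.name = name
        self.status = "available"

def demote_on_primary(volume, primary_up):
    # During failover, each replicated image is demoted on the primary.
    if not primary_up:
        raise ClusterDown("[errno 13] RADOS permission denied")

def failover(volumes, primary_up):
    for vol in volumes:
        try:
            demote_on_primary(vol, primary_up)
        except ClusterDown:
            # The connection failure is recorded as a per-volume error ...
            vol.status = "error"

def failback(volumes):
    # ... and the failback path never revisits that status, which is
    # the stuck state this report describes.
    pass

vol = Volume("replicated-vol")
failover([vol], primary_up=False)  # primary down -> permission denied
failback([vol])                    # primary reachable again
print(vol.status)                  # still 'error'
```

The point of the sketch is that the error status is set as a side effect of the failed demote, and nothing in the failback path clears it, which matches the CLI output and logs that follow.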


      CLI Output and Logs:

      1. Volume created successfully and replicated
        sh-5.1$ openstack volume create replicated-vol --type replication --size 1
      2. Status: available, replicated to both clusters

      sh-5.1$ openstack volume list
      +--------------------------------------+----------------+-----------+------+-------------+
      | ID                                   | Name           | Status    | Size | Attached to |
      +--------------------------------------+----------------+-----------+------+-------------+
      | c3303940-d34c-4d18-8d39-2fa24239c0c5 | replicated-vol | available |    1 |             |
      +--------------------------------------+----------------+-----------+------+-------------+

      Primary cluster:
      ============
      [ceph: root@compute-sk7uefn8-0 /]# rbd -p volumes ls | grep -i c3303940-d34c-4d18-8d39-2fa24239c0c5
      volume-c3303940-d34c-4d18-8d39-2fa24239c0c5

      Secondary cluster:
      ==============
      [ceph: root@dcn1-compute-az1-sk7uefn8-0 /]# rbd -p volumes ls | grep -i c3303940-d34c-4d18-8d39-2fa24239c0c5
      volume-c3303940-d34c-4d18-8d39-2fa24239c0c5

      3. Failover triggers permission denied error
        sh-5.1$ cinder failover-host cinder-088b2-volume-az0-0@ceph
      4. Error logs show permission denied when connecting to the cluster
        [zuul@controller-0 ~]$ oc -n openstack logs cinder-088b2-volume-az0-0
        ...
        r connecting to ceph cluster.: rados.PermissionDeniedError: [errno 13] RADOS permission denied (error connecting to the cluster)
        2025-09-18 08:47:50.156 2249656 ERROR cinder.volume.drivers.rbd Traceback (most recent call last):
        2025-09-18 08:47:50.156 2249656 ERROR cinder.volume.drivers.rbd File "/usr/lib/python3.9/site-packages/cinder/volume/drivers/rbd.py", line 622, in _do_conn
        2025-09-18 08:47:50.156 2249656 ERROR cinder.volume.drivers.rbd client.connect()
        2025-09-18 08:47:50.156 2249656 ERROR cinder.volume.drivers.rbd File "/usr/lib/python3.9/site-packages/eventlet/tpool.py", line 193, in doit
        2025-09-18 08:47:50.156 2249656 ERROR cinder.volume.drivers.rbd result = proxy_call(self._autowrap, f, *args, **kwargs)
        2025-09-18 08:47:50.156 2249656 ERROR cinder.volume.drivers.rbd File "/usr/lib/python3.9/site-packages/eventlet/tpool.py", line 151, in proxy_call
        2025-09-18 08:47:50.156 2249656 ERROR cinder.volume.drivers.rbd rv = execute(f, *args, **kwargs)
        2025-09-18 08:47:50.156 2249656 ERROR cinder.volume.drivers.rbd File "/usr/lib/python3.9/site-packages/eventlet/tpool.py", line 132, in execute
        2025-09-18 08:47:50.156 2249656 ERROR cinder.volume.drivers.rbd six.reraise(c, e, tb)
        2025-09-18 08:47:50.156 2249656 ERROR cinder.volume.drivers.rbd File "/usr/lib/python3.9/site-packages/six.py", line 709, in reraise
        2025-09-18 08:47:50.156 2249656 ERROR cinder.volume.drivers.rbd raise value
        2025-09-18 08:47:50.156 2249656 ERROR cinder.volume.drivers.rbd File "/usr/lib/python3.9/site-packages/eventlet/tpool.py", line 86, in tworker
        2025-09-18 08:47:50.156 2249656 ERROR cinder.volume.drivers.rbd rv = meth(*args, **kwargs)
        2025-09-18 08:47:50.156 2249656 ERROR cinder.volume.drivers.rbd File "rados.pyx", line 690, in rados.Rados.connect
        2025-09-18 08:47:50.156 2249656 ERROR cinder.volume.drivers.rbd rados.PermissionDeniedError: [errno 13] RADOS permission denied (error connecting to the cluster)
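The traceback shows `_do_conn` raising through eventlet's tpool, and the later failback log line ("after 15.106(s), this was the 3rd time calling it") shows the connect is wrapped in a tenacity retry. A rough stdlib approximation of that retry pattern (the attempt count and delay here are assumptions, not the driver's actual values):

```python
import time

class PermissionDeniedError(Exception):
    """Stand-in for rados.PermissionDeniedError ([errno 13])."""

def retry(attempts=3, delay=0.0):
    # Minimal stand-in for the tenacity decorator the driver uses.
    def wrap(fn):
        def inner(*args, **kwargs):
            last_exc = None
            for _ in range(attempts):
                try:
                    return fn(*args, **kwargs)
                except PermissionDeniedError as exc:
                    last_exc = exc
                    time.sleep(delay)
            # All attempts failed: the error surfaces, as in the traceback.
            raise last_exc
        return inner
    return wrap

calls = []

@retry(attempts=3)
def _do_conn():
    calls.append(1)
    raise PermissionDeniedError("[errno 13] RADOS permission denied")

try:
    _do_conn()
except PermissionDeniedError:
    pass

print(len(calls))  # 3 attempts before the error propagates
```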
      5. Volume now in error state
        sh-5.1$ openstack volume list
        +--------------------------------------+----------------+--------+------+-------------+
        | ID                                   | Name           | Status | Size | Attached to |
        +--------------------------------------+----------------+--------+------+-------------+
        | c3303940-d34c-4d18-8d39-2fa24239c0c5 | replicated-vol | error  |    1 |             |
        +--------------------------------------+----------------+--------+------+-------------+

      6. After failback: new volumes work, but the old volume remains in error
        sh-5.1$ cinder failover-host cinder-088b2-volume-az0-0@ceph --backend_id default
      7. Cinder successfully established a connection to the primary cluster
        [zuul@controller-0 ~]$ oc -n openstack logs cinder-088b2-volume-az0-0
        ...
        'cinder.volume.drivers.rbd.RBDDriver._connect_to_rados.<locals>._do_conn' after 15.106(s), this was the 3rd time calling it. log_it /usr/lib/python3.9/site-packages/tenacity/after.py:30
        2025-09-18 08:49:12.420 2249656 DEBUG cinder.volume.drivers.rbd [None req-6d292cb3-12fd-4cdc-b8ca-90b514aef511 afc1d43a372b4fce8a0e8395316c352f 45d90fbdd05f4b36bc899185ffd950d2 - - - -] Failed to demote volume-c3303940-d34c-4d18-8d39-2fa24239c0c5 with error: Bad or unexpected response from the storage volume backend API: Error connecting to ceph cluster.. _demote_volumes /usr/lib/python3.9/site-packages/cinder/volume/drivers/rbd.py:1759
        2025-09-18 08:49:12.421 2249656 DEBUG cinder.volume.drivers.rbd [None req-6d292cb3-12fd-4cdc-b8ca-90b514aef511 afc1d43a372b4fce8a0e8395316c352f 45d90fbdd05f4b36bc899185ffd950d2 - - - -] connecting to openstack@az0 (conf=/etc/ceph/az0.conf, timeout=5). _do_conn /usr/lib/python3.9/site-packages/cinder/volume/drivers/rbd.py:605
        2025-09-18 08:49:12.563 2249656 INFO cinder.volume.drivers.rbd [None req-6d292cb3-12fd-4cdc-b8ca-90b514aef511 afc1d43a372b4fce8a0e8395316c352f 45d90fbdd05f4b36bc899185ffd950d2 - - - -] RBD driver failover completed.
        2025-09-18 08:49:12.564 2249656 INFO cinder.volume.drivers.rbd [None req-6d292cb3-12fd-4cdc-b8ca-90b514aef511 afc1d43a372b4fce8a0e8395316c352f 45d90fbdd05f4b36bc899185ffd950d2 - - - -] RBD driver failover completion started.
        2025-09-18 08:49:12.564 2249656 INFO cinder.volume.drivers.rbd [None req-6d292cb3-12fd-4cdc-b8ca-90b514aef511 afc1d43a372b4fce8a0e8395316c352f 45d90fbdd05f4b36bc899185ffd950d2 - - - -] RBD driver failover completion completed.
        2025-09-18 08:49:12.602 2249656 INFO cinder.volume.manager [None req-6d292cb3-12fd-4cdc-b8ca-90b514aef511 afc1d43a372b4fce8a0e8395316c352f 45d90fbdd05f4b36bc899185ffd950d2 - - - -] Failed over to replication target successfully.
        sh-5.1$ openstack volume create replated-vol2 --type replication --size 1 # Works fine

      sh-5.1$ openstack volume list
      +--------------------------------------+----------------+-----------+------+-------------+
      | ID                                   | Name           | Status    | Size | Attached to |
      +--------------------------------------+----------------+-----------+------+-------------+
      | 43942bb1-8e17-4897-813d-8efeadd25e4d | replated-vol2  | available |    1 |             |
      | c3303940-d34c-4d18-8d39-2fa24239c0c5 | replicated-vol | error     |    1 |             | <- STUCK
      +--------------------------------------+----------------+-----------+------+-------------+
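Since the `rbd ls` output earlier shows the image intact on both clusters, the volume is only apparently lost. A hypothetical sketch of what a recovery step on failback completion could look like (illustrative only; this function name and structure are not the actual driver's):

```python
def failback_completed(volumes, image_exists_on_primary):
    """Clear the error status for volumes whose image survived on the primary.

    Hypothetical helper: 'volumes' is a list of dicts and
    'image_exists_on_primary' is a callable, both made up for this sketch.
    """
    recovered = []
    for vol in volumes:
        # If the image is still present, the earlier demote failure was
        # transient and the volume can be made usable again.
        if vol["status"] == "error" and image_exists_on_primary(vol["name"]):
            vol["status"] = "available"
            recovered.append(vol["name"])
    return recovered

vols = [
    {"name": "replated-vol2", "status": "available"},
    {"name": "replicated-vol", "status": "error"},
]
print(failback_completed(vols, lambda name: True))  # ['replicated-vol']
```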

      Assignee: Rajat Dhasmana (rdhasman@redhat.com)
      Reporter: liron kuchlani (lkuchlan)
      Component: rhos-storage-cinder