Bug
Resolution: Unresolved
Major
Important
During cluster failover in a DCN environment, Cinder gets a permission denied error when it tries to connect to the primary cluster. This is expected, since the cluster is down. However, the error leaves the replicated volume stuck in the error state permanently, even after a successful failback.
Steps to Reproduce:
1. Create a replicated volume. This works, and the volume replicates to both clusters.
2. Perform failover: cinder failover-host cinder-088b2-volume-az0-0@ceph
3. Cinder tries to demote the volume on the primary cluster but gets permission denied (the cluster is down).
4. The volume goes to the error state.
5. Perform failback: cinder failover-host cinder-088b2-volume-az0-0@ceph --backend_id default
6. The primary cluster is accessible again, and new volumes work fine.
7. The original volume remains in the error state.
Expected: The original volume should be recoverable after failback.
Actual: The volume is permanently stuck in the error state.
Impact: Volumes become unrecoverable even though the data exists on both clusters.
—
CLI Output and Logs:
- Volume created successfully and replicated
sh-5.1$ openstack volume create replicated-vol --type replication --size 1
Status: available, replicated to both clusters
sh-5.1$ openstack volume list
-----------------------------------------------------------------------------
ID | Name | Status | Size | Attached to |
-----------------------------------------------------------------------------
c3303940-d34c-4d18-8d39-2fa24239c0c5 | replicated-vol | available | 1 | |
-----------------------------------------------------------------------------
Primary cluster:
============
[ceph: root@compute-sk7uefn8-0 /]# rbd -p volumes ls | grep -i c3303940-d34c-4d18-8d39-2fa24239c0c5
volume-c3303940-d34c-4d18-8d39-2fa24239c0c5
Secondary cluster:
==============
[ceph: root@dcn1-compute-az1-sk7uefn8-0 /]# rbd -p volumes ls | grep -i c3303940-d34c-4d18-8d39-2fa24239c0c5
volume-c3303940-d34c-4d18-8d39-2fa24239c0c5
- Failover triggers permission denied error
sh-5.1$ cinder failover-host cinder-088b2-volume-az0-0@ceph
- Error logs show permission denied when connecting to cluster
[zuul@controller-0 ~]$ oc -n openstack logs cinder-088b2-volume-az0-0
...
r connecting to ceph cluster.: rados.PermissionDeniedError: [errno 13] RADOS permission denied (error connecting to the cluster)
2025-09-18 08:47:50.156 2249656 ERROR cinder.volume.drivers.rbd Traceback (most recent call last):
2025-09-18 08:47:50.156 2249656 ERROR cinder.volume.drivers.rbd File "/usr/lib/python3.9/site-packages/cinder/volume/drivers/rbd.py", line 622, in _do_conn
2025-09-18 08:47:50.156 2249656 ERROR cinder.volume.drivers.rbd client.connect()
2025-09-18 08:47:50.156 2249656 ERROR cinder.volume.drivers.rbd File "/usr/lib/python3.9/site-packages/eventlet/tpool.py", line 193, in doit
2025-09-18 08:47:50.156 2249656 ERROR cinder.volume.drivers.rbd result = proxy_call(self._autowrap, f, *args, **kwargs)
2025-09-18 08:47:50.156 2249656 ERROR cinder.volume.drivers.rbd File "/usr/lib/python3.9/site-packages/eventlet/tpool.py", line 151, in proxy_call
2025-09-18 08:47:50.156 2249656 ERROR cinder.volume.drivers.rbd rv = execute(f, *args, **kwargs)
2025-09-18 08:47:50.156 2249656 ERROR cinder.volume.drivers.rbd File "/usr/lib/python3.9/site-packages/eventlet/tpool.py", line 132, in execute
2025-09-18 08:47:50.156 2249656 ERROR cinder.volume.drivers.rbd six.reraise(c, e, tb)
2025-09-18 08:47:50.156 2249656 ERROR cinder.volume.drivers.rbd File "/usr/lib/python3.9/site-packages/six.py", line 709, in reraise
2025-09-18 08:47:50.156 2249656 ERROR cinder.volume.drivers.rbd raise value
2025-09-18 08:47:50.156 2249656 ERROR cinder.volume.drivers.rbd File "/usr/lib/python3.9/site-packages/eventlet/tpool.py", line 86, in tworker
2025-09-18 08:47:50.156 2249656 ERROR cinder.volume.drivers.rbd rv = meth(*args, **kwargs)
2025-09-18 08:47:50.156 2249656 ERROR cinder.volume.drivers.rbd File "rados.pyx", line 690, in rados.Rados.connect
2025-09-18 08:47:50.156 2249656 ERROR cinder.volume.drivers.rbd rados.PermissionDeniedError: [errno 13] RADOS permission denied (error connecting to the cluster)
- Volume now in error state
sh-5.1$ openstack volume list
-----------------------------------------------------------------------------
ID | Name | Status | Size | Attached to |
-----------------------------------------------------------------------------
c3303940-d34c-4d18-8d39-2fa24239c0c5 | replicated-vol | error | 1 | |
-----------------------------------------------------------------------------
- After failback: new volumes work, the original volume is still in error
sh-5.1$ cinder failover-host cinder-088b2-volume-az0-0@ceph --backend_id default
- Cinder successfully established a connection to the primary cluster
[zuul@controller-0 ~]$ oc -n openstack logs cinder-088b2-volume-az0-0
...
'cinder.volume.drivers.rbd.RBDDriver._connect_to_rados.<locals>._do_conn' after 15.106(s), this was the 3rd time calling it. log_it /usr/lib/python3.9/site-packages/tenacity/after.py:30
2025-09-18 08:49:12.420 2249656 DEBUG cinder.volume.drivers.rbd [None req-6d292cb3-12fd-4cdc-b8ca-90b514aef511 afc1d43a372b4fce8a0e8395316c352f 45d90fbdd05f4b36bc899185ffd950d2 - - - -] Failed to demote volume-c3303940-d34c-4d18-8d39-2fa24239c0c5 with error: Bad or unexpected response from the storage volume backend API: Error connecting to ceph cluster.. _demote_volumes /usr/lib/python3.9/site-packages/cinder/volume/drivers/rbd.py:1759
2025-09-18 08:49:12.421 2249656 DEBUG cinder.volume.drivers.rbd [None req-6d292cb3-12fd-4cdc-b8ca-90b514aef511 afc1d43a372b4fce8a0e8395316c352f 45d90fbdd05f4b36bc899185ffd950d2 - - - -] connecting to openstack@az0 (conf=/etc/ceph/az0.conf, timeout=5). _do_conn /usr/lib/python3.9/site-packages/cinder/volume/drivers/rbd.py:605
2025-09-18 08:49:12.563 2249656 INFO cinder.volume.drivers.rbd [None req-6d292cb3-12fd-4cdc-b8ca-90b514aef511 afc1d43a372b4fce8a0e8395316c352f 45d90fbdd05f4b36bc899185ffd950d2 - - - -] RBD driver failover completed.
2025-09-18 08:49:12.564 2249656 INFO cinder.volume.drivers.rbd [None req-6d292cb3-12fd-4cdc-b8ca-90b514aef511 afc1d43a372b4fce8a0e8395316c352f 45d90fbdd05f4b36bc899185ffd950d2 - - - -] RBD driver failover completion started.
2025-09-18 08:49:12.564 2249656 INFO cinder.volume.drivers.rbd [None req-6d292cb3-12fd-4cdc-b8ca-90b514aef511 afc1d43a372b4fce8a0e8395316c352f 45d90fbdd05f4b36bc899185ffd950d2 - - - -] RBD driver failover completion completed.
2025-09-18 08:49:12.602 2249656 INFO cinder.volume.manager [None req-6d292cb3-12fd-4cdc-b8ca-90b514aef511 afc1d43a372b4fce8a0e8395316c352f 45d90fbdd05f4b36bc899185ffd950d2 - - - -] Failed over to replication target successfully.
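The log excerpt above shows the failed demotion only as a DEBUG message, so affected volumes are easy to miss. As a triage aid, here is a minimal sketch that pulls the volume IDs out of such lines. It assumes the message text matches the "Failed to demote volume-<uuid>" form seen in this log; the function name `volumes_failed_to_demote` is hypothetical and not part of Cinder.

```python
import re

# Assumption: the driver logs "Failed to demote volume-<uuid>" as in the
# excerpt above. Other Cinder versions may word this message differently.
DEMOTE_FAILURE = re.compile(
    r"Failed to demote volume-"
    r"(?P<volume_id>[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})"
)


def volumes_failed_to_demote(log_lines):
    """Return unique volume IDs whose demotion failed, in first-seen order."""
    seen = []
    for line in log_lines:
        match = DEMOTE_FAILURE.search(line)
        if match and match.group("volume_id") not in seen:
            seen.append(match.group("volume_id"))
    return seen
```

The function accepts any iterable of lines, so it can be fed an open log file or the saved output of the `oc -n openstack logs cinder-088b2-volume-az0-0` command used in this report.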
sh-5.1$ openstack volume create replated-vol2 --type replication --size 1 # Works fine
sh-5.1$ openstack volume list
-----------------------------------------------------------------------------
ID | Name | Status | Size | Attached to |
-----------------------------------------------------------------------------
43942bb1-8e17-4897-813d-8efeadd25e4d | replated-vol2 | available | 1 | |
c3303940-d34c-4d18-8d39-2fa24239c0c5 | replicated-vol | error | 1 | <- STUCK |
-----------------------------------------------------------------------------
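To check for stuck volumes after failback without reading the table by eye, the listing can be filtered programmatically. The sketch below assumes the listing was produced with `openstack volume list -f json` and that the JSON keys match the column headers shown above ("ID", "Name", "Status"); the function name `volumes_in_error` is hypothetical.

```python
import json


def volumes_in_error(volume_list_json):
    """Return (id, name) pairs for volumes whose status is 'error'.

    Assumption: the input is the JSON text of `openstack volume list -f json`,
    a list of objects keyed by the column headers of the table output.
    """
    volumes = json.loads(volume_list_json)
    return [
        (vol.get("ID"), vol.get("Name"))
        for vol in volumes
        if vol.get("Status") == "error"
    ]
```

For the state shown in this report, the result would contain only the original volume c3303940-d34c-4d18-8d39-2fa24239c0c5, while the volume created after failback would be omitted.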