Bug
Resolution: Unresolved
Target version: rhos-18.0.16
Component: rhos-ops-platform-services-pidone
Severity: Important
To reproduce
The customer provided a detailed report and reproducer for this problem; it is translated below.
Environment:
- 6 OpenShift nodes (3 masters, 3 workers).
- RHOSO 18.0.16 control plane is deployed only on workers.
Problem:
After deploying RHOSO with quorum queues, various failure scenarios were tested. After a rolling reboot of the OpenShift worker nodes, RPC communication problems between the Cinder Scheduler and Cinder Volume services were identified:
- Errors related to the "cinder-scheduler_fanout" queue are logged by both Cinder and RabbitMQ:
  - Cinder Volume reports an error when posting messages to it.
  - RabbitMQ reports cryptic errors around the same queue.
  - Cinder Scheduler does not report any errors directly, but complains about problems with messages from Cinder Volume. For each attempt by Cinder Volume to connect and send a message, there is a corresponding Cinder Scheduler log entry about receiving a message from Cinder Volume.
During all the tests we did, the issue was always related to those specific fanout queues. We tried investigating the queues themselves: "rabbitmq-queues check_if_node_is_quorum_critical" and "rabbitmq-queues quorum_status <queue>" did not report any issues.
The customer was able to reliably reproduce the issue:
- rolling restart of the OpenShift worker nodes - issue reproduced
- manually killing the cinder-scheduler and rabbitmq pods (kill the pods on the current worker, wait for them to restart, then move on to the pods on the next worker node) - issue reproduced
- downgrading cinder-scheduler to 18.0.14 (to match cinder-volume), then manually killing the cinder-scheduler and rabbitmq pods - issue reproduced
It looks like the issue is triggered in the specific case where the cinder-scheduler and rabbitmq pods are restarted simultaneously.
Usually the issue triggers when the pods on the last node are restarted; actions against pods on the first two worker nodes do not seem to trigger it.
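The manual kill-pods reproducer above can be sketched as a script. Everything here is an assumption for illustration: the namespace, node names, and label selectors are hypothetical and must be adjusted to the actual deployment. To keep it safe to review, the script only prints the oc commands it would run (dry run) instead of executing them:

```shell
#!/bin/sh
# Dry-run sketch of the manual reproducer (kill cinder-scheduler and
# rabbitmq pods node by node). All names below are HYPOTHETICAL:
# adjust NAMESPACE, WORKERS, and the label selectors to the deployment.
NAMESPACE=openstack
WORKERS="worker-0 worker-1 worker-2"

run() {
    # Print the command instead of executing it, so the sequence
    # can be reviewed (drop this wrapper to actually run the steps).
    echo "+ $*"
}

for node in $WORKERS; do
    # Kill the cinder-scheduler and rabbitmq pods scheduled on this node.
    run oc -n "$NAMESPACE" delete pod \
        --field-selector "spec.nodeName=$node" -l app=cinder-scheduler
    run oc -n "$NAMESPACE" delete pod \
        --field-selector "spec.nodeName=$node" -l app.kubernetes.io/name=rabbitmq
    # Wait for the replacement pods before moving to the next node;
    # per the report, the issue usually triggers on the last node.
    run oc -n "$NAMESPACE" wait pod -l app=cinder-scheduler \
        --for=condition=Ready --timeout=300s
    run oc -n "$NAMESPACE" wait pod -l app.kubernetes.io/name=rabbitmq \
        --for=condition=Ready --timeout=300s
done
```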
Expected behavior
The cluster is fully functional after recovery.
Bug impact
It can create hard-to-diagnose problems in the customer's environment and has a significant impact on support operations.
Known workarounds (each works on its own):
- restart the Cinder Scheduler pods one by one
- set [oslo_messaging_rabbit]/rabbit_transient_quorum_queue = false
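For the second workaround, one possible way to apply the option in a RHOSO deployment is through the control plane CR. This is only a sketch under assumptions: it assumes the control plane is managed by an OpenStackControlPlane resource and that Cinder picks up an oslo.config fragment via customServiceConfig; verify the exact field layout against the deployed CRD before using it.

```yaml
# Hypothetical sketch - field layout must be checked against the actual CRD.
apiVersion: core.openstack.org/v1beta1
kind: OpenStackControlPlane
metadata:
  name: openstack  # assumed CR name
spec:
  cinder:
    template:
      # oslo.config (INI) fragment disabling quorum queues for
      # transient (fanout/reply) queues, per the workaround above.
      customServiceConfig: |
        [oslo_messaging_rabbit]
        rabbit_transient_quorum_queue = false
```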
Additional context
Logs and a must-gather are attached to the support case.
This situation may have been addressed upstream via https://bugs.launchpad.net/oslo.messaging/+bug/2028384 and https://bugs.launchpad.net/oslo.messaging/+bug/2031497, but I am not 100% sure there is a direct connection; only the workarounds mentioned were tried.
This situation seems to have been introduced by one of the fixes for https://issues.redhat.com/browse/OSPRH-19160.