Bug
Resolution: Unresolved
Target version: rhos-18.0.16
Component: rhos-ops-platform-services-pidone
Severity: Important
To reproduce
The customer provided a detailed report and reproducer for this problem; it is translated below.
Environment:
- 6 OpenShift nodes (3 masters, 3 workers).
- RHOSO 18.0.16 control plane is deployed only on workers.
Problem:
After deploying RHOSO with quorum queues, various failure scenarios were tested. After a rolling reboot of the OpenShift worker nodes, RPC communication problems between the Cinder Scheduler and Cinder Volume services were identified:
- Errors related to the "cinder-scheduler_fanout" queue are logged by both Cinder and RabbitMQ:
  - Cinder Volume reports an error when posting messages to it.
  - RabbitMQ reports cryptic errors around the same queue.
  - Cinder Scheduler does not report any errors directly, but complains about problems with messages from Cinder Volume. For each attempt by Cinder Volume to connect and send a message, there is a corresponding Cinder Scheduler log entry about receiving a message from Cinder Volume.
During all the tests we did, the issue was always related to those specific fanout queues. We tried investigating the queues themselves: "rabbitmq-queues check_if_node_is_quorum_critical" and "rabbitmq-queues quorum_status <queue>" did not report any issues.
The customer was able to reliably reproduce the issue:
- rolling restart of the OpenShift worker nodes - issue reproduced
- manually killing the cinder-scheduler and rabbitmq pods (kill the pods on the current worker, wait for them to restart, then move on to the pods on the next worker node) - issue reproduced
- downgrading cinder-scheduler to 18.0.14 (to match cinder-volume), then manually killing the cinder-scheduler and rabbitmq pods - issue reproduced
It looks like the issue is triggered in the specific case where the cinder-scheduler and rabbitmq pods are restarted simultaneously.
Usually the issue triggers when the pods on the last node are restarted; actions against pods on the first two worker nodes do not seem to trigger it.
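The manual kill-pods reproducer above can be sketched as a script. Everything here is an assumption for illustration: the namespace, node names, and label selectors are hypothetical and must be adjusted to the actual deployment. To keep it safe to review, the script only prints the oc commands it would run (dry run) instead of executing them:

```shell
#!/bin/sh
# Dry-run sketch of the manual reproducer (kill cinder-scheduler and
# rabbitmq pods node by node). All names below are HYPOTHETICAL:
# adjust NAMESPACE, WORKERS, and the label selectors to the deployment.
NAMESPACE=openstack
WORKERS="worker-0 worker-1 worker-2"

run() {
    # Print the command instead of executing it, so the sequence
    # can be reviewed (drop this wrapper to actually run the steps).
    echo "+ $*"
}

for node in $WORKERS; do
    # Kill the cinder-scheduler and rabbitmq pods scheduled on this node.
    run oc -n "$NAMESPACE" delete pod \
        --field-selector "spec.nodeName=$node" -l app=cinder-scheduler
    run oc -n "$NAMESPACE" delete pod \
        --field-selector "spec.nodeName=$node" -l app.kubernetes.io/name=rabbitmq
    # Wait for the replacement pods before moving to the next node;
    # per the report, the issue usually triggers on the last node.
    run oc -n "$NAMESPACE" wait pod -l app=cinder-scheduler \
        --for=condition=Ready --timeout=300s
    run oc -n "$NAMESPACE" wait pod -l app.kubernetes.io/name=rabbitmq \
        --for=condition=Ready --timeout=300s
done
```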
Expected behavior
The cluster is fully functional after recovery.
Bug impact
It can create hard-to-diagnose problems in the customer's environment and has a significant impact on support operations.
Known workarounds (each works on its own):
- restart the Cinder Scheduler pods one by one
- set [oslo_messaging_rabbit]/rabbit_transient_quorum_queue = false
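For the second workaround, one possible way to apply the option in a RHOSO deployment is through the control plane CR. This is only a sketch under assumptions: it assumes the control plane is managed by an OpenStackControlPlane resource and that Cinder picks up an oslo.config fragment via customServiceConfig; verify the exact field layout against the deployed CRD before using it.

```yaml
# Hypothetical sketch - field layout must be checked against the actual CRD.
apiVersion: core.openstack.org/v1beta1
kind: OpenStackControlPlane
metadata:
  name: openstack  # assumed CR name
spec:
  cinder:
    template:
      # oslo.config (INI) fragment disabling quorum queues for
      # transient (fanout/reply) queues, per the workaround above.
      customServiceConfig: |
        [oslo_messaging_rabbit]
        rabbit_transient_quorum_queue = false
```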
Additional context
Logs and a must-gather are attached to the support case.
This situation may have been addressed upstream via https://bugs.launchpad.net/oslo.messaging/+bug/2028384 and https://bugs.launchpad.net/oslo.messaging/+bug/2031497, but I am not 100% sure there is a direct connection; only the workarounds mentioned were tried.
This situation seems to have been introduced by one of the fixes for https://issues.redhat.com/browse/OSPRH-19160.