-
Feature
-
Resolution: Done
0% To Do, 0% In Progress, 100% Done
-
Enhancement
-
-
Done
-
- Proposed title of this feature request
- Implement RabbitMQ quorum queues
- What is the nature and description of the request?
- Implement quorum queues within the RabbitMQ deployment used for RHOSO. Mirrored HA queues are deprecated in RabbitMQ and have not been implemented or tested with RHOSO.
- Requires vhost support to be implemented in RabbitMQ cluster topology operator.
- Why does the customer need this? (List the business requirements here)
- The lack of an HA solution for RabbitMQ is a regression from previous OSP releases. Without an HA solution, there is no full recovery in all corner cases, like network partitions, slow processes and unexpected terminations of peers.
- Without HA on the message bus, there can be situations where virtual machines cannot be created because messages cannot be delivered.
- List any affected packages or components.
- When the message bus becomes unavailable (in situations such as pods being moved between OpenShift worker instances), Nova can become disconnected from the message bus, resulting in errors and an inability to deploy virtual machines.
*Feature Request Overview*
What user goal or problem do you need to solve?
This feature request gives greenfield RHOSO deployments the ability to use quorum queues, providing a more resilient and fault-tolerant messaging backend for OpenStack services.
Historical context
In OSP17 and older versions, we set a runtime policy via Pacemaker resource agents to force all queues to be mirrored:
Resource: rabbitmq (class=ocf provider=heartbeat type=rabbitmq-cluster) Attributes: rabbitmq-instance_attributes set_policy='ha-all ^(?!amq\.).*
'
OpenStack services that consume the RabbitMQ cluster have their queues and exchanges mirrored without any explicit configuration in their respective configuration files.
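To illustrate which queues the `ha-all` pattern quoted above would select, here is a minimal sketch that tests names against `^(?!amq\.).*`. It uses Python's `re` module as a stand-in for RabbitMQ's own regex engine, which should behave the same way for this simple pattern. The function name and the queue names are hypothetical and chosen only for illustration.

```python
import re

# Pattern copied from the OSP17 policy quoted above: match every name
# that does NOT start with the reserved "amq." prefix.
HA_ALL_PATTERN = re.compile(r"^(?!amq\.).*")


def matches_ha_all(queue_name: str) -> bool:
    """Return True if the policy pattern would select this queue name."""
    return HA_ALL_PATTERN.match(queue_name) is not None


if __name__ == "__main__":
    # Hypothetical queue names, for illustration only.
    for name in ("conductor", "notifications.info", "amq.gen-abc123"):
        print(f"{name}: {matches_ha_all(name)}")
```

Because the negative lookahead only excludes the `amq.` prefix, every service queue is covered by the policy regardless of how the service itself declares it, which is why no per-service configuration was needed.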
RHOSO before FR2
In RHOSO18 we decided to stop setting this policy (true up to FR2, see below) so that each service could choose the most appropriate queue configuration.
Most services appear to have lifted and shifted their oslo configuration from OSP17, so they still operate under the assumption that a global RabbitMQ policy is in place, or that RabbitMQ provides a similar transparent configuration policy.
RHOSO FR2+
Because of customer pressure and the disruptive effect of a minor update on API availability, we had to find a solution to the lack of explicit queue configuration for each service that:
- Could be applied to existing environments
- Did not require downtime
- Would provide resiliency to updates and RabbitMQ pod disruptions
The only workaround that satisfied all three requirements was to apply the same policy as in OSP17.
Business justification
How would this feature benefit the customer?
Quorum queues offer a more resilient messaging backend. They improve how the cluster behaves under network partitions and failure scenarios, increasing the availability of the OpenStack APIs and of the RHOSO control plane as a whole.
Functional requirements
What do you want the result of this feature to be? Add as many requirements as needed.
- Implement quorum queues within the RabbitMQ deployment used for RHOSO. Mirrored HA queues are deprecated in RabbitMQ. They were temporarily reintroduced in FR2 as a workaround until a better solution is implemented.
- Code changes in infra-operator to enable quorum queues globally, and code changes in each service operator to allow quorum queues to be configured for each service.
- Enable the quorum queue configuration by default in all services/operators in FR4.
- Reviews of the operator code changes by all teams.
- Automated test cases covering the relevant scenarios.
- Performance and scale testing in HA scenarios.
- Documentation changes.
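As a rough illustration of what the per-service configuration could look like, the sketch below renders an oslo.messaging fragment that turns on the `rabbit_quorum_queue` option in the `[oslo_messaging_rabbit]` section. How infra-operator and the service operators actually generate or inject this setting is not described in this request, so the helper name and the rendering approach are assumptions made purely for illustration.

```python
import configparser
import io


def render_quorum_config(enable: bool = True) -> str:
    """Render a hypothetical oslo.messaging snippet toggling quorum queues.

    Assumption: the service reads `rabbit_quorum_queue` from the
    [oslo_messaging_rabbit] section of its configuration file.
    """
    cfg = configparser.ConfigParser()
    cfg["oslo_messaging_rabbit"] = {
        "rabbit_quorum_queue": str(enable).lower(),
    }
    buf = io.StringIO()
    cfg.write(buf)
    return buf.getvalue()


if __name__ == "__main__":
    print(render_quorum_config())
```

Unlike the global mirroring policy, a setting of this kind is applied by each service when it declares its queues, which is why the requirements call for changes in every service operator.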
Describe the customer impact
The customer reported encountering multiple RabbitMQ issues with the default setup.
Given that this is a greenfield cluster deployment of RHOSO (connected, then 4 air-gapped), the customer wants to use quorum queues in RHOSO from the beginning to avoid a later migration and any associated consequences.
The customer cited official RabbitMQ references that highlight the benefits of quorum queues over mirrored queues [1] [2]
[1]: https://www.rabbitmq.com/docs/3.13/migrate-mcq-to-qq
[2]: https://www.rabbitmq.com/blog/2023/03/02/quorum-queues-migration