Type: Bug
Resolution: Unresolved
Priority: Blocker
Fix Version/s: rhos-18.0.16
Affects Version/s: rhos-18.0.15
Component/s: nova-operator
Labels:
None

Story Points:
0
Epic Link:
[BugEpic]: Reconciliation issue when deleting one rabbitmq-notifications-server pod
Blocked:
False
Blocked Reason:

None

Ready:
False
Docs Approval:
?
AssignedTeam:
rhos-workloads-compute
Regression:
None
Release Note Text:

`notificationsBusInstance` configuration for RabbitMqCluster CR causes service downtime::
Due to a bug in nova-operator, if `notificationsBusInstance` is configured in the `nova` section of the `OpenStackControlPlane` custom resource (CR) to point to a RabbitMqCluster CR, and a pod in that cluster is restarted, then nova-operator reconfigures all Compute services twice. This results in unnecessary service downtime.
+
*Temporary workaround:*
1. Remove the `notificationsBusInstance` configuration from the `OpenStackControlPlane` CR. Do not remove the notification RabbitMqCluster definition from the `OpenStackControlPlane` CR.
2. Gather the `transport_url` from the RabbitMqCluster:
+
----
$ oc get secret <name-rabbitmq-notifications>-default-user -o json | jq '.data | map_values(@base64d) | "rabbit://\(.username):\(.password)@\(.host):\(.port)/?ssl=1"'
----
+
* Replace `<name-rabbitmq-notifications>` with the name of the RabbitMqCluster for notifications.

3. For each `nova` service in the `OpenStackControlPlane` CR, add the following to the `customServiceConfig` field:
+
----
[oslo_messaging_notifications]
transport_url = <transport_url value from the previous step>
driver = messagingv2
----
+
----
[notifications]
notify_on_state_change = vm_and_task_state
notification_format = both
----
4. For each `nova` `OpenStackDataPlaneService` CR, add the above config snippet to the related nova extra config map and then create an `OpenStackDataPlaneDeployment` CR to apply the config changes on the data plane nodes.
+
[Note]
This makes the notification message bus configuration in nova static. If the RabbitMqCluster is changed in a way that affects the effective `transport_url` of the cluster, then you must perform the above nova configuration procedure again.
+
[Warning]
The `customServiceConfig` stores the configuration in plain text, and the `transport_url` contains the user and password of the RabbitMqCluster. Applying this workaround decreases the security of the notification rabbitmq cluster.
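
Steps 2 and 3 of the workaround can be sketched as two small shell helpers: one assembles the `transport_url` in the same shape as the `jq` expression above, and the other renders the two config sections for `customServiceConfig`. The function names `build_transport_url` and `render_notification_config` are hypothetical and exist only for illustration. The sketch assumes the secret fields have already been base64-decoded and that the password contains no characters that need URL-encoding.

```shell
# Sketch only: these helper names are hypothetical and are not part of
# nova-operator, oc, or any other tool.

# Build the transport_url from already-decoded secret fields.
# $1: username, $2: password, $3: host, $4: port
build_transport_url() {
  printf 'rabbit://%s:%s@%s:%s/?ssl=1\n' "$1" "$2" "$3" "$4"
}

# Render the nova notification config snippet for a given transport_url.
# $1: transport_url
render_notification_config() {
  cat <<EOF
[oslo_messaging_notifications]
transport_url = $1
driver = messagingv2

[notifications]
notify_on_state_change = vm_and_task_state
notification_format = both
EOF
}
```

For example, `render_notification_config "$(build_transport_url "$user" "$password" "$host" "$port")"` prints the snippet to paste into each `nova` service. The output contains the RabbitMQ password in plain text, so the warning above applies to it as well.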

Release Note Type:
Known Issue
Release Note Status:
Done
Intelligence Requested:
Market:
PX Impact Score:

Severity:
Important

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

We are experiencing reconciliation issues with all Nova API, scheduler, and conductor pods whenever any pod in the RabbitMQ instance used for the Nova notification server fails or restarts. This behavior causes the Nova services to become unstable. This is a critical issue because the failure or restart of a single rabbitmq_notification_server pod directly impacts the Nova service.

The customer can reproduce the issue on both of their clusters.

To reproduce, they just delete the rabbit_notification_server pod.

"""
We have 2 environments with notificationsBusInstance configured in the nova service. In both, when any of the rabbitmq pods from the notification cluster is deleted, it causes a redeployment of Nova. It does not happen for cell/global rabbitmq cluster pods.

To answer below:

lastTransitionTime: "2026-01-26T14:49:21Z"
message: OpenStackControlPlane Nova completed
reason: Ready
status: "True"
type: OpenStackControlPlaneNovaReady

This transition happened after the rabbitmq pod deletion.
"""
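
The reproduction described above (delete one notification RabbitMQ pod, then check whether the `OpenStackControlPlaneNovaReady` condition transitioned) could be scripted roughly as follows. This is a sketch under assumptions: the `openstack` namespace, the `controlplane` resource name, the 120-second default wait, and both function names are placeholders that do not come from this report.

```shell
# Sketch only: NAMESPACE and CONTROLPLANE defaults are assumed placeholders.
NAMESPACE="${NAMESPACE:-openstack}"
CONTROLPLANE="${CONTROLPLANE:-controlplane}"

# Print the lastTransitionTime of the OpenStackControlPlaneNovaReady condition.
nova_ready_transition() {
  oc -n "$NAMESPACE" get openstackcontrolplane "$CONTROLPLANE" \
    -o jsonpath='{.status.conditions[?(@.type=="OpenStackControlPlaneNovaReady")].lastTransitionTime}'
}

# Delete one pod, wait, and report whether the Nova condition transitioned.
# $1: pod name, $2: seconds to wait before re-checking (default 120)
check_nova_redeploy() {
  local pod="$1" wait_s="${2:-120}" before after
  before="$(nova_ready_transition)"
  oc -n "$NAMESPACE" delete pod "$pod"
  sleep "$wait_s"
  after="$(nova_ready_transition)"
  if [ "$before" != "$after" ]; then
    echo "Nova condition transitioned: $before -> $after"
  else
    echo "Nova condition unchanged: $before"
  fi
}
```

Calling `check_nova_redeploy <notification-rabbitmq-pod-name>` once with a notification cluster pod and once with a cell or global RabbitMQ pod would show the difference in behavior the customer reports. A fixed wait is a simplification; the right interval depends on how long reconciliation takes in the environment.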

links to

openstack-k8s-operators/nova-operator#1082: Return early when notification TransportURL is not ready

Assignee: Unassigned

Reporter: Eduard Barrera Casas

Team: rhos-workloads-compute

Votes: 0

Watchers: 7

Created: 2026/02/06 11:35 AM

Updated: 2026/03/10 3:13 PM
