Bug
Resolution: Unresolved
18.0.z
Sprint 6, Sprint 7
Critical
To Reproduce
Steps to reproduce the behavior:
- Deploy RHOSO 18.0
- Confirm that RabbitMQ uses TLS 1.3, not TLS 1.2 (the configuration enables both versions):
# oc get RabbitMQCluster rabbitmq-cell1 -o yaml |grep tlsv1.3
{versions, ['tlsv1.2','tlsv1.3']}
{versions, ['tlsv1.2','tlsv1.3']}
{versions, ['tlsv1.2','tlsv1.3']}
- Wait for a while
- A RabbitMQ network partition occurs after TLS key update events:
[error] <x.xxxxx.x> event: TLS sender received unexpected event
[error] <x.xxxxx.x> reason: [{type,internal},{message,{key_update,{<x.xxxxx.x>,undefined}}}]
:
[error] <x.xxxxx.x> Partial partition detected:
[error] <x.xxxxx.x> * We saw DOWN from rabbit@rabbitmq-cell1-server-0.rabbitmq-cell1-nodes.openstack
[error] <x.xxxxx.x> * We can still see rabbit@rabbitmq-cell1-server-2.rabbitmq-cell1-nodes.openstack which can see rabbit@rabbitmq-cell1-server-0.rabbitmq-cell1-nodes.openstack
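To check whether a node has hit the same sequence, the two log signatures quoted above (the key_update error and "Partial partition detected") can be counted with a small filter. This is only a sketch: the function name scan_rabbitmq_log is hypothetical, and it assumes the RabbitMQ log text is available on standard input, for example from the pod logs.

```shell
# Hypothetical helper: count the two error signatures from this report
# in RabbitMQ log text read from standard input.
scan_rabbitmq_log() {
    awk '
        /key_update/                 { key_updates++ }   # TLS key update error lines
        /Partial partition detected/ { partitions++ }    # partition detection lines
        END {
            printf "key_update_errors=%d partial_partitions=%d\n", key_updates, partitions
        }
    '
}
```

Assuming the pods live in the openstack namespace (inferred from the node names in the log, not confirmed), it could be used as `oc logs rabbitmq-cell1-server-0 -n openstack | scan_rabbitmq_log`. A non-zero partial_partitions count alongside key_update errors would match the pattern described here.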
Expected behavior
- RabbitMQ is stable
- RabbitMQ uses TLS 1.2, not TLS 1.3
Bug impact
- The RabbitMQ cluster randomly experiences network partitions, which makes the RHOSO control plane unusable.
- RabbitMQ does not recover automatically; it stays partitioned until we delete and recreate the RabbitMQ pods manually.
Even if we recover it, the issue recurs a few days later.
Known workaround
- Delete and recreate the RabbitMQ pods manually after the issue occurs
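A minimal sketch of that manual workaround follows. The function name restart_rabbitmq_pods is hypothetical, and the namespace (openstack), the pod naming pattern (<cluster>-server-<n>) and the replica count (3) are assumptions inferred from the node names in the log above; adjust them to the actual deployment. Recreation of the deleted pods is left to the controller that owns them.

```shell
# Hypothetical helper: delete all server pods of one RabbitMQ cluster so that
# they are recreated. Arguments: namespace, cluster name, replica count.
restart_rabbitmq_pods() {
    namespace=$1
    cluster=$2
    replicas=$3
    pods=""
    i=0
    # Build the pod names <cluster>-server-0 .. <cluster>-server-(replicas-1).
    while [ "$i" -lt "$replicas" ]; do
        pods="$pods ${cluster}-server-${i}"
        i=$((i + 1))
    done
    # $pods is intentionally unquoted so that each pod name is a separate argument.
    oc delete pod -n "$namespace" $pods
}
```

Under those assumptions the call would be `restart_rabbitmq_pods openstack rabbitmq-cell1 3`. This only clears the current partition; as noted in the bug impact, the issue recurs a few days later.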
Additional context
- In RHOSP 17.1, we tracked the issue in the following tickets. As the issue doesn't occur with TLS 1.2, we made some changes in TripleO to use TLS 1.2 instead of TLS 1.3:
- However, this change is not implemented in the RHOSO operators, so RabbitMQ uses TLS 1.3 and the issue occurs.
- I found that the RabbitMQCluster resources and some ConfigMap resources contain the TLS version setting, but I'm not sure whether we can modify it manually, because I think they're managed by operators.
- We'd like to have a workaround to avoid the issue as soon as possible.
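To see where the TLS version setting currently appears, a read-only filter like the one below can summarise every `{versions, [...]}` entry found in resource YAML. It is a sketch only: the function name tls_versions_summary is hypothetical, it changes nothing in the cluster, and it assumes the setting always appears in the Erlang term form shown in the reproduction steps.

```shell
# Hypothetical helper: read YAML (or any text) on standard input and print each
# distinct {versions, [...]} entry preceded by the number of times it occurs.
tls_versions_summary() {
    grep -o "{versions, *\[[^]]*]}" |
        awk '{ seen[$0]++ } END { for (v in seen) print seen[v], v }' |
        sort
}
```

Assuming the openstack namespace, it could be fed from both kinds of resources mentioned above, for example `oc get RabbitMQCluster rabbitmq-cell1 -n openstack -o yaml | tls_versions_summary` and `oc get configmap -n openstack -o yaml | tls_versions_summary`. An entry of the form `{versions, ['tlsv1.2']}` would presumably correspond to the expected TLS 1.2-only behavior, but whether a manual change to these operator-managed resources would persist is exactly the open question in this report.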