Type: Bug
Resolution: Done
Priority: Major
Fix Version/s: rhos-18.0.14 FR 4
Affects Version/s: 18.0.z
Component/s: infra-operator
Labels:
None

Story Points:
3
Blocked:
False
Blocked Reason:

None
Ready:
False
Docs Approval:
?
Fixed in Build:
infra-operator-container-1.0.16
AssignedTeam:
rhos-ops-platform-services-pidone
Regression:
None
Intelligence Requested:
Market:
PX Impact Range:
PX Impact Score:
PX Priority Data:
PX Review Complete:
PX Technical Impact:

Sprint:
Sprint 6, Sprint 7
sprint_count:
2
Severity:
Critical

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

To Reproduce

Steps to reproduce the behavior:

Deploy RHOSO 18.0

Confirm that RabbitMQ has TLS 1.3 enabled alongside TLS 1.2, so connections use TLS 1.3 rather than TLS 1.2:

# oc get RabbitMQCluster rabbitmq-cell1 -o yaml | grep tlsv1.3
        {versions, ['tlsv1.2','tlsv1.3']}
        {versions, ['tlsv1.2','tlsv1.3']}
      {versions, ['tlsv1.2','tlsv1.3']}

Wait for a while

A RabbitMQ network partition occurs after TLS key update events:

[error] <x.xxxxx.x>     event: TLS sender received unexpected event
[error] <x.xxxxx.x>     reason: [{type,internal},{message,{key_update,{<x.xxxxx.x>,undefined}}}]
  :
[error] <x.xxxxx.x> Partial partition detected:
[error] <x.xxxxx.x>  * We saw DOWN from rabbit@rabbitmq-cell1-server-0.rabbitmq-cell1-nodes.openstack
[error] <x.xxxxx.x>  * We can still see rabbit@rabbitmq-cell1-server-2.rabbitmq-cell1-nodes.openstack which can see rabbit@rabbitmq-cell1-server-0.rabbitmq-cell1-nodes.openstack
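
To check whether a given log shows the same pattern, the two signatures above can be counted with a small helper. This is only a sketch: the function name `scan_rabbit_log` is hypothetical, and it assumes the log text contains the literal strings `key_update` and `Partial partition detected` as in the excerpt.

```shell
# Count TLS key_update errors and partial-partition reports in RabbitMQ log text.
# Reads the files given as arguments, or stdin when no file is given.
# The function name is hypothetical; the patterns come from the excerpt above.
scan_rabbit_log() {
    awk '
        /key_update/                 { k++ }
        /Partial partition detected/ { p++ }
        END { printf "key_update_errors=%d partial_partitions=%d\n", k, p }
    ' "$@"
}
```

For example, `oc logs rabbitmq-cell1-server-0 | scan_rabbit_log` would summarize one pod, assuming the pod name matches the node name shown in the log.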

Expected behavior

RabbitMQ is stable
RabbitMQ uses TLS 1.2, not TLS 1.3

Bug impact

The RabbitMQ cluster experiences network partitions at random, which makes the RHOSO control plane unusable.
RabbitMQ does not recover automatically; we have to delete and recreate the RabbitMQ pods manually.
Even after recovery, the issue recurs a few days later.

Known workaround

Delete and recreate the RabbitMQ pods manually after the issue occurs.
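
A minimal sketch of that manual step, under stated assumptions: the pods are assumed to carry the label `app.kubernetes.io/name=<cluster name>` and to live in the `openstack` namespace (inferred from the node names in the log), and the pods are assumed to be recreated by their controller after deletion. The function name `restart_rabbitmq_pods` and the `DRY_RUN` variable are hypothetical.

```shell
# Delete all pods of one RabbitMQ cluster so that they are recreated.
# Assumptions (not confirmed by this ticket): the label selector and the
# default namespace below. Set DRY_RUN=1 to print the command instead.
restart_rabbitmq_pods() {
    cluster="$1"
    namespace="${2:-openstack}"
    if [ -z "$cluster" ]; then
        echo "usage: restart_rabbitmq_pods <cluster> [namespace]" >&2
        return 2
    fi
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "oc delete pod -n $namespace -l app.kubernetes.io/name=$cluster"
    else
        oc delete pod -n "$namespace" -l "app.kubernetes.io/name=$cluster"
    fi
}
```

Running `DRY_RUN=1 restart_rabbitmq_pods rabbitmq-cell1` first shows the exact command before anything is deleted.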

Additional context

In RHOSP 17.1, we tracked the issue in the following tickets. Because the issue doesn't occur with TLS 1.2, we changed TripleO to use TLS 1.2 instead of TLS 1.3:
However, this change has not been implemented in the RHOSO operators, so RabbitMQ uses TLS 1.3 and the issue occurs.
I found that the RabbitMQCluster resources and some ConfigMap resources contain the TLS version setting, but I'm not sure whether we can modify it manually, because I think they're managed by the operators.
We'd like to have a workaround to avoid the issue as soon as possible.