  Red Hat OpenStack Services on OpenShift
  OSPRH-14266

OSP minor update to 17.1.4 can fail on controller for neutron_db_sync and glance_api_db_sync containers


    • openstack-tripleo-heat-templates-14.3.1-17.1.20250403151008.e7c7ce3.el9ost
    • Rejected
    • RHOS Upgrades 2025 Sprint 3
    • Important

      To Reproduce
      Steps to reproduce the behavior:
      During a minor update, while the controller nodes are being updated, the neutron_db_sync and glance_api_db_sync containers can fail and exit with status code 1, causing the update to fail.

      What is happening is that during the run, traffic to port 3306 on controller-0 is blocked for exactly 20 minutes:
      2025-02-02 16:11:41,839 p=18683 u=stack n=ansible | 2025-02-02 16:11:41.839730 | 52540078-3890-d251-7060-0000000000f4 | TASK | Block local INPUT SYN packets
      2025-02-02 16:11:42,067 p=18683 u=stack n=ansible | 2025-02-02 16:11:42.066659 | 52540078-3890-d251-7060-0000000000f4 | CHANGED | Block local INPUT SYN packets | overcloud-controller-0
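
      On controller-0 it is possible to verify that the block is still active while the db_sync containers are waiting; the exact rule is not shown in the log, so the filters below are an assumption based on the "Block local INPUT SYN packets" task:

      # check for a 3306 block in iptables and in the nft ruleset (one of the two will apply)
      sudo iptables -n -L INPUT | grep 3306
      sudo nft list ruleset | grep 3306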

      Eight minutes later the update tries to paunch the containers:
      2025-02-02 16:19:51,005 p=18683 u=stack n=ansible | 2025-02-02 16:19:51.005146 | 52540078-3890-d251-7060-000000003708 | TASK | Create containers managed by Podman for /var/lib/tripleo-config/container-startup-config/step_3

      But because the nft rule is still in effect, the containers cannot reach MySQL.
      Ten minutes later the containers are killed, giving:
      2025-02-02 16:31:57,105 p=18683 u=stack n=ansible | 2025-02-02 16:31:57.104811 | | WARNING | ERROR: Container glance_api_db_sync exited with code 1 when runed
      2025-02-02 16:31:57,105 p=18683 u=stack n=ansible | 2025-02-02 16:31:57.105143 | | WARNING | ERROR: Container neutron_db_sync exited with code 1 when runed
      pymysql.err.OperationalError: (2013, 'Lost connection to MySQL server during query')
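
      To confirm which containers failed and see the full traceback, the exited containers can be inspected directly on the controller (container names taken from the errors above):

      # list the exited db_sync containers and read their logs
      sudo podman ps -a --filter name=db_sync
      sudo podman logs glance_api_db_sync
      sudo podman logs neutron_db_sync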

      In my lab I could not reproduce the issue: the environment is slow, so the containers were paunched 20 minutes after the rule was set; by then the block had expired, so the containers could wait, retry and finally finish their job.

      We discovered that HAProxy is not switching to the other two nodes; it keeps trying to connect only to controller-0.

      We need to understand why HAProxy does not fail over here and what can be done to resolve this part of the update.
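
      As a starting point for that investigation, the mysql backend that HAProxy uses and the pacemaker view of the bundles can be checked on the controller; the config path and bundle names below are the usual TripleO locations and are assumptions here:

      # mysql listen block in the TripleO-generated haproxy config
      sudo grep -A 15 'listen mysql' /var/lib/config-data/puppet-generated/haproxy/etc/haproxy/haproxy.cfg
      # pacemaker view of the haproxy and galera bundles
      sudo pcs status | grep -A 5 -E 'haproxy-bundle|galera-bundle'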

      Expected behavior

      • neutron_db_sync and glance_api_db_sync should run without delay and finish successfully.

      Bug impact

      • The minor update fails.

      Known workaround

      • Killing the galera container on controller-0 during that step allows HAProxy to connect to the other nodes, and the update then continues successfully (see the sketch below).
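
      A minimal sketch of that workaround, assuming the galera container on controller-0 is the usual pacemaker bundle replica (the exact name may differ per environment):

      # on overcloud-controller-0, while the db_sync containers are stuck:
      sudo podman ps --filter name=galera        # find the exact galera container name
      sudo podman kill galera-bundle-podman-0    # name is an assumption; use the one listed above
      # pacemaker restarts the bundle and haproxy can then reach galera on the other controllers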

              rhn-engineering-lbezdick Lukas Bezdicka
              rhn-support-ggrimaux Gregoire Grimaux
              rhos-dfg-upgrades