
Type: Bug
Resolution: Unresolved
Priority: Major
Fix Version/s: rhos-17.1.z
Affects Version/s: None
Component/s: tripleo-ansible
Labels: None
Blocked: False
Blocked Reason: None
Ready: False
Docs Approval: ?
AssignedTeam: rhos-ops-day1day2-upgrades
Regression: None
Severity: Important

During the `overcloud upgrade ...` step of the FFU process, the upgrade is interrupted when the task `Create containers managed by Podman` fails:

2025-05-26 11:21:53.440264 | 566fdabb-0028-5c7e-4ada-000000065620 |      FATAL | Create containers managed by Podman for /var/lib/tripleo-config/container-startup-config/step_3 | overcloud-prod-ipz002-controller-0 | error={"changed": false, "msg": "Failed containers: neutron_db_sync"}
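
For triage across several upgrade runs, the failed container names can be pulled out of a FATAL line like the one above. This is only a minimal sketch: the helper name `failed_containers` is hypothetical, and it assumes that the `error=` payload is valid JSON and that multiple names, if present, are comma-separated (the log above shows only a single name).

```python
import json
import re

# Matches the "Failed containers: ..." message seen in the FATAL line above.
FAILED_RE = re.compile(r"Failed containers:\s*(.+)")


def failed_containers(log_line: str) -> list:
    """Return container names from an Ansible FATAL line, or [] if none found."""
    _, sep, payload = log_line.partition("error=")
    if not sep:
        return []
    try:
        data = json.loads(payload)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, dict):
        return []
    match = FAILED_RE.search(str(data.get("msg", "")))
    if not match:
        return []
    # Assumption: several names would be separated by commas.
    return [name.strip() for name in match.group(1).split(",") if name.strip()]


line = (
    "2025-05-26 11:21:53.440264 | 566fdabb-0028-5c7e-4ada-000000065620 |      FATAL | "
    "Create containers managed by Podman for "
    "/var/lib/tripleo-config/container-startup-config/step_3 | "
    "overcloud-prod-ipz002-controller-0 | "
    'error={"changed": false, "msg": "Failed containers: neutron_db_sync"}'
)
print(failed_containers(line))
```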

Two containers failed. `neutron_db_sync` failed with this error:

[root@overcloud-prod-ipz002-controller-0 ~]# podman logs neutron_db_sync  2>&1 | tail -10
  File "/usr/lib/python3.9/site-packages/pymysql/connections.py", line 646, in _read_packet
    packet_header = self._read_bytes(4)
  File "/usr/lib/python3.9/site-packages/pymysql/connections.py", line 698, in _read_bytes
    raise err.OperationalError(
oslo_db.exception.DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query')
[SQL: SELECT quotas.project_id AS quotas_project_id, quotas.resource AS quotas_resource
FROM quotas GROUP BY quotas.project_id, quotas.resource
HAVING count(*) > %(count_1)s]
[parameters: {'count_1': 1}]
(Background on this error at: http://sqlalche.me/e/13/e3q8)

and `placement_api_db_sync` with this error:

[root@overcloud-prod-ipz002-controller-0 ~]# tail -5 /var/log/containers/stdouts/placement_api_db_sync.log.1
2025-05-26T11:19:05.384464530+02:00 stderr F + echo 'Running command: '\''/usr/bin/bootstrap_host_exec placement su placement -s /bin/bash -c '\''/usr/bin/placement-manage db sync'\'''\'''
2025-05-26T11:19:05.384474524+02:00 stdout F Running command: '/usr/bin/bootstrap_host_exec placement su placement -s /bin/bash -c '/usr/bin/placement-manage db sync''
2025-05-26T11:19:05.384482210+02:00 stderr F + umask 0022
2025-05-26T11:19:05.384513068+02:00 stderr F + exec /usr/bin/bootstrap_host_exec placement su placement -s /bin/bash -c ''\''/usr/bin/placement-manage' db 'sync'\'''
2025-05-26T11:19:06.369097438+02:00 stderr F SQL connection failed. 10 attempts left.

Bug impact

The upgrade of this cluster has since completed, but two bigger and more important production clusters remain to be upgraded, and we do not know whether either of them will hit the same issue.

Known workaround

No workaround is known; the root cause has not yet been identified.
So far the only option is to rerun the overcloud upgrade job.

Additional context

The customer is running a THT hotfix because of a different issue they hit earlier, related to the restart of the galera-bundle (RPM openstack-tripleo-heat-templates-14.3.1-17.1.20250424161055.e7c7ce3.el9osttrunk is in use).
- That issue appears to be handled correctly during the upgrade.

At the time of the incident, Galera appears to have been partitioned:

2025-05-26 11:19:03 0 [Note] WSREP: declaring 6c4a23fa-8e87 at tcp://10.140.126.89:4567 stable
2025-05-26 11:19:03 0 [Note] WSREP: forgetting 63f3d387-8cab (tcp://10.140.126.47:4567)
2025-05-26 11:19:03 0 [Note] WSREP: Node 5fcaa8c5-922c state prim
2025-05-26 11:19:03 0 [Note] WSREP: view(view_id(PRIM,5fcaa8c5-922c,4) memb {
        5fcaa8c5-922c,0
        6c4a23fa-8e87,0
} joined {
} left {
} partitioned {
        63f3d387-8cab,0
})
2025-05-26 11:19:03 0 [Note] WSREP: save pc into disk
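
To compare this view with views logged on the other controllers, the `memb`, `joined`, `left` and `partitioned` sections can be extracted mechanically. The following is a rough sketch based only on the layout of the block above; the function name `parse_view` is hypothetical, and the field after the comma in each entry is simply discarded.

```python
import re

# One "<section> { ... }" group of a WSREP view(...) block.
SECTION_RE = re.compile(r"(memb|joined|left|partitioned)\s*\{([^}]*)\}")


def parse_view(text: str) -> dict:
    """Map each section of a WSREP view block to the node IDs listed in it."""
    sections = {}
    for name, body in SECTION_RE.findall(text):
        # Entries look like "<node-id>,<n>"; keep only the node ID.
        sections[name] = [entry.split(",")[0] for entry in body.split()]
    return sections


view_log = """\
2025-05-26 11:19:03 0 [Note] WSREP: view(view_id(PRIM,5fcaa8c5-922c,4) memb {
        5fcaa8c5-922c,0
        6c4a23fa-8e87,0
} joined {
} left {
} partitioned {
        63f3d387-8cab,0
})
"""

view = parse_view(view_log)
print("members:", view["memb"])
print("partitioned:", view["partitioned"])
```

For the view above this reports two members and one partitioned node (`63f3d387-8cab`), which matches the `forgetting 63f3d387-8cab` line logged at the same second.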

The database upgrade and container restarts had happened earlier in the process:

   2025-05-26 09:46:20,893 p=960437 u=stack n=ansible | 2025-05-26 09:46:20.892860 | 566fdabb-0028-f5d2-b867-000000007a62 |         OK | debug | overcloud-prod-ipz002-controller-0 | result={
      "changed": false,
      "msg": "MYSQL check -  VS 10.3.32-MariaDB - Upgrade needed: True"
   }
   2025-05-26 09:46:20,933 p=960437 u=stack n=ansible | 2025-05-26 09:46:20.932793 | 566fdabb-0028-f5d2-b867-000000007a63 |       TASK | Disable the galera cluster resource before container upgrade

...

   2025-05-26 09:47:52,771 p=960437 u=stack n=ansible | 2025-05-26 09:47:52.771092 | 566fdabb-0028-f5d2-b867-000000007a6d |    CHANGED | Enable the galera cluster resource | overcloud-prod-ipz002-controller-0
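
To put the timeline in numbers, the gap between re-enabling the galera resource and the later view change can be computed from the timestamps quoted in this report. This is a small sketch that assumes all three timestamps are in the same local timezone (the container stdout log shows `+02:00`; the other logs carry no offset).

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Timestamps copied from the logs in this report (fractions of a second dropped).
galera_enabled = datetime.strptime("2025-05-26 09:47:52", FMT)  # "Enable the galera cluster resource"
view_change = datetime.strptime("2025-05-26 11:19:03", FMT)     # WSREP view with a partitioned node
task_failed = datetime.strptime("2025-05-26 11:21:53", FMT)     # FATAL on step_3 containers

gap = view_change - galera_enabled
print("galera enabled -> view change:", gap)                  # 1:31:11
print("view change -> task failure:", task_failed - view_change)  # 0:02:50
```

Under that assumption, the partition was logged roughly an hour and a half after the galera resource was re-enabled, and a little under three minutes before the step 3 task was reported as failed.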
