Issue Type: Bug
Resolution: Unresolved
Priority: Major
Labels: rhos-ops-day1day2-upgrades
Severity: Important
During the `overcloud upgrade ...` step of the FFU process, the upgrade is interrupted when the task `Create containers managed by Podman` fails:
2025-05-26 11:21:53.440264 | 566fdabb-0028-5c7e-4ada-000000065620 | FATAL | Create containers managed by Podman for /var/lib/tripleo-config/container-startup-config/step_3 | overcloud-prod-ipz002-controller-0 | error={"changed": false, "msg": "Failed containers: neutron_db_sync"}
Two containers failed: `neutron_db_sync`, with this error:
[root@overcloud-prod-ipz002-controller-0 ~]# podman logs neutron_db_sync 2>&1 | tail -10
  File "/usr/lib/python3.9/site-packages/pymysql/connections.py", line 646, in _read_packet
    packet_header = self._read_bytes(4)
  File "/usr/lib/python3.9/site-packages/pymysql/connections.py", line 698, in _read_bytes
    raise err.OperationalError(
oslo_db.exception.DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query')
[SQL: SELECT quotas.project_id AS quotas_project_id, quotas.resource AS quotas_resource FROM quotas GROUP BY quotas.project_id, quotas.resource HAVING count(*) > %(count_1)s]
[parameters: {'count_1': 1}]
(Background on this error at: http://sqlalche.me/e/13/e3q8)
and `placement_api_db_sync` with this error:
[root@overcloud-prod-ipz002-controller-0 ~]# tail -5 /var/log/containers/stdouts/placement_api_db_sync.log.1
2025-05-26T11:19:05.384464530+02:00 stderr F + echo 'Running command: '\''/usr/bin/bootstrap_host_exec placement su placement -s /bin/bash -c '\''/usr/bin/placement-manage db sync'\'''\'''
2025-05-26T11:19:05.384474524+02:00 stdout F Running command: '/usr/bin/bootstrap_host_exec placement su placement -s /bin/bash -c '/usr/bin/placement-manage db sync''
2025-05-26T11:19:05.384482210+02:00 stderr F + umask 0022
2025-05-26T11:19:05.384513068+02:00 stderr F + exec /usr/bin/bootstrap_host_exec placement su placement -s /bin/bash -c ''\''/usr/bin/placement-manage' db 'sync'\'''
2025-05-26T11:19:06.369097438+02:00 stderr F SQL connection failed. 10 attempts left.
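For reference, a minimal diagnostic sketch for inspecting the failed containers on the affected controller; the container names are taken from the failure above, and the commands are generic podman usage rather than part of the original report:

```bash
# List the step-3 *_db_sync containers and their current state.
sudo podman ps -a --filter 'name=db_sync' --format '{{.Names}} {{.Status}}'

# Exit code and last log lines of the two containers that failed.
for c in neutron_db_sync placement_api_db_sync; do
    echo "== $c =="
    sudo podman inspect "$c" --format 'exit code: {{.State.ExitCode}}'
    sudo podman logs --tail 20 "$c"
done
```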
Bug impact
- While the upgrade of this cluster has since completed, two remaining production clusters (larger and more business-critical) still have to be upgraded, and we do not know whether either of them will hit the same issue.
Known workaround
- No workaround; the root cause has not yet been identified.
- So far the only option has been to rerun the overcloud upgrade job (see the sketch below).
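A rough sketch of what such a rerun looks like from the undercloud; the stack name and `--limit` value are illustrative assumptions, not the customer's actual command line, and the exact invocation depends on the release and environment files in use:

```bash
# Hypothetical rerun of the failed overcloud upgrade step (illustrative only).
source ~/stackrc
openstack overcloud upgrade run \
    --stack overcloud \
    --limit overcloud-prod-ipz002-controller-0
```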
Additional context
- The customer is running a hotfixed THT build because of a different issue hit earlier, related to the restart of the galera-bundle (RPM openstack-tripleo-heat-templates-14.3.1-17.1.20250424161055.e7c7ce3.el9osttrunk is in use).
- That earlier issue appears to be handled correctly during this upgrade.
- At the time of the incident, Galera appears to have been partitioned (a way to check this on the remaining clusters is sketched after this list):
2025-05-26 11:19:03 0 [Note] WSREP: declaring 6c4a23fa-8e87 at tcp://10.140.126.89:4567 stable
2025-05-26 11:19:03 0 [Note] WSREP: forgetting 63f3d387-8cab (tcp://10.140.126.47:4567)
2025-05-26 11:19:03 0 [Note] WSREP: Node 5fcaa8c5-922c state prim
2025-05-26 11:19:03 0 [Note] WSREP: view(view_id(PRIM,5fcaa8c5-922c,4)
memb {
    5fcaa8c5-922c,0
    6c4a23fa-8e87,0
}
joined {
}
left {
}
partitioned {
    63f3d387-8cab,0
}
)
2025-05-26 11:19:03 0 [Note] WSREP: save pc into disk
- The database upgrade and the container restarts had happened earlier in the process (a timeline check is sketched after this list):
2025-05-26 09:46:20,893 p=960437 u=stack n=ansible | 2025-05-26 09:46:20.892860 | 566fdabb-0028-f5d2-b867-000000007a62 | OK | debug | overcloud-prod-ipz002-controller-0 | result={
    "changed": false,
    "msg": "MYSQL check - VS 10.3.32-MariaDB - Upgrade needed: True"
}
2025-05-26 09:46:20,933 p=960437 u=stack n=ansible | 2025-05-26 09:46:20.932793 | 566fdabb-0028-f5d2-b867-000000007a63 | TASK | Disable the galera cluster resource before container upgrade
...
2025-05-26 09:47:52,771 p=960437 u=stack n=ansible | 2025-05-26 09:47:52.771092 | 566fdabb-0028-f5d2-b867-000000007a6d | CHANGED | Enable the galera cluster resource | overcloud-prod-ipz002-controller-0
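A minimal sketch for checking the Galera/WSREP state on a controller during a similar failure. The container name pattern and the mysqld log path are the usual TripleO defaults and may differ in this environment; the mysql client may also need the root credentials normally kept in /root/.my.cnf on the host:

```bash
# Pick the running galera bundle container on this controller.
GALERA=$(sudo podman ps --format '{{.Names}}' | grep -m1 galera-bundle)

# Cluster view from this node: a healthy 3-node cluster reports
# wsrep_cluster_status=Primary and wsrep_cluster_size=3.
sudo podman exec "$GALERA" mysql -e \
  "SHOW GLOBAL STATUS WHERE Variable_name IN
   ('wsrep_cluster_size','wsrep_cluster_status','wsrep_local_state_comment');"

# Membership changes (declaring/forgetting/partitioned) around the failure window.
sudo grep -E 'WSREP: (declaring|forgetting|partitioned|New COMPONENT)' \
  /var/log/containers/mysql/mysqld.log | tail -n 20
```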
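To confirm the ordering above (Galera resource disabled/enabled well before the db_sync failures), the relevant events can be pulled out of the undercloud upgrade log into one chronological view; the log path below is a placeholder, since its location varies by release and deployment method:

```bash
# ANSIBLE_LOG is a placeholder; point it at the ansible log of the failed
# `overcloud upgrade` run on the undercloud.
ANSIBLE_LOG=/path/to/ansible.log

# Galera disable/enable tasks and the container-creation failure, in log order.
grep -E 'galera cluster resource|Create containers managed by Podman|Failed containers' \
  "$ANSIBLE_LOG" | head -n 40
```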