  1. Red Hat OpenStack Services on OpenShift
  2. OSPRH-17131

FFU 16.2.6 to 17.1.5 fails with database connection error while running db_sync task


    • Bug
    • Resolution: Unresolved
    • Major
    • rhos-17.1.z
    • None
    • tripleo-ansible
    • None
    • False
    • False
    • ?
    • rhos-ops-day1day2-upgrades
    • None
    • Important

      During the `overcloud upgrade ...` step of the FFU process, the upgrade is interrupted when the task `Create containers managed by Podman` fails:

      2025-05-26 11:21:53.440264 | 566fdabb-0028-5c7e-4ada-000000065620 |      FATAL | Create containers managed by Podman for /var/lib/tripleo-config/container-startup-config/step_3 | overcloud-prod-ipz002-controller-0 | error={"changed": false, "msg": "Failed containers: neutron_db_sync"}

      Two containers failed. `neutron_db_sync` failed with this error:

      [root@overcloud-prod-ipz002-controller-0 ~]# podman logs neutron_db_sync  2>&1 | tail -10
        File "/usr/lib/python3.9/site-packages/pymysql/connections.py", line 646, in _read_packet
          packet_header = self._read_bytes(4)
        File "/usr/lib/python3.9/site-packages/pymysql/connections.py", line 698, in _read_bytes
          raise err.OperationalError(
      oslo_db.exception.DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query')
      [SQL: SELECT quotas.project_id AS quotas_project_id, quotas.resource AS quotas_resource
      FROM quotas GROUP BY quotas.project_id, quotas.resource
      HAVING count(*) > %(count_1)s]
      [parameters: {'count_1': 1}]
      (Background on this error at: http://sqlalche.me/e/13/e3q8) 

      and `placement_api_db_sync` with this error:

      [root@overcloud-prod-ipz002-controller-0 ~]# tail -5 /var/log/containers/stdouts/placement_api_db_sync.log.1
      2025-05-26T11:19:05.384464530+02:00 stderr F + echo 'Running command: '\''/usr/bin/bootstrap_host_exec placement su placement -s /bin/bash -c '\''/usr/bin/placement-manage db sync'\'''\'''
      2025-05-26T11:19:05.384474524+02:00 stdout F Running command: '/usr/bin/bootstrap_host_exec placement su placement -s /bin/bash -c '/usr/bin/placement-manage db sync''
      2025-05-26T11:19:05.384482210+02:00 stderr F + umask 0022
      2025-05-26T11:19:05.384513068+02:00 stderr F + exec /usr/bin/bootstrap_host_exec placement su placement -s /bin/bash -c ''\''/usr/bin/placement-manage' db 'sync'\'''
      2025-05-26T11:19:06.369097438+02:00 stderr F SQL connection failed. 10 attempts left. 
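
The timestamps quoted in this report can be lined up to see how closely the failures follow the Galera view change logged at 11:19:03 (see Additional context). A rough sketch, assuming all three logs use the same local time zone (only the container stdout log states `+02:00` explicitly); `parse_ts` is a hypothetical helper written for this note:

```python
from datetime import datetime


def parse_ts(text: str) -> datetime:
    """Parse 'YYYY-MM-DD HH:MM:SS' with optional fractional seconds.

    Assumption: every log is in the same local time zone, so any
    offset suffix is ignored. Fractions are cut to six digits because
    strptime's %f accepts at most microsecond precision; the cut also
    drops a trailing '+02:00'.
    """
    text = text.replace("T", " ")
    if "." in text:
        head, frac = text.split(".", 1)
        return datetime.strptime(f"{head}.{frac[:6]}", "%Y-%m-%d %H:%M:%S.%f")
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")


# Timestamps copied from the log excerpts in this report.
galera_view = parse_ts("2025-05-26 11:19:03")
placement_fail = parse_ts("2025-05-26T11:19:06.369097438+02:00")
ansible_fatal = parse_ts("2025-05-26 11:21:53.440264")

placement_delay = (placement_fail - galera_view).total_seconds()
ansible_delay = (ansible_fatal - galera_view).total_seconds()

print(f"placement 'SQL connection failed' after view change: {placement_delay:.3f}s")
print(f"ansible FATAL after view change: {ansible_delay:.3f}s")
```

Under that assumption, the first placement connection failure is logged a few seconds after the view change, and the Ansible task reports the failure a few minutes later.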

       

      Bug impact

      • The upgrade of this cluster has now completed, but two larger and more critical production clusters remain to be upgraded, and we do not know whether either of them will hit the same issue.

      Known workaround

      • No workaround; the root cause has not yet been identified.
      • So far the only option is to rerun the overcloud upgrade job.
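
Before rerunning the job on the remaining clusters, it may help to confirm that Galera reports a healthy state. This is not a confirmed workaround, only a minimal sketch: it parses the output of `SHOW GLOBAL STATUS LIKE 'wsrep_%'` (for example, collected with `mysql -N -B -e ...`) and lists anything that looks off. The function names, the sample text, and the expected cluster size of 3 (inferred from the three node IDs in the Galera log) are assumptions, not customer data.

```python
# Values a healthy Galera node is expected to report.
EXPECTED = {
    "wsrep_cluster_status": "Primary",
    "wsrep_local_state_comment": "Synced",
    "wsrep_ready": "ON",
}


def parse_wsrep_status(output: str) -> dict:
    """Turn 'name<whitespace>value' lines into a dict."""
    status = {}
    for line in output.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            status[parts[0]] = parts[1].strip()
    return status


def galera_problems(status: dict, expected_size: int = 3) -> list:
    """Return human-readable mismatches; an empty list means no problem found."""
    wanted = dict(EXPECTED, wsrep_cluster_size=str(expected_size))
    return [
        f"{name} is {status.get(name)!r}, expected {value!r}"
        for name, value in wanted.items()
        if status.get(name) != value
    ]


# Illustrative sample only: a primary component that has lost one member.
SAMPLE = (
    "wsrep_cluster_size\t2\n"
    "wsrep_cluster_status\tPrimary\n"
    "wsrep_local_state_comment\tSynced\n"
    "wsrep_ready\tON\n"
)

problems = galera_problems(parse_wsrep_status(SAMPLE))
for problem in problems:
    print(problem)
```

An empty result only means these four status variables look normal at that moment; it does not rule out a later partition during the run.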

      Additional context

      • The customer is running a THT hotfix for a different issue they hit earlier, related to the restart of the galera-bundle (RPM openstack-tripleo-heat-templates-14.3.1-17.1.20250424161055.e7c7ce3.el9osttrunk is in use)
        • That earlier issue appears to be handled correctly during the upgrade
      • At the time of the incident, the Galera log shows one node (63f3d387-8cab) partitioned from the primary component:
      2025-05-26 11:19:03 0 [Note] WSREP: declaring 6c4a23fa-8e87 at tcp://10.140.126.89:4567 stable
      2025-05-26 11:19:03 0 [Note] WSREP: forgetting 63f3d387-8cab (tcp://10.140.126.47:4567)
      2025-05-26 11:19:03 0 [Note] WSREP: Node 5fcaa8c5-922c state prim
      2025-05-26 11:19:03 0 [Note] WSREP: view(view_id(PRIM,5fcaa8c5-922c,4) memb {
              5fcaa8c5-922c,0
              6c4a23fa-8e87,0
      } joined {
      } left {
      } partitioned {
              63f3d387-8cab,0
      })
      2025-05-26 11:19:03 0 [Note] WSREP: save pc into disk 
      • The database upgrade and container restarts had already happened earlier in the process, about 90 minutes before the failure:
         2025-05-26 09:46:20,893 p=960437 u=stack n=ansible | 2025-05-26 09:46:20.892860 | 566fdabb-0028-f5d2-b867-000000007a62 |         OK | debug | overcloud-prod-ipz002-controller-0 | result={
            "changed": false,
            "msg": "MYSQL check -  VS 10.3.32-MariaDB - Upgrade needed: True"
         }
         2025-05-26 09:46:20,933 p=960437 u=stack n=ansible | 2025-05-26 09:46:20.932793 | 566fdabb-0028-f5d2-b867-000000007a63 |       TASK | Disable the galera cluster resource before container upgrade
      
      ...
      
         2025-05-26 09:47:52,771 p=960437 u=stack n=ansible | 2025-05-26 09:47:52.771092 | 566fdabb-0028-f5d2-b867-000000007a6d |    CHANGED | Enable the galera cluster resource | overcloud-prod-ipz002-controller-0
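
To look for the same partition pattern in the Galera logs of the other clusters, the WSREP view blocks can be extracted mechanically. A minimal sketch, assuming the log layout matches the excerpt quoted above; `parse_views` and the regular expressions are hypothetical helpers, not part of any Galera or TripleO tooling:

```python
import re

# "WSREP: view(view_id(PRIM,5fcaa8c5-922c,4) memb {"
VIEW_START = re.compile(r"WSREP: view\(view_id\((\w+),([0-9a-f-]+),(\d+)\) memb \{")
# "} joined {", "} left {", "} partitioned {"
SECTION = re.compile(r"\}\s*(\w+)\s*\{")
# "        5fcaa8c5-922c,0"
NODE = re.compile(r"^\s*([0-9a-f]+-[0-9a-f]+),\d+\s*$")


def parse_views(log_text: str) -> list:
    """Collect every WSREP view block as a dict of node-ID lists."""
    views = []
    current = None
    section = None
    for line in log_text.splitlines():
        start = VIEW_START.search(line)
        if start:
            current = {"type": start.group(1), "memb": [], "joined": [],
                       "left": [], "partitioned": []}
            section = "memb"
            continue
        if current is None:
            continue  # not inside a view block
        switch = SECTION.search(line)
        if switch:
            section = switch.group(1)
            continue
        if line.strip().startswith("})"):
            views.append(current)  # block closed
            current = None
            continue
        node = NODE.match(line)
        if node:
            current.setdefault(section, []).append(node.group(1))
    return views


# Excerpt copied from the Galera log in this report.
LOG = """\
2025-05-26 11:19:03 0 [Note] WSREP: forgetting 63f3d387-8cab (tcp://10.140.126.47:4567)
2025-05-26 11:19:03 0 [Note] WSREP: Node 5fcaa8c5-922c state prim
2025-05-26 11:19:03 0 [Note] WSREP: view(view_id(PRIM,5fcaa8c5-922c,4) memb {
        5fcaa8c5-922c,0
        6c4a23fa-8e87,0
} joined {
} left {
} partitioned {
        63f3d387-8cab,0
})
2025-05-26 11:19:03 0 [Note] WSREP: save pc into disk
"""

views = parse_views(LOG)
for view in views:
    if view["partitioned"]:
        print(f"{view['type']} view with {len(view['memb'])} members, "
              f"partitioned: {', '.join(view['partitioned'])}")
```

Any view that lists nodes under `partitioned` during the upgrade window would be worth correlating with the db_sync container logs.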

       

              jbadiapa@redhat.com Juan Payno
              rhn-support-enothen Eric Nothen
              rhos-dfg-upgrades