  Red Hat OpenStack Services on OpenShift
  OSPRH-14266

OSP minor update to 17.1.4 can fail on controller for neutron_db_sync and glance_api_db_sync containers


    • openstack-tripleo-heat-templates-14.3.1-17.1.20250403151008.e7c7ce3.el9ost
    • Rejected
    • RHOS Upgrades 2025 Sprint 3
    • Important

      To Reproduce
      Steps to reproduce the behavior:
      During a minor update, while the controller nodes are being updated, the neutron_db_sync and glance_api_db_sync containers can fail and exit with status code 1, causing the update to fail.

      What is happening is that during the run, traffic to port 3306 on controller-0 is blocked for exactly 20 minutes:
      2025-02-02 16:11:41,839 p=18683 u=stack n=ansible | 2025-02-02 16:11:41.839730 | 52540078-3890-d251-7060-0000000000f4 | TASK | Block local INPUT SYN packets
      2025-02-02 16:11:42,067 p=18683 u=stack n=ansible | 2025-02-02 16:11:42.066659 | 52540078-3890-d251-7060-0000000000f4 | CHANGED | Block local INPUT SYN packets | overcloud-controller-0
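
      On controller-0 it is possible to verify that the block is still active while the db_sync containers are waiting; the exact rule is not shown in the log, so the filters below are an assumption based on the "Block local INPUT SYN packets" task:

      # check for a 3306 block in iptables and in the nft ruleset (one of the two will apply)
      sudo iptables -n -L INPUT | grep 3306
      sudo nft list ruleset | grep 3306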

      Eight minutes later the update tries to paunch the containers:
      2025-02-02 16:19:51,005 p=18683 u=stack n=ansible | 2025-02-02 16:19:51.005146 | 52540078-3890-d251-7060-000000003708 | TASK | Create containers managed by Podman for /var/lib/tripleo-config/container-startup-config/step_3

      But because the nft rule is still in effect, the containers cannot reach MySQL.
      Ten minutes later the containers are killed, giving:
      2025-02-02 16:31:57,105 p=18683 u=stack n=ansible | 2025-02-02 16:31:57.104811 | | WARNING | ERROR: Container glance_api_db_sync exited with code 1 when runed
      2025-02-02 16:31:57,105 p=18683 u=stack n=ansible | 2025-02-02 16:31:57.105143 | | WARNING | ERROR: Container neutron_db_sync exited with code 1 when runed
      pymysql.err.OperationalError: (2013, 'Lost connection to MySQL server during query')
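
      To confirm which containers failed and see the full traceback, the exited containers can be inspected directly on the controller (container names taken from the errors above):

      # list the exited db_sync containers and read their logs
      sudo podman ps -a --filter name=db_sync
      sudo podman logs glance_api_db_sync
      sudo podman logs neutron_db_sync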

      In my lab I could not reproduce the issue: the environment is slow, so the containers were paunched 20 minutes after the rule was set; by then the block had expired, so the containers could wait, retry and finally finish their job.

      We discovered that HAProxy is not switching to the other two nodes; it keeps trying to connect only to controller-0.

      We need to understand why HAProxy does not fail over here and what can be done to resolve this part of the update.
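
      As a starting point for that investigation, the mysql backend that HAProxy uses and the pacemaker view of the bundles can be checked on the controller; the config path and bundle names below are the usual TripleO locations and are assumptions here:

      # mysql listen block in the TripleO-generated haproxy config
      sudo grep -A 15 'listen mysql' /var/lib/config-data/puppet-generated/haproxy/etc/haproxy/haproxy.cfg
      # pacemaker view of the haproxy and galera bundles
      sudo pcs status | grep -A 5 -E 'haproxy-bundle|galera-bundle'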

      Expected behavior

      • neutron_db_sync and glance_api_db_sync should run without delay and finish successfully.

      Bug impact

      • The minor update fails.

      Known workaround

      • Killing the galera container on controller-0 during that step allows HAProxy to connect to the other nodes, and the update then continues successfully (see the sketch below).
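
      A minimal sketch of that workaround, assuming the galera container on controller-0 is the usual pacemaker bundle replica (the exact name may differ per environment):

      # on overcloud-controller-0, while the db_sync containers are stuck:
      sudo podman ps --filter name=galera        # find the exact galera container name
      sudo podman kill galera-bundle-podman-0    # name is an assumption; use the one listed above
      # pacemaker restarts the bundle and haproxy can then reach galera on the other controllers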

              rhn-engineering-lbezdick Lukas Bezdicka
              rhn-support-ggrimaux Gregoire Grimaux
              rhos-dfg-upgrades