Red Hat OpenStack Services on OpenShift / OSPRH-11282

ovn-metadata-agent not recovered after compute rebooted


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Normal
    • rhos-18.0.4
    • rhos-18.0 Feature Release 1 (Nov 2024)
    • Component: openstack-neutron
    • Currently CI failing because of OSPRH-11316
    • Severity: Important

      Description of problem:
      (Related to https://bugs.launchpad.net/neutron/+bug/2086740)

      (To be confirmed!) This bug could be a regression from https://bugzilla.redhat.com/show_bug.cgi?id=2236159
      The tobiko test test_soft_reboot_computes_recovery failed:
      http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-network-ovn-bgp-agent-17.1_director-rhel-virthost-basic_topology-ipv4-geneve-composable/38/test_results/tobiko_faults_1/tobiko_faults_1_03_faults_faults.html?sort=time
      It failed because the ovn-metadata-agent from cmp-2-0 didn't recover after that compute was rebooted.
      The test soft reboots one compute, waits until it's up again (checks uptime), then reboots the next one, etc.
      When all computes have been rebooted, the test runs the list of tobiko health checks, which includes verifying the neutron agents through the neutron API. This check fails because the ovn-metadata-agent on cmp-2-0 is not "alive"; the test retries for a long time and that agent never recovers.
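
      For reference, the agent check the test performs boils down to polling the neutron agent list until every agent reports alive. A minimal sketch of such a check, assuming openstacksdk and a clouds.yaml entry named "overcloud" (the cloud name, timeout and interval are placeholders, not values taken from the tobiko test):

      # Minimal sketch: poll the neutron API until all network agents report
      # alive, similar to the tobiko health check described above. Assumes
      # openstacksdk and a clouds.yaml entry named "overcloud" (placeholder).
      import time

      import openstack


      def wait_for_agents_alive(cloud_name='overcloud', timeout=600, interval=10):
          conn = openstack.connect(cloud=cloud_name)
          deadline = time.time() + timeout
          while True:
              dead = [a for a in conn.network.agents() if not a.is_alive]
              if not dead:
                  return
              if time.time() > deadline:
                  raise TimeoutError('agents still not alive: %s' %
                                     [(a.binary, a.host) for a in dead])
              time.sleep(interval)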

      I have compared the ovn-metadata-agent logs from cmp-2-0 (which fails) and cmp-2-1 (which recovers).

      1) cmp-2-0
      After that ChassisPrivateCreateEvent, the agent appears to get stuck waiting on a lock forever; from that point on, only the periodic ProcessMonitor._check_child_processes lock messages show up in the log:
      http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-network-ovn-bgp-agent-17.1_director-rhel-virthost-basic_topology-ipv4-geneve-composable/38/cmp-2-0/var/log/containers/neutron/ovn-metadata-agent.log.gz
      2024-11-01 14:21:13.663 2552 DEBUG ovsdbapp.backend.ovs_idl [-] Created schema index Datapath_Binding.tunnel_key autocreate_indices /usr/lib/python3.9/site-packages/ovsdbapp/backend/ovs_idl/__init__.py:126
      2024-11-01 14:21:13.663 2552 DEBUG ovsdbapp.backend.ovs_idl [-] Created schema index Chassis_Private.name autocreate_indices /usr/lib/python3.9/site-packages/ovsdbapp/backend/ovs_idl/__init__.py:126
      2024-11-01 14:21:13.666 2552 INFO ovsdbapp.backend.ovs_idl.vlog [-] ssl:172.20.2.163:6642: connecting...
      2024-11-01 14:21:13.674 2552 INFO ovsdbapp.backend.ovs_idl.vlog [-] ssl:172.20.2.163:6642: connected
      2024-11-01 14:21:13.853 2552 DEBUG ovsdbapp.backend.ovs_idl.event [-] Matched CREATE: ChassisPrivateCreateEvent(events=('create',), table='Chassis_Private', conditions=(('name', '=', '4987f39a-6961-4648-a629-720bf7a4984c'),), old_conditions=None) to row=Chassis_Private(chassis=[<ovs.db.idl.Row object at 0x7f94403f7400>], external_ids={'neutron:ovn-metadata-id': 'df0a89ff-eb2c-52eb-90db-86f033694e23', 'neutron:ovn-metadata-sb-cfg': '69'}, name=4987f39a-6961-4648-a629-720bf7a4984c, nb_cfg_timestamp=1730470621792, nb_cfg=69) old= matches /usr/lib/python3.9/site-packages/ovsdbapp/backend/ovs_idl/event.py:43
      2024-11-01 14:22:13.608 2552 DEBUG oslo_concurrency.lockutils [-] Lock "_check_child_processes" acquired by "neutron.agent.linux.external_process.ProcessMonitor._check_child_processes" :: waited 0.000s inner /usr/lib/python3.9/site-packages/oslo_concurrency/lockutils.py:355
      2024-11-01 14:22:13.609 2552 DEBUG oslo_concurrency.lockutils [-] Lock "_check_child_processes" released by "neutron.agent.linux.external_process.ProcessMonitor._check_child_processes" :: held 0.001s inner /usr/lib/python3.9/site-packages/oslo_concurrency/lockutils.py:367
      2024-11-01 14:23:13.613 2552 DEBUG oslo_concurrency.lockutils [-] Lock "_check_child_processes" acquired by "neutron.agent.linux.external_process.ProcessMonitor._check_child_processes" :: waited 0.000s inner /usr/lib/python3.9/site-packages/oslo_concurrency/lockutils.py:355
      2024-11-01 14:23:13.614 2552 DEBUG oslo_concurrency.lockutils [-] Lock "_check_child_processes" released by "neutron.agent.linux.external_process.ProcessMonitor._check_child_processes" :: held 0.001s inner /usr/lib/python3.9/site-packages/oslo_concurrency/lockutils.py:367
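
      For context, the "Matched CREATE: ChassisPrivateCreateEvent(...)" line comes from ovsdbapp matching a registered RowEvent against the agent's own Chassis_Private row; the event's run() callback is what should drive the rest of the re-initialization. An illustrative sketch of that pattern (not the actual neutron implementation; the run() body is a placeholder):

      # Illustrative sketch of the ovsdbapp RowEvent pattern behind the
      # "Matched CREATE: ChassisPrivateCreateEvent(...)" log line. The run()
      # body is a placeholder, not the real neutron handler.
      from ovsdbapp.backend.ovs_idl import event as row_event


      class ChassisPrivateCreateEvent(row_event.RowEvent):
          def __init__(self, chassis_name):
              table = 'Chassis_Private'
              events = (self.ROW_CREATE,)
              conditions = (('name', '=', chassis_name),)
              super().__init__(events, table, conditions)
              self.event_name = self.__class__.__name__

          def run(self, event, row, old):
              # If whatever runs here blocks (e.g. waiting on a lock), the agent
              # never finishes its post-reboot initialization, which is what the
              # cmp-2-0 log above suggests.
              pass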

      2) cmp-2-1
      After that ChassisPrivateCreateEvent, the agent subscribes MetadataProxyHandler.post_fork_initialize and continues its startup (it does not seem to get stuck on any lock):
      http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-network-ovn-bgp-agent-17.1_director-rhel-virthost-basic_topology-ipv4-geneve-composable/38/cmp-2-1/var/log/containers/neutron/ovn-metadata-agent.log.gz
      2024-11-01 14:22:24.796 2353 DEBUG ovsdbapp.backend.ovs_idl.event [-] Matched CREATE: ChassisPrivateCreateEvent(events=('create',), table='Chassis_Private', conditions=(('name', '=', '1f8d7d20-42b3-4749-9168-89f6ea9b2b20'),), old_conditions=None) to row=Chassis_Private(chassis=[<ovs.db.idl.Row object at 0x7f024f83ae20>], external_ids={'neutron:ovn-metadata-id': 'b399e02b-5550-5dc2-92d1-20e0abc6de30', 'neutron:ovn-metadata-sb-cfg': '69'}, name=1f8d7d20-42b3-4749-9168-89f6ea9b2b20, nb_cfg_timestamp=1730470621769, nb_cfg=69) old= matches /usr/lib/python3.9/site-packages/ovsdbapp/backend/ovs_idl/event.py:43
      2024-11-01 14:22:24.803 2353 DEBUG neutron_lib.callbacks.manager [-] Subscribe: <bound method MetadataProxyHandler.post_fork_initialize of <neutron.agent.ovn.metadata.server.MetadataProxyHandler object at 0x7f024f919c70>> process after_init 55550000, False subscribe /usr/lib/python3.9/site-packages/neutron_lib/callbacks/manager.py:52
      2024-11-01 14:22:24.805 2353 DEBUG oslo_concurrency.lockutils [-] Acquired lock "singleton_lock" lock /usr/lib/python3.9/site-packages/oslo_concurrency/lockutils.py:266
      2024-11-01 14:22:24.805 2353 DEBUG oslo_concurrency.lockutils [-] Releasing lock "singleton_lock" lock /usr/lib/python3.9/site-packages/oslo_concurrency/lockutils.py:282
      2024-11-01 14:22:24.808 2353 INFO oslo_service.service [-] Starting 2 workers
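
      The "Subscribe: ... process after_init" line on cmp-2-1 corresponds to neutron_lib's callback registry: post_fork_initialize is subscribed to the PROCESS/AFTER_INIT event and is expected to run once the workers are forked. A minimal standalone sketch of that mechanism (illustration only; the real subscriber is MetadataProxyHandler.post_fork_initialize):

      # Minimal sketch of the neutron_lib callback mechanism behind the
      # "Subscribe: ... process after_init" log line. Illustration only.
      from neutron_lib.callbacks import events
      from neutron_lib.callbacks import registry
      from neutron_lib.callbacks import resources


      def post_fork_initialize(resource, event, trigger, payload=None):
          # cmp-2-1 reaches the subscribe step and the workers start;
          # cmp-2-0 never even gets this far.
          print('post-fork initialization running')


      # What the "Subscribe" debug line records:
      registry.subscribe(post_fork_initialize, resources.PROCESS, events.AFTER_INIT)

      # Later, the service publishes the event once its workers are forked:
      registry.publish(resources.PROCESS, events.AFTER_INIT, None)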

      Version-Release number of selected component (if applicable):
      RHOS-17.1-RHEL-9-20241030.n.1

      How reproducible:
      Reproduced only once; after talking to jlibosva, this issue seems very unlikely to reproduce.

      Steps to Reproduce:
      1. reboot compute nodes
      2. run 'openstack network agent list'

              Jakub Libosvar (jlibosva)
              Eran Kuris
              rhos-dfg-networking-squad-neutron