Bug
Resolution: Unresolved
Normal
rhos-18.0 Feature Release 1 (Nov 2024)
None
Description of problem:
(Related to https://bugs.launchpad.net/neutron/+bug/2086740)
(Pending confirmation!) This bug could be a regression of https://bugzilla.redhat.com/show_bug.cgi?id=2236159
The tobiko test test_soft_reboot_computes_recovery failed:
http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-network-ovn-bgp-agent-17.1_director-rhel-virthost-basic_topology-ipv4-geneve-composable/38/test_results/tobiko_faults_1/tobiko_faults_1_03_faults_faults.html?sort=time
It failed because the ovn-metadata-agent on cmp-2-0 did not recover after that compute was rebooted.
The test soft-reboots one compute, waits until it is up again (by checking its uptime), then reboots the next one, and so on.
When all computes have been rebooted, the test runs the list of tobiko health checks, including a check of the neutron agents (it sends a request to the neutron API). This check fails because the ovn-metadata-agent on cmp-2-0 is not "alive": the test retries for a long time and that agent never recovers.
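For reference, the failing health check amounts to something like the following (a minimal sketch, not the actual tobiko code; the cloud name, timeout and interval are my own illustrative values, and it assumes openstacksdk with a configured cloud):

# Poll the Neutron API until every network agent reports alive, or give up
# after a timeout. In this run the retries never succeed because the
# ovn-metadata-agent on cmp-2-0 never comes back.
import time
import openstack

def wait_for_agents_alive(cloud="overcloud", timeout=600, interval=10):
    conn = openstack.connect(cloud=cloud)
    deadline = time.time() + timeout
    while True:
        dead = [a for a in conn.network.agents() if not a.is_alive]
        if not dead:
            return
        if time.time() >= deadline:
            raise TimeoutError(
                "agents still down: %s" % [(a.binary, a.host) for a in dead])
        time.sleep(interval)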
I have compared the ovn-metadata-agent logs from cmp-2-0 (which fails) with those from cmp-2-1 (which recovers).
1) cmp-2-0
After that ChassisPrivateCreateEvent, the agent apparently gets stuck on a lock forever: from then on, only the periodic _check_child_processes lock messages appear in the log (see the sketch after the log excerpt):
http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-network-ovn-bgp-agent-17.1_director-rhel-virthost-basic_topology-ipv4-geneve-composable/38/cmp-2-0/var/log/containers/neutron/ovn-metadata-agent.log.gz
2024-11-01 14:21:13.663 2552 DEBUG ovsdbapp.backend.ovs_idl [-] Created schema index Datapath_Binding.tunnel_key autocreate_indices /usr/lib/python3.9/site-packages/ovsdbapp/backend/ovs_idl/__init__.py:126
2024-11-01 14:21:13.663 2552 DEBUG ovsdbapp.backend.ovs_idl [-] Created schema index Chassis_Private.name autocreate_indices /usr/lib/python3.9/site-packages/ovsdbapp/backend/ovs_idl/__init__.py:126
2024-11-01 14:21:13.666 2552 INFO ovsdbapp.backend.ovs_idl.vlog [-] ssl:172.20.2.163:6642: connecting...
2024-11-01 14:21:13.674 2552 INFO ovsdbapp.backend.ovs_idl.vlog [-] ssl:172.20.2.163:6642: connected
2024-11-01 14:21:13.853 2552 DEBUG ovsdbapp.backend.ovs_idl.event [-] Matched CREATE: ChassisPrivateCreateEvent(events=('create',), table='Chassis_Private', conditions=(('name', '=', '4987f39a-6961-4648-a629-720bf7a4984c'),), old_conditions=None) to row=Chassis_Private(chassis=[<ovs.db.idl.Row object at 0x7f94403f7400>], external_ids={'neutron:ovn-metadata-id': 'df0a89ff-eb2c-52eb-90db-86f033694e23', 'neutron:ovn-metadata-sb-cfg': '69'}, name=4987f39a-6961-4648-a629-720bf7a4984c, nb_cfg_timestamp=1730470621792, nb_cfg=69) old= matches /usr/lib/python3.9/site-packages/ovsdbapp/backend/ovs_idl/event.py:43
2024-11-01 14:22:13.608 2552 DEBUG oslo_concurrency.lockutils [-] Lock "_check_child_processes" acquired by "neutron.agent.linux.external_process.ProcessMonitor._check_child_processes" :: waited 0.000s inner /usr/lib/python3.9/site-packages/oslo_concurrency/lockutils.py:355
2024-11-01 14:22:13.609 2552 DEBUG oslo_concurrency.lockutils [-] Lock "_check_child_processes" released by "neutron.agent.linux.external_process.ProcessMonitor._check_child_processes" :: held 0.001s inner /usr/lib/python3.9/site-packages/oslo_concurrency/lockutils.py:367
2024-11-01 14:23:13.613 2552 DEBUG oslo_concurrency.lockutils [-] Lock "_check_child_processes" acquired by "neutron.agent.linux.external_process.ProcessMonitor._check_child_processes" :: waited 0.000s inner /usr/lib/python3.9/site-packages/oslo_concurrency/lockutils.py:355
2024-11-01 14:23:13.614 2552 DEBUG oslo_concurrency.lockutils [-] Lock "_check_child_processes" released by "neutron.agent.linux.external_process.ProcessMonitor._check_child_processes" :: held 0.001s inner /usr/lib/python3.9/site-packages/oslo_concurrency/lockutils.py:367
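The comparison between the two nodes was done by hand, but it boils down to scanning each log for what the agent does after the ChassisPrivateCreateEvent match (a rough sketch; the marker string is just the event name, nothing tobiko-specific):

# Yield every log line that follows the ChassisPrivateCreateEvent match,
# skipping the periodic _check_child_processes lock messages. On cmp-2-0 this
# yields nothing; on cmp-2-1 it yields the Subscribe and worker-start lines.
def lines_after_event(log_path, marker="ChassisPrivateCreateEvent"):
    seen = False
    with open(log_path) as log:
        for line in log:
            if marker in line:
                seen = True
                continue
            if seen and "_check_child_processes" not in line:
                yield line.rstrip()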
2) cmp-2-1
After that ChassisPrivateCreateEvent, it performs the Subscribe and continues with the rest of its startup (i.e. it does not seem to get stuck on any lock):
http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-network-ovn-bgp-agent-17.1_director-rhel-virthost-basic_topology-ipv4-geneve-composable/38/cmp-2-1/var/log/containers/neutron/ovn-metadata-agent.log.gz
2024-11-01 14:22:24.796 2353 DEBUG ovsdbapp.backend.ovs_idl.event [-] Matched CREATE: ChassisPrivateCreateEvent(events=('create',), table='Chassis_Private', conditions=(('name', '=', '1f8d7d20-42b3-4749-9168-89f6ea9b2b20'),), old_conditions=None) to row=Chassis_Private(chassis=[<ovs.db.idl.Row object at 0x7f024f83ae20>], external_ids={'neutron:ovn-metadata-id': 'b399e02b-5550-5dc2-92d1-20e0abc6de30', 'neutron:ovn-metadata-sb-cfg': '69'}, name=1f8d7d20-42b3-4749-9168-89f6ea9b2b20, nb_cfg_timestamp=1730470621769, nb_cfg=69) old= matches /usr/lib/python3.9/site-packages/ovsdbapp/backend/ovs_idl/event.py:43
2024-11-01 14:22:24.803 2353 DEBUG neutron_lib.callbacks.manager [-] Subscribe: <bound method MetadataProxyHandler.post_fork_initialize of <neutron.agent.ovn.metadata.server.MetadataProxyHandler object at 0x7f024f919c70>> process after_init 55550000, False subscribe /usr/lib/python3.9/site-packages/neutron_lib/callbacks/manager.py:52
2024-11-01 14:22:24.805 2353 DEBUG oslo_concurrency.lockutils [-] Acquired lock "singleton_lock" lock /usr/lib/python3.9/site-packages/oslo_concurrency/lockutils.py:266
2024-11-01 14:22:24.805 2353 DEBUG oslo_concurrency.lockutils [-] Releasing lock "singleton_lock" lock /usr/lib/python3.9/site-packages/oslo_concurrency/lockutils.py:282
2024-11-01 14:22:24.808 2353 INFO oslo_service.service [-] Starting 2 workers
Version-Release number of selected component (if applicable):
RHOS-17.1-RHEL-9-20241030.n.1
How reproducible:
Reproduced only once; after talking to jlivosba, it seems this issue is very unlikely to occur.
Steps to Reproduce:
1. reboot compute nodes
2. run 'openstack network agent list'
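Step 2 can also be scripted to look at the metadata agents specifically (a sketch assuming the openstack CLI is configured; the exact field names/values in the JSON output are my assumption):

# Run 'openstack network agent list' with JSON output and report any metadata
# agent that is not alive.
import json
import subprocess

out = subprocess.run(
    ["openstack", "network", "agent", "list", "-f", "json"],
    check=True, capture_output=True, text=True,
).stdout
dead = []
for agent in json.loads(out):
    if "metadata" not in agent.get("Binary", ""):
        continue
    # Depending on the client version, Alive is a boolean or the ":-)"/"XXX" string.
    if agent.get("Alive") not in (True, ":-)"):
        dead.append((agent.get("Binary"), agent.get("Host")))
print(dead or "all metadata agents alive")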