Bug
Resolution: Done
Major
rhos-18.0.0
1
False
False
openstack-nova-27.1.1-18.0.20230930093334.a869ab1.el9ost
None
Undefined
Moderate
+++ This bug was initially created as a clone of Bug #1955031 +++
Description of problem:
If Neutron is slow and the activate step from https://specs.openstack.org/openstack/neutron-specs/_images/seqdiag-7d93af8770687c1a95685208b8e6ca0553a8fb25.png fails, multiple port bindings are left in the Neutron database and the VM ends up in ERROR state.
Even though the activation is eventually successful on the Neutron side (just slow) and the live migration itself succeeded, things are left in an inconsistent, broken state.
Version-Release number of selected component (if applicable):
OSP 16.1.5
How reproducible:
Always
Steps to Reproduce:
1. Put some sleep, like 120 seconds, in Neutron activate method: https://github.com/openstack/neutron/blob/150396625aaa870640a6fb9d636aeb90cea4da3e/neutron/plugins/ml2/plugin.py#L2556
2. Run live migration
3. Observe ml2_port_bindings table for the VM port
Actual results:
Two port bindings remain and VM is in error state
Expected results:
Nova could be more resilient to failures in the activate step (for example by retrying the activation until it succeeds) or have a mechanism to finish the migration from such a state, because everything is actually correct from the VM standpoint.
Additional info:
— Additional comment from ffernand on 2021-04-29 13:14:12 UTC —
For sake of clarity, I thought of attaching a sample output of what it looks like in the neutron logs, as well as the offending rows in the neutron db.
2021-04-28 20:33:47.251 39 ERROR neutron.plugins.ml2.managers oslo_db.exception.DBDuplicateEntry: (pymysql.err.IntegrityError) (1062, "Duplicate entry '1f0da7ae-2c34-4ea7-b799-ceb81c1b37d3-compute-d-087.localdomain' for key 'PRIMARY'")
2021-04-28 20:33:47.251 39 ERROR neutron.plugins.ml2.managers [SQL: UPDATE ml2_port_bindings SET host=%(host)s, vif_type=%(vif_type)s, vif_details=%(vif_details)s WHERE ml2_port_bindings.port_id = %(ml2_port_bindings_port_id)s AND ml2_port_bindings.host = %(ml2_port_bindings_host)s]
2021-04-28 20:33:47.251 39 ERROR neutron.plugins.ml2.managers [parameters: {'host': 'compute-d-087.localdomain', 'vif_type': 'unbound', 'vif_details': '', 'ml2_port_bindings_port_id': '1f0da7ae-2c34-4ea7-b799-ceb81c1b37d3', 'ml2_port_bindings_host': 'compute-d-013.localdomain'}]
DB looks like:
MariaDB [ovs_neutron]> select * from ml2_port_bindings where port_id="1f0da7ae-2c34-4ea7-b799-ceb81c1b37d3";
+--------------------------------------+---------------------------+----------+-----------+-----------------------------------------------+-----------------------+----------+
| port_id                              | host                      | vif_type | vnic_type | profile                                       | vif_details           | status   |
+--------------------------------------+---------------------------+----------+-----------+-----------------------------------------------+-----------------------+----------+
| 1f0da7ae-2c34-4ea7-b799-ceb81c1b37d3 | compute-d-013.localdomain | ovs      | normal    |                                               | {"port_filter": true} | ACTIVE   |
| 1f0da7ae-2c34-4ea7-b799-ceb81c1b37d3 | compute-d-087.localdomain | unbound  | normal    | {"migrating_to": "compute-d-013.localdomain"} |                       | INACTIVE |
+--------------------------------------+---------------------------+----------+-----------+-----------------------------------------------+-----------------------+----------+
— Additional comment from Artom Lifshitz on 2021-05-03 16:36:21 UTC —
Nova does have the [neutron]http_retries config option, that one can set to any number > 0 to make Nova retry all calls to the Neutron API if the initial request errors out.
I'm not sure what else Nova can do here - sure, once we're at the point that we have to activate the port binding on the destination, everything is set up correctly from Nova's POV, but until we get confirmation from Neutron that the port binding has been updated, we can't reasonably move the instance back to ACTIVE. So I feel like setting it to ERROR if the port binding update fails is the correct thing to do. We'll discuss this with the wider team on our bug triage call on Wednesday.
— Additional comment from melanie witt on 2021-05-03 19:45:47 UTC —
(In reply to Artom Lifshitz from comment #2)
> Nova does have the [neutron]http_retries config option, that one can set to
> any number > 0 to make Nova retry all calls to the Neutron API if the
> initial request errors out.
Yeah, this defaults to 3 retries and it does apply to a timeout situation (the retries handle "ConnectionError" which can be things like timeout, SSL error, connection refused, etc) [1].
But if neutron is being consistently slow, then it's possible all 4 attempts will still time out each time.
It's not clear to me what the behavior of the API is if the binding is already activated and we keep doing the PUT to activate it. The API is unfortunately not documented [2].
I'm not sure whether this is a good idea, but I wonder if we could do something like catch a timeout exception from the PUT to activate the port and, in that case, do a GET for the port bindings and check whether the destination binding is ACTIVE and the source INACTIVE (i.e. check whether the activate actually did work). If so, go ahead and put the instance into ACTIVE state as a successfully completed migration.
[1] https://github.com/openstack/keystoneauth/blob/112bcae1fbec355fcb58da07ae1d9f1adf8b77ba/keystoneauth1/exceptions/connection.py#L38
[2] https://bugs.launchpad.net/neutron/+bug/1853873
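A minimal sketch of the fallback idea above, with hypothetical helper names (activate_port_binding, get_port_bindings) standing in for nova's real neutron client calls; this is illustrative, not nova's actual implementation:
import requests

def activation_succeeded(neutron, port_id, source_host, dest_host):
    # Hypothetical sketch: try the activate PUT; on a timeout, re-read the
    # port bindings and treat the migration as complete if neutron already
    # switched them (destination ACTIVE, source INACTIVE).
    try:
        # PUT /v2.0/ports/{port_id}/bindings/{dest_host}/activate
        neutron.activate_port_binding(port_id, dest_host)
        return True
    except requests.exceptions.Timeout:
        # The PUT may have succeeded on the neutron side even though the
        # response never made it back; check before giving up.
        bindings = {b['host']: b['status']
                    for b in neutron.get_port_bindings(port_id)}
        return (bindings.get(dest_host) == 'ACTIVE'
                and bindings.get(source_host) == 'INACTIVE')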
— Additional comment from melanie witt on 2021-05-07 23:35:00 UTC —
Hi,
If you still have access to it, could you please attach the nova-compute.log from when this happened to this BZ? It would help to be able to confirm whether nova, on the second attempt to activate the port binding, (a) timed out on the REST call to neutron API or (b) neutron API returned an error code.
— Additional comment from melanie witt on 2021-05-18 03:00:11 UTC —
(In reply to ffernand from comment #1)
> For sake of clarity, I thought of attaching a sample output of what it looks
> like in the neutron logs, as well as the offending rows in the neutron db.
> [quoted DBDuplicateEntry log excerpt and ml2_port_bindings table from comment #1 trimmed]
Thanks for this output. It looks like what happened here is there was an attempt in neutron to deactivate the port binding for the source (compute-d-087.localdomain) again, when it had already been deactivated. I wonder if this is because of a retry to activate the port binding on the destination (which would also deactivate the port binding on the source) from the nova side?
I'm going to repro this to verify the detail of what's happening.
— Additional comment from melanie witt on 2021-05-26 01:52:25 UTC —
I have reproduced the issue in a local devstack by applying the following patch, as suggested in comment 0:
--- a/neutron/plugins/ml2/plugin.py
+++ b/neutron/plugins/ml2/plugin.py
@@ -13,6 +13,8 @@
 #    License for the specific language governing permissions and limitations
 #    under the License.
 
+import time
+
 from eventlet import greenthread
 import netaddr
 from netaddr.strategy import eui48
@@ -2566,6 +2568,7 @@ class Ml2Plugin(db_base_plugin_v2.NeutronDbPluginV2,
                 cur_context, _, try_again = self._commit_port_binding(
                     original_context, bind_context, need_notify=True,
                     update_binding_levels=False)
+                time.sleep(120)
                 if not try_again:
                     self.notifier.binding_deactivate(context, port_id,
                                                      active_binding.host,
And then doing live migrations:
(openstack) server create --image 57e9f98f-c7c9-4aac-a289-95ea6bcbcc83 --flavor c1 eight
(openstack) server migrate --live --block-migration eight
I have found that the behavior on the nova side is actually a bug where our [neutron]http_retries config option is not being honored in the case of the neutron port binding API. In our neutron API client code, we actually have two different ways of creating a client to talk to neutron [1], and when the [neutron]http_retries config option was added to nova, the second client (the one that calls the neutron port binding API) was missed when adding the default retries. So, the call to the port binding /activate API is not doing any retries by default.
You can work around the bug by adding to your nova.conf on the compute hosts:
[neutron]
connect_retries = 3
and the keystoneauth code we use in the clients will pick up the setting automatically.
When retries are actually being done, nova will time out during the first call, then retry and get a 409 Conflict response from neutron, which nova treats as a successful port binding activation, and the live migration completes without error. These excerpts show the behavior with [neutron]connect_retries = 3 on the nova side and on the neutron side:
n-cpu:
May 26 01:09:14 ubuntu-subnode nova-compute[411810]: INFO keystoneauth.session [None req-9327f0c5-e70f-4d54-93af-cae935d1e3bd None None] Failure: Request to http://192.168.33.10:9696/v2.0/ports/d235be7e-6bf8-4403-afbf-ab2963e6574b/bindings/ubuntu-controller/activate timed out. Retrying in 0.5s.
May 26 01:09:15 ubuntu-subnode nova-compute[411810]: WARNING nova.network.neutron [None req-9327f0c5-e70f-4d54-93af-cae935d1e3bd None None] Binding for port d235be7e-6bf8-4403-afbf-ab2963e6574b and host ubuntu-controller is already active.
q-svc:
May 26 01:09:15 ubuntu-controller neutron-server[144111]: INFO neutron.wsgi [req-8160f605-287b-494d-bb22-db3f8b3ee756 req-d439d52a-c18a-43fc-9c7d-79087b44dbae service neutron] 192.168.33.11 "PUT /v2.0/ports/d235be7e-6bf8-4403-afbf-ab2963e6574b/bindings/ubuntu-controller/activate HTTP/1.1" status: 409 len: 386 time: 0.1463201
[...]
May 26 01:10:45 ubuntu-controller neutron-server[144112]: INFO neutron.wsgi [req-8160f605-287b-494d-bb22-db3f8b3ee756 req-17d0d362-c8a1-4f67-bd49-0c68f50715ef service neutron] 192.168.33.11 "PUT /v2.0/ports/d235be7e-6bf8-4403-afbf-ab2963e6574b/bindings/ubuntu-controller/activate HTTP/1.1" status: 200 len: 443 time: 120.8294811
On the neutron side, we can see here ^ that the 409 Conflict reply (to the retried request) actually comes back before the reply to nova's original first request, because of the sleep(120) that was patched in to reproduce the issue.
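For reference, the WARNING in the n-cpu log corresponds to treating that 409 as "binding already active". A rough sketch of the pattern using a keystoneauth adapter (illustrative names and plumbing, not nova's exact code):
import logging

from keystoneauth1 import adapter

LOG = logging.getLogger(__name__)

def activate_dest_binding(neutron_adapter, port_id, host):
    # neutron_adapter is assumed to be a keystoneauth1 Adapter pointed at the
    # neutron endpoint. raise_exc=False lets us inspect the status code.
    resp = neutron_adapter.put(
        '/v2.0/ports/%s/bindings/%s/activate' % (port_id, host),
        raise_exc=False)
    if resp.status_code == 409:
        # Neutron already activated this binding (e.g. the first, timed-out
        # request eventually succeeded on the server side), so treat it as
        # success.
        LOG.warning('Binding for port %s and host %s is already active.',
                    port_id, host)
    else:
        resp.raise_for_status()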
To fix the nova bug, we need to pass CONF.neutron.http_retries as connect_retries to the keystoneauth based client for calling the port binding API.
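A minimal sketch of that fix, assuming a helper that builds the keystoneauth Adapter used for the port binding calls (the helper name is illustrative, not the exact nova code):
from keystoneauth1 import adapter

import nova.conf

CONF = nova.conf.CONF

def get_binding_client(session):
    # Pass the configured retry count through to keystoneauth so the port
    # binding requests are retried on connection errors, matching the
    # behavior of the other neutron client.
    return adapter.Adapter(
        session=session,
        service_type='network',
        connect_retries=CONF.neutron.http_retries)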
— Additional comment from melanie witt on 2021-05-26 02:10:00 UTC —
I found that the repeating neutron DuplicateEntry error is actually caused by a neutron side retry sequence:
q-svc:
May 26 00:59:49 ubuntu-controller neutron-server[144110]: DEBUG neutron.plugins.ml2.managers [req-7f1bf57f-6fdb-4014-9abd-a7317f3c54e2 req-30feb9f0-e0fa-4f48-802b-31e78181805a service neutron] DB exception raised by Mechanism driver 'openvswitch' in update_port_precommit (pid=144110) _call_on_drivers /opt/stack/neutron/neutron/plugins/ml2/managers.py:481
May 26 00:59:49 ubuntu-controller neutron-server[144110]: ERROR neutron.plugins.ml2.managers Traceback (most recent call last):
May 26 00:59:49 ubuntu-controller neutron-server[144110]: ERROR neutron.plugins.ml2.managers File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py", line 1276, in _execute_context
May 26 00:59:49 ubuntu-controller neutron-server[144110]: ERROR neutron.plugins.ml2.managers self.dialect.do_execute(
May 26 00:59:49 ubuntu-controller neutron-server[144110]: ERROR neutron.plugins.ml2.managers File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/default.py", line 608, in do_execute
May 26 00:59:49 ubuntu-controller neutron-server[144110]: ERROR neutron.plugins.ml2.managers cursor.execute(statement, parameters)
May 26 00:59:49 ubuntu-controller neutron-server[144110]: ERROR neutron.plugins.ml2.managers File "/usr/local/lib/python3.8/dist-packages/pymysql/cursors.py", line 148, in execute
May 26 00:59:49 ubuntu-controller neutron-server[144110]: ERROR neutron.plugins.ml2.managers result = self._query(query)
May 26 00:59:49 ubuntu-controller neutron-server[144110]: ERROR neutron.plugins.ml2.managers File "/usr/local/lib/python3.8/dist-packages/pymysql/cursors.py", line 310, in _query
May 26 00:59:49 ubuntu-controller neutron-server[144110]: ERROR neutron.plugins.ml2.managers conn.query(q)
May 26 00:59:49 ubuntu-controller neutron-server[144110]: ERROR neutron.plugins.ml2.managers File "/usr/local/lib/python3.8/dist-packages/pymysql/connections.py", line 548, in query
May 26 00:59:49 ubuntu-controller neutron-server[144110]: ERROR neutron.plugins.ml2.managers self._affected_rows = self._read_query_result(unbuffered=unbuffered)
May 26 00:59:49 ubuntu-controller neutron-server[144110]: ERROR neutron.plugins.ml2.managers File "/usr/local/lib/python3.8/dist-packages/pymysql/connections.py", line 775, in _read_query_result
May 26 00:59:49 ubuntu-controller neutron-server[144110]: ERROR neutron.plugins.ml2.managers result.read()
May 26 00:59:49 ubuntu-controller neutron-server[144110]: ERROR neutron.plugins.ml2.managers File "/usr/local/lib/python3.8/dist-packages/pymysql/connections.py", line 1156, in read
May 26 00:59:49 ubuntu-controller neutron-server[144110]: ERROR neutron.plugins.ml2.managers first_packet = self.connection._read_packet()
May 26 00:59:49 ubuntu-controller neutron-server[144110]: ERROR neutron.plugins.ml2.managers File "/usr/local/lib/python3.8/dist-packages/pymysql/connections.py", line 725, in _read_packet
May 26 00:59:49 ubuntu-controller neutron-server[144110]: ERROR neutron.plugins.ml2.managers packet.raise_for_error()
May 26 00:59:49 ubuntu-controller neutron-server[144110]: ERROR neutron.plugins.ml2.managers File "/usr/local/lib/python3.8/dist-packages/pymysql/protocol.py", line 221, in raise_for_error
May 26 00:59:49 ubuntu-controller neutron-server[144110]: ERROR neutron.plugins.ml2.managers err.raise_mysql_exception(self._data)
May 26 00:59:49 ubuntu-controller neutron-server[144110]: ERROR neutron.plugins.ml2.managers File "/usr/local/lib/python3.8/dist-packages/pymysql/err.py", line 143, in raise_mysql_exception
May 26 00:59:49 ubuntu-controller neutron-server[144110]: ERROR neutron.plugins.ml2.managers raise errorclass(errno, errval)
May 26 00:59:49 ubuntu-controller neutron-server[144110]: ERROR neutron.plugins.ml2.managers pymysql.err.IntegrityError: (1062, "Duplicate entry '17c5e5e6-be12-44b0-898f-80e0d731fd71-ubuntu-subnode' for key 'ml2_port_bindings.PRIMARY'")
[...]
May 26 00:59:49 ubuntu-controller neutron-server[144110]: ERROR neutron.plugins.ml2.managers [SQL: UPDATE ml2_port_bindings SET host=%(host)s, vif_type=%(vif_type)s, vif_details=%(vif_details)s WHERE ml2_port_bindings.port_id = %(ml2_port_bindings_port_id)s AND ml2_port_bindings.host = %(ml2_port_bindings_host)s]
May 26 00:59:49 ubuntu-controller neutron-server[144110]: ERROR neutron.plugins.ml2.managers [parameters:
]
May 26 00:59:49 ubuntu-controller neutron-server[144110]: ERROR neutron.plugins.ml2.managers (Background on this error at: http://sqlalche.me/e/13/gkpj)
May 26 00:59:49 ubuntu-controller neutron-server[144110]: ERROR neutron.plugins.ml2.managers
May 26 00:59:49 ubuntu-controller neutron-server[144110]: DEBUG neutron_lib.db.api [req-7f1bf57f-6fdb-4014-9abd-a7317f3c54e2 req-30feb9f0-e0fa-4f48-802b-31e78181805a service neutron] Retry wrapper got retriable exception: (pymysql.err.IntegrityError) (1062, "Duplicate entry '17c5e5e6-be12-44b0-898f-80e0d731fd71-ubuntu-subnode' for key 'ml2_port_bindings.PRIMARY'")
May 26 00:59:49 ubuntu-controller neutron-server[144110]: [SQL: UPDATE ml2_port_bindings SET host=%(host)s, vif_type=%(vif_type)s, vif_details=%(vif_details)s WHERE ml2_port_bindings.port_id = %(ml2_port_bindings_port_id)s AND ml2_port_bindings.host = %(ml2_port_bindings_host)s]
May 26 00:59:49 ubuntu-controller neutron-server[144110]: [parameters:
]
May 26 00:59:49 ubuntu-controller neutron-server[144110]: (Background on this error at: http://sqlalche.me/e/13/gkpj) (pid=144110) wrapped /usr/local/lib/python3.8/dist-packages/neutron_lib/db/api.py:183
May 26 00:59:49 ubuntu-controller neutron-server[144110]: DEBUG oslo_db.api [req-7f1bf57f-6fdb-4014-9abd-a7317f3c54e2 req-30feb9f0-e0fa-4f48-802b-31e78181805a service neutron] Performing DB retry for function neutron.plugins.ml2.plugin.Ml2Plugin.update_port (pid=144110) wrapper /usr/local/lib/python3.8/dist-packages/oslo_db/api.py:155
[...]
I think this ^ would be a separate bug to open for the neutron component.
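For context, the "Performing DB retry" lines above come from oslo_db's retry decorator. A generic sketch of how such a bounded retry wrapper is applied (illustrative only, not neutron's exact usage; whether the retries in this scenario are effectively bounded is the open question here):
from oslo_db import api as oslo_db_api

# Retries the wrapped function with increasing intervals, up to max_retries,
# when it raises a retriable DB error such as a deadlock; after that the
# exception propagates instead of retrying forever.
@oslo_db_api.wrap_db_retry(max_retries=10, retry_interval=0.5,
                           inc_retry_interval=True, retry_on_deadlock=True)
def update_port_binding_row(context, port_id, host, values):
    # Hypothetical DB-updating function, present only to show where the
    # decorator would sit.
    raise NotImplementedError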
— Additional comment from Jakub Libosvar on 2021-05-26 15:31:34 UTC —
(In reply to melanie witt from comment #7)
> I found that the repeating neutron DuplicateEntry error is actually caused
> by a neutron side retry sequence:
> [quoted q-svc log excerpt from comment #7 trimmed]
> I think this ^ would be a separate bug to open for the neutron component.
I think this is actually correct behavior. Neutron should retry the port binding multiple times - even if the DB is in bad shape. This is triggered on every port update to fix failed_binding statuses. I'll needinfo Slawek to keep me honest here and to get his opinion.
— Additional comment from melanie witt on 2021-05-26 15:40:04 UTC —
(In reply to Jakub Libosvar from comment #8)
> (In reply to melanie witt from comment #7)
> > I think this ^ would be a separate bug to open for the neutron component.
>
> I think this is actually correct behavior. Neutron should retry the port
> binding multiple times - even if the DB is in bad shape. This is triggered
> on every port update to fix failed_binding statuses. I'll needinfo Slawek to
> keep me honest here and to get his opinion.
Yeah.. what I found odd about the behavior is that it continues to retry forever AFAICT, and it didn't need to, because the activation eventually worked (delayed by the sleep(120)); it continued to try to apply the same DB update repeatedly even though the DB was already in the desired state.
— Additional comment from Slawek Kaplonski on 2021-05-27 07:23:44 UTC —
I'm not sure if it should keep trying to update the DB forever. IMHO it should stop at some point. Please open a new BZ for that and I will try to reproduce such a case and take a deeper look.
— Additional comment from Jakub Libosvar on 2021-05-27 07:28:17 UTC —
(In reply to melanie witt from comment #9)
> (In reply to Jakub Libosvar from comment #8)
> > (In reply to melanie witt from comment #7)
> > > I think this ^ would be a separate bug to open for the neutron component.
> >
> > I think this is actually correct behavior. Neutron should retry the port
> > binding multiple times - even if the DB is in bad shape. This is triggered
> > on every port update to fix failed_binding statuses. I'll needinfo Slawek to
> > keep me honest here and to get his opinion.
>
> Yeah.. what I found odd about the behavior is that it continues to retry
> forever AFAICT and it didn't need to because the activation eventually
> worked (delayed by sleep(120)) but it continued to try to apply the same DB
> update repeatedly anyway, even though the DB was already in the desired
> state.
A-ha, I missed that it never stops attempting the DB update. Thanks for the clarification.
— Additional comment from Slawek Kaplonski on 2021-06-01 13:59:52 UTC —
@mwitt@redhat.com Hi, I was trying to reproduce the issue and check that error on the neutron server side, but I couldn't. I added the sleep as suggested in one of the above comments and migrated a VM, but the port was always bound to the proper host without any issue.
Should I set some specific config options on the nova-compute side to reproduce that? I already tried with and without:
[neutron]
connect_retries = 3
and the result was the same :/
— Additional comment from melanie witt on 2021-06-01 19:34:13 UTC —
(In reply to Slawek Kaplonski from comment #12)
> @mwitt@redhat.com Hi, I was trying to reproduce the issue and check that
> error on the neutron server side but I couldn't. I added sleep as suggested
> in one of the above comments and I was migrating vm but port was always
> bound to proper host without any issue.
> Should I set some specific config options on the nova-compute's side to
> reproduce that? I already tried with and without:
>
> [neutron]
> connect_retries = 3
>
> and the result was the same :/
Hi Slawek, I wrote the repro steps in comment 6, and it is important that the sleep(120) is added after the _commit_port_binding call, otherwise the behavior will not repro. Also, the sleep needs to be sufficiently long; originally I tried 60 and it wasn't long enough to reproduce the behavior. As for the config option, do not set or add [neutron]connect_retries, as doing that will prevent repro of the behavior as well.
Is it possible that you tried the sleep at a different code line than was shown in comment 6, or tried a sleep of less than 120?
Also note that neutron will still bind the port to the proper host in this scenario, but with a sleep of 120 it will trigger some sort of retrying mechanism even though it bound successfully (eventually).
I hope this helps and please let me know if it doesn't work for you.