Bug
Resolution: Duplicate
Major
Bug Tracking
rhos-workloads-evolution
Important
Summary
Watcher applier fails to monitor live migration progress and raises ConnectFailure: Remote end closed connection without response. The migration completes successfully, but Watcher reports it as failed because of a race condition between the HTTP Keep-Alive timeout of the Apache httpd instance serving the Nova API (5 seconds) and Watcher's polling interval (also about 5 seconds).
Impact
- Affected Component: Watcher applier (live migration monitoring)
- User Impact: Watcher incorrectly reports successful migrations as failed, causing action plans to fail even when the underlying Nova operation succeeds
- Workaround: The migration itself completes successfully despite the error; the failed Watcher action can be retried.
Detailed Description
When the Watcher applier triggers a live migration, it monitors progress by polling the Nova API roughly every 5 seconds. The Apache httpd instance serving the Nova API uses the default HTTP Keep-Alive timeout of 5 seconds. This creates a race condition: urllib3's connection pool can reuse a connection that the server has already closed (or is in the middle of closing), and the request fails with a RemoteDisconnected error.
Timeline of Events
06:27:38.621 - Watcher initiates live migration for instance bbcb12e6-ebf7-49e2-847a-65f1b3a3266c
06:27:39.099 - POST /v2.1/servers/.../action returns HTTP 202 (migration accepted)
06:27:39.456 - GET /v2.1/servers/... returns HTTP 200 (status: MIGRATING), Keep-Alive: timeout=5, max=96
06:27:44.459 - Connection reset detected: "Resetting dropped connection: nova-internal.openstack.svc"
06:27:45.091 - GET /v2.1/servers/... returns HTTP 200 on a new connection, Keep-Alive: timeout=5, max=100
06:27:50.095 - Watcher sends the next GET (5.004 seconds after the previous response)
06:27:50.096 - ERROR: Remote end closed connection without response
Meanwhile, the migration actually completes successfully:
06:27:54 - Live migration initiated on compute-1
06:27:58 - Migration operation has completed ✓
06:27:58 - _post_live_migration() started
06:27:59 - Activated binding for port on compute-0
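The idle gaps in the timeline can be checked with a few lines of Python. This is a small sketch: the timestamps are copied from the timeline above, and treating the client-side gap as the server's idle time is an approximation, since Apache starts its KeepAliveTimeout when it finishes sending the previous response, not when the client logs it.

```python
from datetime import datetime

KEEPALIVE_TIMEOUT_S = 5.0  # from the "Keep-Alive: timeout=5" response header


def idle_gap_seconds(last_response: str, next_request: str) -> float:
    """Seconds between two HH:MM:SS.fff timestamps taken from the applier log."""
    fmt = "%H:%M:%S.%f"
    delta = datetime.strptime(next_request, fmt) - datetime.strptime(last_response, fmt)
    return delta.total_seconds()


# (previous response, next request (approx.), what the log shows), from the timeline
polls = [
    ("06:27:39.456", "06:27:44.459", "urllib3 logged 'Resetting dropped connection'"),
    ("06:27:45.091", "06:27:50.095", "request sent, RemoteDisconnected"),
]

for prev, nxt, outcome in polls:
    gap = idle_gap_seconds(prev, nxt)
    expired = gap > KEEPALIVE_TIMEOUT_S
    print(f"{prev} -> {nxt}: idle {gap:.3f}s, past keep-alive timeout: {expired} ({outcome})")
```

Both gaps exceed 5 seconds by only a few milliseconds. In the first case urllib3 noticed the close before reusing the connection; in the second it did not, which fits a timing race rather than a deterministic failure.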
Error Logs
Watcher Applier Error (watcher-applier.log:4265-4277)
2025-12-05 06:27:44.459 1 DEBUG urllib3.connectionpool [None req-ced62887-1ad4-4c3c-a7ae-96f7ffc873ad - - - - - -] Resetting dropped connection: nova-internal.openstack.svc _get_conn /usr/lib/python3.12/site-packages/urllib3/connectionpool.py:291
2025-12-05 06:27:50.095 1 DEBUG novaclient.v2.client [None req-ced62887-1ad4-4c3c-a7ae-96f7ffc873ad - - - - - -] REQ: curl -g -i --cacert "/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem" -X GET https://nova-internal.openstack.svc:8774/v2.1/servers/bbcb12e6-ebf7-49e2-847a-65f1b3a3266c -H "Accept: application/json" -H "OpenStack-API-Version: compute 2.56" -H "User-Agent: python-novaclient" -H "X-Auth-Token: {SHA1}..." -H "X-OpenStack-Nova-API-Version: 2.56" _http_log_request /usr/lib/python3.12/site-packages/keystoneauth1/session.py:572
2025-12-05 06:27:50.096 1 ERROR watcher.applier.actions.migration [None req-ced62887-1ad4-4c3c-a7ae-96f7ffc873ad - - - - - -] Unable to establish connection to https://nova-internal.openstack.svc:8774/v2.1/servers/bbcb12e6-ebf7-49e2-847a-65f1b3a3266c: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response')):
keystoneauth1.exceptions.connection.ConnectFailure: Unable to establish connection to https://nova-internal.openstack.svc:8774/v2.1/servers/bbcb12e6-ebf7-49e2-847a-65f1b3a3266c: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
Full Stack Trace (watcher-applier.log:4277-4379)
2025-12-05 06:27:50.096 1 ERROR watcher.applier.actions.migration Traceback (most recent call last):
2025-12-05 06:27:50.096 1 ERROR watcher.applier.actions.migration File "/usr/lib/python3.12/site-packages/urllib3/connectionpool.py", line 462, in _make_request
2025-12-05 06:27:50.096 1 ERROR watcher.applier.actions.migration httplib_response = conn.getresponse()
2025-12-05 06:27:50.096 1 ERROR watcher.applier.actions.migration ^^^^^^^^^^^^^^^^^^
2025-12-05 06:27:50.096 1 ERROR watcher.applier.actions.migration File "/usr/lib64/python3.12/http/client.py", line 1430, in getresponse
2025-12-05 06:27:50.096 1 ERROR watcher.applier.actions.migration response.begin()
2025-12-05 06:27:50.096 1 ERROR watcher.applier.actions.migration File "/usr/lib64/python3.12/http/client.py", line 331, in begin
2025-12-05 06:27:50.096 1 ERROR watcher.applier.actions.migration version, status, reason = self._read_status()
2025-12-05 06:27:50.096 1 ERROR watcher.applier.actions.migration ^^^^^^^^^^^^^^^^^^^
2025-12-05 06:27:50.096 1 ERROR watcher.applier.actions.migration File "/usr/lib64/python3.12/http/client.py", line 300, in _read_status
2025-12-05 06:27:50.096 1 ERROR watcher.applier.actions.migration raise RemoteDisconnected("Remote end closed connection without"
2025-12-05 06:27:50.096 1 ERROR watcher.applier.actions.migration http.client.RemoteDisconnected: Remote end closed connection without response
[... urllib3 and requests exception handling ...]
2025-12-05 06:27:50.096 1 ERROR watcher.applier.actions.migration keystoneauth1.exceptions.connection.ConnectFailure: Unable to establish connection to https://nova-internal.openstack.svc:8774/v2.1/servers/bbcb12e6-ebf7-49e2-847a-65f1b3a3266c: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
2025-12-05 06:27:50.099 1 CRITICAL watcher.applier.actions.migration [None req-ced62887-1ad4-4c3c-a7ae-96f7ffc873ad - - - - - -] Unexpected error occurred. Migration failed for instance bbcb12e6-ebf7-49e2-847a-65f1b3a3266c. Leaving instance on previous host.: keystoneauth1.exceptions.connection.ConnectFailure: Unable to establish connection to https://nova-internal.openstack.svc:8774/v2.1/servers/bbcb12e6-ebf7-49e2-847a-65f1b3a3266c: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
HTTP Keep-Alive Headers from Nova API (watcher-applier.log:4247, 4269)
2025-12-05 06:27:39.279 1 DEBUG novaclient.v2.client RESP: [200] Connection: Keep-Alive Content-Length: 2196 Content-Type: application/json Date: Fri, 05 Dec 2025 06:27:39 GMT Keep-Alive: timeout=5, max=97 OpenStack-API-Version: compute 2.56 Server: Apache
2025-12-05 06:27:45.092 1 DEBUG novaclient.v2.client RESP: [200] Connection: Keep-Alive Content-Length: 2196 Content-Type: application/json Date: Fri, 05 Dec 2025 06:27:44 GMT Keep-Alive: timeout=5, max=100 OpenStack-API-Version: compute 2.56 Server: Apache
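The Keep-Alive response header carries two parameters: timeout is how many seconds Apache keeps an idle connection open, and max is how many more requests it will accept on that connection. Parsing the two headers above shows the pattern in the log: max=97 on a connection that had already served requests, and max=100 (the MaxKeepAliveRequests default) on the connection opened after the reset. A minimal parser, written for illustration only:

```python
def parse_keep_alive(value: str) -> dict[str, int]:
    """Parse a Keep-Alive header value such as "timeout=5, max=97" into integers."""
    params: dict[str, int] = {}
    for part in value.split(","):
        name, sep, raw = part.strip().partition("=")
        if sep and raw.strip().isdigit():  # skip malformed or non-numeric parameters
            params[name.strip().lower()] = int(raw.strip())
    return params


reused = parse_keep_alive("timeout=5, max=97")  # 06:27:39 response, connection already in use
fresh = parse_keep_alive("timeout=5, max=100")  # 06:27:45 response, new connection after the reset
print(reused)  # {'timeout': 5, 'max': 97}
print(fresh)   # {'timeout': 5, 'max': 100}
```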
Successful Migration Completion (compute-1 sosreport messages log)
Dec 5 01:27:58 np0005546357 nova_compute[186330]: 2025-12-05 06:27:58.075 186334 INFO nova.virt.libvirt.driver [None req-785cfdf8-239e-4e33-973e-3cf70c84496e e80eb9b0343d45d5892eedc9dac67ae8 d8fe610270ef4e7f8f4c5bb46d2f9b58 - - default default] [instance: bbcb12e6-ebf7-49e2-847a-65f1b3a3266c] Migration operation has completed
Dec 5 01:27:58 np0005546357 nova_compute[186330]: 2025-12-05 06:27:58.075 186334 INFO nova.compute.manager [None req-785cfdf8-239e-4e33-973e-3cf70c84496e e80eb9b0343d45d5892eedc9dac67ae8 d8fe610270ef4e7f8f4c5bb46d2f9b58 - - default default] [instance: bbcb12e6-ebf7-49e2-847a-65f1b3a3266c] _post_live_migration() is started..
Root Cause Analysis
The Race Condition
1. Apache httpd configuration: Nova API uses Apache with default KeepAliveTimeout 5 (seconds)
2. Watcher polling interval: Polls Nova API approximately every 5 seconds (watcher-applier.log:4255, 4272)
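3. urllib3 connection pooling: Before reusing a pooled connection, urllib3 only checks, without blocking, whether the server's close has already reached the socket (this check produced the "Resetting dropped connection" message at 06:27:44.459). It cannot detect a close that arrives after the check.
4. Timing conflict: When a poll lands right at the 5-second boundary, the server closes the idle connection at almost the same moment the client reuses it. The client has not yet seen the close, so the request is sent on a connection the server is tearing down, and the read fails with RemoteDisconnected.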
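urllib3 guards reuse with a zero-timeout readability check: an idle keep-alive socket should have nothing to read, so if it is readable, the server has most likely closed it. The sketch below imitates that kind of check with select.select on a local socket pair; it is an approximation for illustration, not urllib3's actual code. It shows why the check is only as good as its timing: it reports the close only after the close has reached the client, so a close that arrives a moment after the check goes unnoticed.

```python
import select
import socket


def looks_dropped(sock: socket.socket) -> bool:
    """Zero-timeout readability check on an idle keep-alive socket.

    An idle connection should have nothing pending; if select() reports it
    readable, the peer has closed it (EOF is pending) or sent unexpected data,
    so it should not be reused.
    """
    readable, _, _ = select.select([sock], [], [], 0)
    return bool(readable)


client, server = socket.socketpair()
print("before server close:", looks_dropped(client))  # nothing pending yet
server.close()  # the "server" drops the idle connection
print("after server close: ", looks_dropped(client))  # EOF is now readable
client.close()
```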
Configuration Evidence
The Apache httpd configuration file (nova-api-config-data.yaml-httpd.conf) does not set any KeepAlive directives, so the Apache 2.4 defaults apply:
# File: nova-api-config-data.yaml-httpd.conf
# Lines 31-74: VirtualHost configuration for nova-internal.openstack.svc
# NO KeepAlive directives present - using Apache defaults:
# KeepAlive On
# KeepAliveTimeout 5
# MaxKeepAliveRequests 100
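To confirm that no file overrides these defaults, the settings can be extracted mechanically. A rough sketch, assuming the configuration is available as plain text and ignoring Include files and per-VirtualHost scoping; the sample VirtualHost text is invented for illustration and is not the deployed file.

```python
import re

# Apache httpd 2.4 core defaults, used when a directive is not set in the file.
APACHE_24_DEFAULTS = {"KeepAlive": "On", "KeepAliveTimeout": "5", "MaxKeepAliveRequests": "100"}

# Matches an uncommented directive at the start of a line, e.g. "  KeepAliveTimeout 15".
_DIRECTIVE = re.compile(
    r"^\s*(KeepAliveTimeout|MaxKeepAliveRequests|KeepAlive)\s+(\S+)",
    re.IGNORECASE | re.MULTILINE,
)


def effective_keepalive(conf_text: str) -> dict[str, str]:
    """Return the keep-alive settings a flat httpd.conf text would apply.

    Simplification: ignores <VirtualHost> scoping and Include files; the last
    occurrence of a directive wins.
    """
    settings = dict(APACHE_24_DEFAULTS)
    canonical = {name.lower(): name for name in APACHE_24_DEFAULTS}
    for match in _DIRECTIVE.finditer(conf_text):
        # Directive names are case-insensitive; store them under the canonical spelling.
        settings[canonical[match.group(1).lower()]] = match.group(2)
    return settings


# Hypothetical sample: a VirtualHost with only a commented-out directive.
sample = """
<VirtualHost *:8774>
  ServerName nova-internal.openstack.svc
  # KeepAliveTimeout 5
</VirtualHost>
"""
print(effective_keepalive(sample))
```

Run against the text of nova-api-config-data.yaml-httpd.conf, a file with no KeepAlive directives would report exactly the three defaults listed above.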
Reproduction Steps
1. Deploy OpenStack with Watcher enabled
2. Create multiple instances on the same compute node
3. Execute Watcher workload balancing strategy to trigger live migration
4. Observe Watcher applier logs during migration monitoring
5. The error occurs intermittently, when a poll lands on the Keep-Alive timeout boundary
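The failure mode can also be reproduced locally without OpenStack. The sketch below is a stand-in, not Watcher or Nova code: Python's http.server plays the Apache role with a 0.5-second idle timeout (instead of 5 seconds), a single http.client connection plays the pooled connection, and the handler, path and timings are illustrative assumptions. Because plain http.client does no dropped-connection check at all, polling slower than the idle timeout fails every time; with urllib3 in the loop, the same error only appears when the server's close races with urllib3's check, which is why the real bug is intermittent.

```python
import http.client
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

IDLE_TIMEOUT_S = 0.5  # stands in for Apache's KeepAliveTimeout (5 s), shortened for a quick run


class KeepAliveHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # keep connections open between requests
    timeout = IDLE_TIMEOUT_S       # server closes the connection after this much idle time

    def do_GET(self):
        body = b'{"status": "MIGRATING"}'
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet (also silences the idle-timeout message)


def poll_twice(poll_interval_s: float):
    """GET twice over one reused connection; return (first status, error or None)."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), KeepAliveHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1])
    try:
        conn.request("GET", "/v2.1/servers/example")
        first = conn.getresponse()
        first.read()  # drain the body so the connection can be reused
        time.sleep(poll_interval_s)  # meanwhile the server's idle timer may run out
        try:
            conn.request("GET", "/v2.1/servers/example")
            conn.getresponse().read()
            return first.status, None
        except ConnectionError as exc:  # RemoteDisconnected subclasses ConnectionResetError
            return first.status, exc
    finally:
        conn.close()
        server.shutdown()
        server.server_close()


status, error = poll_twice(poll_interval_s=2 * IDLE_TIMEOUT_S)
print(status, type(error).__name__)
```

Calling poll_twice(poll_interval_s=0.1) instead should return no error, because the second request arrives while the server is still holding the connection open.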
Expected Behavior
Watcher should successfully monitor the migration and report accurate status regardless of HTTP Keep-Alive timeout values.
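One possible client-side way to meet this expectation is to treat a dropped connection on the read-only status poll as transient and retry it on a fresh connection. This is a generic sketch, not Watcher's code or an agreed fix: fetch_status is a hypothetical callable standing in for the server status GET, and because the applier sees keystoneauth1's ConnectFailure rather than Python's built-in ConnectionError, a real change would pass that exception type through retry_on.

```python
import time
from typing import Callable, Tuple, Type


def poll_with_reconnect(
    fetch_status: Callable[[], str],
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError,),
    retries: int = 1,
    backoff_s: float = 0.0,
) -> str:
    """Return fetch_status(), retrying up to `retries` times if it raises one of `retry_on`.

    Retrying is reasonable only because the status poll is a read-only GET; a dropped
    keep-alive connection says nothing about the state of the migration itself.
    """
    attempt = 0
    while True:
        try:
            return fetch_status()
        except retry_on:
            if attempt >= retries:
                raise  # still failing after the allowed retries: report the real error
            attempt += 1
            time.sleep(backoff_s)  # optional pause before the next attempt


# Usage with a hypothetical poll that drops the connection once, then succeeds.
outcomes = iter([ConnectionResetError("Remote end closed connection without response"), "ACTIVE"])


def fake_fetch_status() -> str:
    outcome = next(outcomes)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


print(poll_with_reconnect(fake_fetch_status))  # prints: ACTIVE
```

A single retry is enough for this race, since the replacement connection is fresh; a persistent connection failure still surfaces after the retry budget is spent.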
Actual Behavior
Watcher reports the migration as failed ("Leaving instance on previous host.") even though the migration completes successfully on the Nova side.
Additional Notes
- The same condition can affect clients of other OpenStack APIs served by Apache with the same default configuration (Keystone, etc.)
- All services in the deployment show Keep-Alive: timeout=5, max=100 in their HTTP responses
- The bug is intermittent and depends on precise timing alignment between client and server timeouts
References
- Apache httpd KeepAlive documentation: https://httpd.apache.org/docs/2.4/mod/core.html#keepalive
- urllib3 connection pooling: https://urllib3.readthedocs.io/en/stable/advanced-usage.html#connection-pooling
- Python http.client RemoteDisconnected: https://docs.python.org/3/library/http.client.html
Bug Report assisted by Claude
- duplicates: OSPRH-23816 Migration status get fails on ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
- Verified