Red Hat OpenStack Services on OpenShift / OSPRH-22898

Migration status get fails on ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))


    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • openstack-watcher
    • Important

      Summary

        The Watcher applier fails while monitoring live migration progress, raising ConnectFailure: Remote end closed connection without response. The migration completes successfully, but Watcher reports it as failed. The cause is a race between the 5-second HTTP Keep-Alive timeout of the Apache httpd instance serving the Nova API and Watcher's polling interval of approximately 5 seconds.

        Impact

        - Affected Component: Watcher applier (live migration monitoring)
        - User Impact: Watcher incorrectly reports successful migrations as failed, causing action plans to fail even when the underlying Nova operation succeeds
        - Workaround: Migration completes successfully despite the error; Watcher can retry

        Detailed Description

        When Watcher applier triggers a live migration, it monitors the migration progress by polling Nova API every ~5 seconds. Apache httpd serving Nova API uses the default HTTP Keep-Alive timeout of 5 seconds. This creates a race condition where urllib3's connection pool reuses a connection that has expired on the server side, resulting in a RemoteDisconnected error.

        Timeline of Events

        06:27:38.621 - Watcher initiates live migration for instance bbcb12e6-ebf7-49e2-847a-65f1b3a3266c
        06:27:39.099 - POST /v2.1/servers/.../action returns HTTP 202 (migration accepted)
        06:27:39.456 - GET /v2.1/servers/... returns HTTP 200 (status: MIGRATING)
                        Keep-Alive: timeout=5, max=96
        06:27:44.459 - Connection reset detected: "Resetting dropped connection: nova-internal.openstack.svc"
        06:27:45.091 - GET /v2.1/servers/... returns HTTP 200 (new connection established)
                        Keep-Alive: timeout=5, max=100
        06:27:50.095 - Watcher attempts GET (5.004 seconds after the previous request, just past the Keep-Alive timeout)
        06:27:50.096 - ERROR: Remote end closed connection without response
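
        The idle gaps between consecutive polls can be checked against the advertised timeout. The following is a minimal sketch that uses only the timestamps listed in the timeline; the helper name idle_seconds and the constant names are illustrative and not part of Watcher.

```python
from datetime import datetime

KEEP_ALIVE_TIMEOUT = 5.0  # seconds, from "Keep-Alive: timeout=5" in the responses

# (previous GET, next poll attempt) timestamp pairs taken from the timeline
POLL_PAIRS = [
    ("06:27:39.456", "06:27:44.459"),
    ("06:27:45.091", "06:27:50.095"),
]


def idle_seconds(previous: str, current: str) -> float:
    """Return how long the connection sat idle between two log timestamps."""
    fmt = "%H:%M:%S.%f"
    delta = datetime.strptime(current, fmt) - datetime.strptime(previous, fmt)
    return delta.total_seconds()


for previous, current in POLL_PAIRS:
    idle = idle_seconds(previous, current)
    past_timeout = idle >= KEEP_ALIVE_TIMEOUT
    print(f"{previous} -> {current}: idle {idle:.3f}s, past timeout: {past_timeout}")
```

        Both gaps exceed the 5-second timeout by a few milliseconds. In the first case urllib3 noticed the closed connection and reset it; in the second case the request was sent on the stale connection and failed.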

        Meanwhile, the migration actually completes successfully:

        06:27:54 - Live migration initiated on compute-1
        06:27:58 - Migration operation has completed ✓
        06:27:58 - _post_live_migration() started
        06:27:59 - Activated binding for port on compute-0

        Error Logs

        Watcher Applier Error (watcher-applier.log:4265-4277)

        2025-12-05 06:27:44.459 1 DEBUG urllib3.connectionpool [None req-ced62887-1ad4-4c3c-a7ae-96f7ffc873ad - - - - - -] Resetting dropped connection: nova-internal.openstack.svc _get_conn /usr/lib/python3.12/site-packages/urllib3/connectionpool.py:291

        2025-12-05 06:27:50.095 1 DEBUG novaclient.v2.client [None req-ced62887-1ad4-4c3c-a7ae-96f7ffc873ad - - - - - -] REQ: curl -g -i --cacert "/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem" -X GET https://nova-internal.openstack.svc:8774/v2.1/servers/bbcb12e6-ebf7-49e2-847a-65f1b3a3266c -H "Accept: application/json" -H "OpenStack-API-Version: compute 2.56" -H "User-Agent: python-novaclient" -H "X-Auth-Token: {SHA1}..." -H "X-OpenStack-Nova-API-Version: 2.56" _http_log_request /usr/lib/python3.12/site-packages/keystoneauth1/session.py:572

        2025-12-05 06:27:50.096 1 ERROR watcher.applier.actions.migration [None req-ced62887-1ad4-4c3c-a7ae-96f7ffc873ad - - - - - -] Unable to establish connection to https://nova-internal.openstack.svc:8774/v2.1/servers/bbcb12e6-ebf7-49e2-847a-65f1b3a3266c: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response')):
        keystoneauth1.exceptions.connection.ConnectFailure: Unable to establish connection to https://nova-internal.openstack.svc:8774/v2.1/servers/bbcb12e6-ebf7-49e2-847a-65f1b3a3266c: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

        Full Stack Trace (watcher-applier.log:4277-4379)

        2025-12-05 06:27:50.096 1 ERROR watcher.applier.actions.migration Traceback (most recent call last):
        2025-12-05 06:27:50.096 1 ERROR watcher.applier.actions.migration File "/usr/lib/python3.12/site-packages/urllib3/connectionpool.py", line 462, in _make_request
        2025-12-05 06:27:50.096 1 ERROR watcher.applier.actions.migration httplib_response = conn.getresponse()
        2025-12-05 06:27:50.096 1 ERROR watcher.applier.actions.migration ^^^^^^^^^^^^^^^^^^
        2025-12-05 06:27:50.096 1 ERROR watcher.applier.actions.migration File "/usr/lib64/python3.12/http/client.py", line 1430, in getresponse
        2025-12-05 06:27:50.096 1 ERROR watcher.applier.actions.migration response.begin()
        2025-12-05 06:27:50.096 1 ERROR watcher.applier.actions.migration File "/usr/lib64/python3.12/http/client.py", line 331, in begin
        2025-12-05 06:27:50.096 1 ERROR watcher.applier.actions.migration version, status, reason = self._read_status()
        2025-12-05 06:27:50.096 1 ERROR watcher.applier.actions.migration ^^^^^^^^^^^^^^^^^^^
        2025-12-05 06:27:50.096 1 ERROR watcher.applier.actions.migration File "/usr/lib64/python3.12/http/client.py", line 300, in _read_status
        2025-12-05 06:27:50.096 1 ERROR watcher.applier.actions.migration raise RemoteDisconnected("Remote end closed connection without"
        2025-12-05 06:27:50.096 1 ERROR watcher.applier.actions.migration http.client.RemoteDisconnected: Remote end closed connection without response

        [... urllib3 and requests exception handling ...]

        2025-12-05 06:27:50.096 1 ERROR watcher.applier.actions.migration keystoneauth1.exceptions.connection.ConnectFailure: Unable to establish connection to https://nova-internal.openstack.svc:8774/v2.1/servers/bbcb12e6-ebf7-49e2-847a-65f1b3a3266c: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

        2025-12-05 06:27:50.099 1 CRITICAL watcher.applier.actions.migration [None req-ced62887-1ad4-4c3c-a7ae-96f7ffc873ad - - - - - -] Unexpected error occurred. Migration failed for instance bbcb12e6-ebf7-49e2-847a-65f1b3a3266c. Leaving instance on previous host.: keystoneauth1.exceptions.connection.ConnectFailure: Unable to establish connection to https://nova-internal.openstack.svc:8774/v2.1/servers/bbcb12e6-ebf7-49e2-847a-65f1b3a3266c: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

        HTTP Keep-Alive Headers from Nova API (watcher-applier.log:4247, 4269)

        2025-12-05 06:27:39.279 1 DEBUG novaclient.v2.client RESP: [200] Connection: Keep-Alive Content-Length: 2196 Content-Type: application/json Date: Fri, 05 Dec 2025 06:27:39 GMT Keep-Alive: timeout=5, max=97 OpenStack-API-Version: compute 2.56 Server: Apache

        2025-12-05 06:27:45.092 1 DEBUG novaclient.v2.client RESP: [200] Connection: Keep-Alive Content-Length: 2196 Content-Type: application/json Date: Fri, 05 Dec 2025 06:27:44 GMT Keep-Alive: timeout=5, max=100 OpenStack-API-Version: compute 2.56 Server: Apache
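
        A client could read the advertised timeout from these headers and compare it with its own polling interval. The sketch below assumes the header value has the simple "timeout=N, max=M" form shown in the log lines; the function names and the one-second safety margin are hypothetical and are not taken from Watcher or novaclient.

```python
def parse_keep_alive(value: str) -> dict[str, int]:
    """Parse a Keep-Alive header value such as "timeout=5, max=100"."""
    params: dict[str, int] = {}
    for part in value.split(","):
        name, sep, raw = part.strip().partition("=")
        # Keep only well-formed integer parameters; ignore anything else.
        if sep and raw.strip().isdigit():
            params[name.strip().lower()] = int(raw)
    return params


def poll_collides_with_timeout(
    poll_interval: float, keep_alive: dict[str, int], safety_margin: float = 1.0
) -> bool:
    """Return True when polls land within safety_margin of the idle timeout.

    safety_margin is an assumed value chosen for illustration only.
    """
    timeout = keep_alive.get("timeout")
    if timeout is None:
        return False
    return poll_interval >= timeout - safety_margin


if __name__ == "__main__":
    header = parse_keep_alive("timeout=5, max=100")
    print(header, poll_collides_with_timeout(5.0, header))
```

        With the values from the log (timeout=5 and a polling interval of about 5 seconds) the check reports a collision; an interval of 2 seconds would not.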

        Successful Migration Completion (compute-1 sosreport messages log)

        Dec 5 01:27:58 np0005546357 nova_compute[186330]: 2025-12-05 06:27:58.075 186334 INFO nova.virt.libvirt.driver [None req-785cfdf8-239e-4e33-973e-3cf70c84496e e80eb9b0343d45d5892eedc9dac67ae8 d8fe610270ef4e7f8f4c5bb46d2f9b58 - - default default] [instance: bbcb12e6-ebf7-49e2-847a-65f1b3a3266c] Migration operation has completed

        Dec 5 01:27:58 np0005546357 nova_compute[186330]: 2025-12-05 06:27:58.075 186334 INFO nova.compute.manager [None req-785cfdf8-239e-4e33-973e-3cf70c84496e e80eb9b0343d45d5892eedc9dac67ae8 d8fe610270ef4e7f8f4c5bb46d2f9b58 - - default default] [instance: bbcb12e6-ebf7-49e2-847a-65f1b3a3266c] _post_live_migration() is started..

        Root Cause Analysis

        The Race Condition

        1. Apache httpd configuration: Nova API uses Apache with default KeepAliveTimeout 5 (seconds)
        2. Watcher polling interval: Polls Nova API approximately every 5 seconds (watcher-applier.log:4255, 4272)
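        3. urllib3 connection pooling: Reuses pooled TCP connections; its dropped-connection check (the "Resetting dropped connection" message at 06:27:44.459) only catches a close that has already reached the client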
        4. Timing conflict: When Watcher polls exactly at the 5-second boundary, the server has already closed the connection but the client hasn't detected it yet
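
        The failure mode in the list above can be reproduced without Nova or Apache. The following is a minimal sketch using only the Python standard library: a throwaway local server answers one request with the same Keep-Alive headers and then closes the socket, which stands in for the idle timer expiring, and the client reuses the connection anyway. All names here are illustrative, and the exact exception may be RemoteDisconnected or another ConnectionError subclass depending on timing.

```python
import http.client
import socket
import threading
from typing import Optional


def serve_once_then_close(listener: socket.socket) -> None:
    """Answer one request with keep-alive headers, then close the socket."""
    conn, _ = listener.accept()
    with conn:
        data = b""
        while b"\r\n\r\n" not in data:
            chunk = conn.recv(4096)
            if not chunk:
                return
            data += chunk
        body = b"{}"
        conn.sendall(
            b"HTTP/1.1 200 OK\r\n"
            b"Connection: Keep-Alive\r\n"
            b"Keep-Alive: timeout=5, max=100\r\n"
            b"Content-Type: application/json\r\n"
            b"Content-Length: " + str(len(body)).encode() + b"\r\n\r\n" + body
        )
    # Leaving the "with" block closes the server side of the connection.


def reuse_stale_connection() -> Optional[BaseException]:
    """Send a second request on a connection the server has already closed."""
    listener = socket.create_server(("127.0.0.1", 0))
    port = listener.getsockname()[1]
    server = threading.Thread(target=serve_once_then_close, args=(listener,))
    server.start()

    client = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        client.request("GET", "/")
        client.getresponse().read()
        server.join()  # the server has closed its end by now
        try:
            client.request("GET", "/")  # reuses the stale socket
            client.getresponse()
        except ConnectionError as exc:
            return exc
        return None
    finally:
        client.close()
        listener.close()


if __name__ == "__main__":
    print(repr(reuse_stale_connection()))
```

        The sketch removes the timing element on purpose by closing the connection immediately, so the error occurs on every run rather than intermittently.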

        Configuration Evidence

        The Apache httpd configuration file (nova-api-config-data.yaml-httpd.conf) does not explicitly set any KeepAlive parameters, so the Apache 2.4 defaults apply:

        # File: nova-api-config-data.yaml-httpd.conf
        # Lines 31-74: VirtualHost configuration for nova-internal.openstack.svc
        # NO KeepAlive directives present - using Apache defaults:
        # KeepAlive On
        # KeepAliveTimeout 5
        # MaxKeepAliveRequests 100

        Reproduction Steps

        1. Deploy OpenStack with Watcher enabled
        2. Create multiple instances on the same compute node
        3. Execute Watcher workload balancing strategy to trigger live migration
        4. Observe Watcher applier logs during migration monitoring
        5. Error occurs when polling interval aligns with Keep-Alive timeout boundary

        Expected Behavior

        Watcher should successfully monitor the migration and report accurate status regardless of HTTP Keep-Alive timeout values.
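
        One way a polling client could tolerate a stale pooled connection is to retry the idempotent status GET a limited number of times. This is a hedged sketch and not Watcher's implementation: call_with_retry, its parameters, and the defaults are hypothetical, and in Watcher the exception to retry on would presumably be the keystoneauth1 ConnectFailure seen in the logs rather than the built-in ConnectionError used here to keep the example self-contained.

```python
import http.client
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(
    fetch: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError,),
    attempts: int = 3,
    delay: float = 0.5,
) -> T:
    """Call fetch(), retrying on connection errors up to `attempts` times.

    Only suitable for idempotent calls such as a status GET.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            time.sleep(delay)
    raise AssertionError("unreachable")


if __name__ == "__main__":
    calls = []

    def flaky_status() -> str:
        """Fail once like a stale keep-alive connection, then succeed."""
        calls.append(1)
        if len(calls) == 1:
            raise http.client.RemoteDisconnected(
                "Remote end closed connection without response"
            )
        return "ACTIVE"

    print(call_with_retry(flaky_status, delay=0), len(calls))
```

        Retrying is reasonable here because the failed request never reached the server, so repeating the GET has no side effects.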

        Actual Behavior

        Watcher reports migration failure even though the migration completes successfully on Nova side.

        Additional Notes

        - This issue likely affects clients of other OpenStack services served by Apache with the same default configuration (Keystone, etc.)
        - All services in the deployment show Keep-Alive: timeout=5, max=100 in their HTTP responses
        - The bug is intermittent and depends on precise timing alignment between client and server timeouts

        References

        - Apache httpd KeepAlive documentation: https://httpd.apache.org/docs/2.4/mod/core.html#keepalive
        - urllib3 connection pooling: https://urllib3.readthedocs.io/en/stable/advanced-usage.html#connection-pooling
        - Python http.client RemoteDisconnected: https://docs.python.org/3/library/http.client.html

      Bug Report assisted by Claude

              amoralej1@redhat.com Alfredo Moralejo Alonso
              rhn-support-dsanzmor David Sanz Moreno
              David Sanz Moreno David Sanz Moreno
              rhos-workloads-evolution