AMQ Interconnect / ENTMQIC-3307

Interconnect Router Deployed on OpenShift 4 Stops Routing after Node Eviction


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Affects Version/s: 1.10.1.GA
    • Component/s: Qpid Dispatch Router
      Steps to Reproduce:
      • Create a 3-node router mesh (interior mode) on OpenShift, making sure the routers are distributed across at least 2 application nodes. The routers in this instance were connecting to an external AMQ broker (a minimal interior-mode configuration is sketched after this list).
      • Connect consumer applications to the service endpoint for the router mesh, also distributed across the application nodes.
      • Initiate a drain on the application node to evict the pods (a sample drain command is also shown below).
      • Wait until the migrated pod is back up on one of the remaining nodes.
      • You should see the routing errors and unsettled message warnings in the logs of the migrated router.
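
      For reference, a minimal interior-mode configuration for one of the mesh routers looks roughly like the sketch below. The router id, host names, and broker address are placeholders; if the mesh is deployed through the Interconnect Operator, the equivalent settings would come from its custom resource rather than a hand-written qdrouterd.conf.

      router {
          mode: interior
          id: router-a
      }

      # Inter-router listener on port 55672 (the port seen in the log entries below)
      listener {
          host: 0.0.0.0
          port: 55672
          role: inter-router
      }

      # Connector to a peer router in the mesh (host is a placeholder)
      connector {
          host: router-b.example.svc
          port: 55672
          role: inter-router
      }

      # Connector to the external AMQ broker (host is a placeholder; role depends on the routing setup)
      connector {
          name: broker
          host: broker.example.com
          port: 5672
          role: route-container
      }

      A drain can be initiated with the usual oc command; the node name is a placeholder and the exact flags depend on the cluster version:

      # Evict pods from the application node hosting one of the routers
      oc adm drain <application-node> --ignore-daemonsets --delete-emptydir-data

      # Watch the evicted router pod get rescheduled on one of the remaining nodes
      oc get pods -o wide -w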

      In a 3-router mesh deployment on OpenShift 4.8.35, we observe the following behavior when we drain an application node containing one of the routers.

      1. The remaining routers that are not moved continue to function normally
      2. The moved router starts continuously logging "no route to host" warnings in the following format:

      2022-05-13 19:32:56.598748 +0000 SERVER (info) [C113] Connection to 10.210.46.90:55672 failed: proton:io No route to host - disconnected 10.210.46.90:55672
      

      3. The IP address in these log entries is the former address of the moved router (as if the router is trying to connect to itself on its old IP address)
      4. We can see applications connect to the router, but deliveries appear to remain stuck/unsettled on these connections:

      2022-05-13 19:36:07.127228 +0000 ROUTER_CORE (info) [C2][L40] Stuck delivery: At least one delivery on this link has been undelivered/unsettled for more than 10 seconds
      

      It appears that the router's old IP address is not being cleaned up somewhere, and the router is attempting to add a connector to its old IP address. It is unclear whether this is related to the issue with unsettled deliveries or is just another manifestation of the underlying cause.
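
      If it helps with diagnosis, the stale peer can be checked from inside the migrated router pod with the standard tooling (the pod name is a placeholder, and this assumes qdstat/qdmanage are available in the router image):

      # List connections; a disconnected inter-router peer should show the stale IP
      oc exec <migrated-router-pod> -- qdstat -c

      # List the configured connectors to see whether one still points at the old address
      oc exec <migrated-router-pod> -- qdmanage query --type=org.apache.qpid.dispatch.connector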

      Other notes: Along with one of the routers, several of the client application pods were also migrated.

      Restarting/killing the router seems to resolve the issue; when it comes back, message flow resumes (a workaround sketch is shown below).
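
      As a workaround sketch (assuming the router pods are managed by a controller such as a Deployment or the Interconnect Operator, so a deleted pod is recreated automatically; the pod name is a placeholder):

      # Delete the stuck router pod; its controller recreates it and message flow resumes
      oc delete pod <migrated-router-pod>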

              Assignee: Michael Cressman (mcressma@redhat.com, inactive)
              Reporter: Duane Hawkins (rhn-support-dhawkins)
              Votes: 0
              Watchers: 3
