  1. AMQ Interconnect
  2. ENTMQIC-3307

Interconnect Router Deployed on Openshift 4 Stops Routing after Node Eviction


Details

    • Bug
    • Resolution: Unresolved
    • Major
    • None
    • 1.10.1.GA
    • Qpid Dispatch Router
    • None
    • False
    • None
    • False
    • Steps to reproduce:
      • Create a 3-router mesh (interior mode) on OpenShift, making sure the routers are distributed across at least 2 application nodes. The routers in this instance were connected to an external AMQ broker.
      • Connect consumer applications, also distributed across the application nodes, to the service endpoint for the router mesh.
      • Initiate a drain on an application node that hosts one of the routers to evict its pods.
      • Wait until the migrated router pod is back up on one of the remaining nodes.
      • The logs of the migrated router should now show routing errors and unsettled-delivery warnings.
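
      The drain step in the reproduction can be sketched as a short Python helper that only builds the `oc` command lines and does not run them. The node name `worker-1` and namespace `amq-interconnect` are placeholders, not values from this report, and the drain flags shown are the commonly needed ones rather than the exact ones used here.

```python
import shlex
from typing import List


def drain_commands(node: str, namespace: str) -> List[List[str]]:
    """Return the oc invocations for the drain step, in order.

    Nothing is executed here; the caller decides how to run them.
    """
    return [
        # Show which node each router and client pod currently runs on.
        ["oc", "get", "pods", "-n", namespace, "-o", "wide"],
        # Cordon the node and evict its pods.
        ["oc", "adm", "drain", node,
         "--ignore-daemonsets", "--delete-emptydir-data"],
        # Watch until the evicted router pod is Running on another node.
        ["oc", "get", "pods", "-n", namespace, "-o", "wide", "--watch"],
    ]


if __name__ == "__main__":
    # Placeholder node and namespace names (assumptions).
    for cmd in drain_commands("worker-1", "amq-interconnect"):
        print(shlex.join(cmd))
```

      Comparing the pod IP column before and after the drain gives the old and new addresses of the migrated router, which is what the log entries in the description refer to.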

    Description

      In a 3-router mesh deployment on OpenShift 4.8.35, we observe the following behavior when we drain an application node that hosts one of the routers:

      1. The routers that are not moved continue to function normally.
      2. The moved router continuously logs "No route to host" connection failures in the following format:

      2022-05-13 19:32:56.598748 +0000 SERVER (info) [C113] Connection to 10.210.46.90:55672 failed: proton:io No route to host - disconnected 10.210.46.90:55672
      

      3. The IP address in these log entries is the former address of the moved router (i.e., it is as if the router is trying to connect to itself on its old IP address).
      4. Applications can connect to the moved router, but deliveries on these connections appear to remain stuck / unsettled:

      2022-05-13 19:36:07.127228 +0000 ROUTER_CORE (info) [C2][L40] Stuck delivery: At least one delivery on this link has been undelivered/unsettled for more than 10 seconds
      

      It appears that the old IP address of the router is not removed somewhere, and the router keeps attempting to add a connector to its old IP address. It is unclear whether this is related to the issue with unsettled deliveries or is just another manifestation of the underlying cause.
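
      The two symptoms above can be checked against a router log with a small triage script. This is a minimal sketch, assuming the log lines follow the two formats quoted in this report. The `triage` helper and the live pod IP `10.210.50.12` are hypothetical. In practice the set of live IPs would come from the current pod listing.

```python
import re
from collections import Counter

# Patterns modeled on the two log lines quoted in this report.
FAILED = re.compile(
    r"SERVER \(\w+\) \[C\d+\] Connection to (?P<host>[\d.]+):(?P<port>\d+) "
    r"failed: .*No route to host"
)
STUCK = re.compile(
    r"ROUTER_CORE \(\w+\) \[(?P<conn>C\d+)\]\[(?P<link>L\d+)\] Stuck delivery"
)

SAMPLE = [
    "2022-05-13 19:32:56.598748 +0000 SERVER (info) [C113] Connection to "
    "10.210.46.90:55672 failed: proton:io No route to host - disconnected "
    "10.210.46.90:55672",
    "2022-05-13 19:36:07.127228 +0000 ROUTER_CORE (info) [C2][L40] Stuck "
    "delivery: At least one delivery on this link has been "
    "undelivered/unsettled for more than 10 seconds",
]


def triage(lines, live_pod_ips):
    """Count failed connects to IPs that no live pod owns, and collect
    the (connection, link) pairs that reported stuck deliveries."""
    stale = Counter()
    stuck = set()
    for line in lines:
        failed = FAILED.search(line)
        if failed:
            if failed.group("host") not in live_pod_ips:
                stale[failed.group("host")] += 1
            continue
        stuck_match = STUCK.search(line)
        if stuck_match:
            stuck.add((stuck_match.group("conn"), stuck_match.group("link")))
    return stale, stuck


if __name__ == "__main__":
    # Hypothetical current pod IP; the stale address is the one from the log.
    stale, stuck = triage(SAMPLE, live_pod_ips={"10.210.50.12"})
    print("stale connector targets:", dict(stale))
    print("links with stuck deliveries:", sorted(stuck))
```

      A non-empty set of stale targets alongside stuck links on the same router would match the behavior described here, although it does not by itself show that one causes the other.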

      Other notes: Along with one of the routers, several of the client application pods were also migrated.

      Restarting or killing the affected router seems to resolve the issue: when it comes back, message flow resumes.

          People

            mcressma@redhat.com Michael Cressman
            rhn-support-dhawkins Duane Hawkins