  1. WildFly
  2. WFLY-5762

Messaging replication fails to check-for-live-server on restart

    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • Fix Version/s: 10.0.0.CR5
    • Affects Version/s: 10.0.0.CR4
    • Component/s: JMS
    • Labels: None

      Use case:

      • 2 standalone-full-ha servers that form an Artemis cluster using replication
      • start server1 (replication-master with check-for-live-server=true)
      • start server2 (replication-slave with allow-failback=true)
        => server2 is the backup and waits for the live server1 to fail
      • kill server1 (with Ctrl+C)
        => server2 fails over and becomes live
      • restart server1
        => server1 restarts as a live server
        => server2 does not fail back and remains a live server

      We end up with 2 live servers, whereas the expected outcome is that server1 becomes the live server and server2 fails back and becomes a backup server again.

      The attached configuration uses JGroups.

      I had a look at the code and I suspect the issue occurs when server1 is restarted and calls its SharedNothingLiveActivation#isNodeIdUsed().
      This method returns false, so the server completes its live activation instead of setting its HA policy to replicaPolicy.
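
A minimal Java sketch of the restart decision described above. The names `ActivationSketch`, `Role`, `chooseRole`, and the `nodeIdUsed` supplier are hypothetical and this is not the Artemis implementation; it only models the assumption that a false answer from the node-ID check sends the restarting server down the live path.

```java
import java.util.function.BooleanSupplier;

public class ActivationSketch {
    public enum Role { LIVE, BACKUP }

    // Hypothetical model of the decision: with check-for-live-server enabled,
    // the restarting server should step down to BACKUP only if another live
    // server is already using its node ID.
    public static Role chooseRole(boolean checkForLiveServer, BooleanSupplier nodeIdUsed) {
        if (checkForLiveServer && nodeIdUsed.getAsBoolean()) {
            return Role.BACKUP;
        }
        return Role.LIVE;
    }

    public static void main(String[] args) {
        // Expected behaviour: server2 is live with server1's node ID.
        System.out.println(chooseRole(true, () -> true));
        // Reported behaviour: the check wrongly answers false.
        System.out.println(chooseRole(true, () -> false));
    }
}
```

In this model the second call corresponds to the reported symptom: the check is enabled, but because it answers false, server1 still activates as live while server2 is also live.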

      Digging into the code, it looks like the DiscoveryGroup#received boolean is never set to true because its corresponding JGroupsBroadcastEndpoint never receives any JGroups message.
      I can confirm that server2 is running at that time and does send JGroups messages.
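
To illustrate why a silent endpoint leaves the flag false, here is a small Java sketch. `DiscoveryFlagSketch`, `deliver`, and `waitForBroadcast` are names made up for this example and are not taken from the Artemis sources; the assumption is simply that the flag flips only when a broadcast actually reaches the endpoint within the timeout.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class DiscoveryFlagSketch {
    // Hypothetical endpoint inbox: broadcasts handed to the endpoint land here.
    private final BlockingQueue<byte[]> inbox = new LinkedBlockingQueue<>();
    private volatile boolean received;

    // Called when a broadcast is delivered to the endpoint.
    public void deliver(byte[] payload) {
        inbox.add(payload);
    }

    // Waits up to timeoutMillis for one broadcast. The flag is set only if
    // data arrives, so an endpoint that is never fed keeps returning false.
    public boolean waitForBroadcast(long timeoutMillis) {
        try {
            byte[] data = inbox.poll(timeoutMillis, TimeUnit.MILLISECONDS);
            if (data != null) {
                received = true;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return received;
    }
}
```

If nothing ever calls `deliver`, every `waitForBroadcast` call times out and returns false, which matches the observation that the live-server check finds no other node.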

      I suspect that there is a bug in the wrapping of the JGroups receiver/channel/etc. in org.apache.activemq.artemis.api.core.JGroupsBroadcastEndpoint: the endpoint in DiscoveryGroup never receives the message, even though JGroups does deliver it to the ReceiverAdapter instantiated by JGroupsBroadcastEndpoint.JChannelWrapper#connect.
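
One way such a symptom could arise is sketched below in Java. This is a guess at the general shape of the problem, not a diagnosis of the actual Artemis code: `ChannelWrapperSketch`, `addReceiver`, and `onTransportMessage` are hypothetical names, and the assumption is that a wrapper fans transport-level messages out to registered endpoint receivers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

public class ChannelWrapperSketch {
    // Endpoint-level receivers that should be notified of each message.
    private final List<Consumer<byte[]>> receivers = new CopyOnWriteArrayList<>();
    private int transportMessages;

    public void addReceiver(Consumer<byte[]> receiver) {
        receivers.add(receiver);
    }

    // Stand-in for the transport-level adapter callback: the message has
    // arrived at the wrapper, but it only reaches registered receivers.
    public void onTransportMessage(byte[] payload) {
        transportMessages++;
        for (Consumer<byte[]> receiver : receivers) {
            receiver.accept(payload);
        }
    }

    public int transportMessages() {
        return transportMessages;
    }

    public static void main(String[] args) {
        ChannelWrapperSketch wrapper = new ChannelWrapperSketch();
        List<byte[]> endpointInbox = new ArrayList<>();

        // The endpoint's receiver is not registered on this wrapper yet.
        wrapper.onTransportMessage(new byte[] {1});
        System.out.println(wrapper.transportMessages() + " at transport, "
                + endpointInbox.size() + " at endpoint");

        // Once registered, the same kind of message is forwarded.
        wrapper.addReceiver(endpointInbox::add);
        wrapper.onTransportMessage(new byte[] {2});
        System.out.println(wrapper.transportMessages() + " at transport, "
                + endpointInbox.size() + " at endpoint");
    }
}
```

The first message is counted at the transport level but never reaches the endpoint, which is the gap the report describes between what JGroups receives and what the discovery endpoint sees.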

              Assignee: Andy Taylor
              Reporter: Jeff Mesnil