  1. JBoss Enterprise Application Platform 4 and 5
  2. JBPAPP-7205

HornetQ - HA with disconnected journal


    • Type: Bug
    • Resolution: Done
    • Priority: Critical
    • EAP_EWP 5.1.1, EAP_EWP 5.1.2 CR1, EAP_EWP 5.1.2 CR3, EAP_EWP 5.1.2 CR4
    • HornetQ
    • None
    • RHEL 6 x86-64 with GFS2/SAN

    • In a cluster of HornetQ instances, when a node's journal became disconnected (for example, when the server lost its connection to the SAN), the other nodes did not take over for the failed node. This problem has now been fixed, and failover from a failed HornetQ node now occurs without interruption to the client.
    • Documented as Resolved Issue
    • NEW

      Hi Clebert,

      As we agreed, we have started developing tests with a disconnected journal according to the HornetQ test plan (section 10). For now, all test scenarios fail because the HornetQ architecture was not initially designed to handle this situation. I'd like to share the current test results and some information about the testing environment.

      Test Scenario - "Node is disconnected from journal" - collocated backup (corresponds to section 10.1.1):
      1. Start the cluster - EAP servers A and B
      2. Start a "live" producer and a "live" consumer, both connected to server A and using "liveQueue" - they stay active for the whole duration of the test
      3. Start a producer - send 1000 messages to "testQueue" on server A
      4. Disconnect the SAN from server A
      5. Start a consumer - read from "testQueue" on server B

      Pass criteria:
      After step 4, the backup node takes over the live role.
      Clients are reconnected to the backup node and are able to continue their work.
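
      The second pass criterion can be pictured with a small model: a client that tries an ordered list of nodes and carries on with the first one that answers. This is only a sketch of the expected observable behaviour, not how HornetQ implements failover. The class `FailoverSketch`, the method `callWithFailover`, and the endpoint names are hypothetical and are not part of the HornetQ or JMS API.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch: none of these names are HornetQ or JMS API.
class FailoverSketch {
    // Try each endpoint in order and return the first successful result.
    static <T> T callWithFailover(List<String> endpoints, Function<String, T> call) {
        RuntimeException last = null;
        for (String endpoint : endpoints) {
            try {
                return call.apply(endpoint);
            } catch (RuntimeException e) {
                last = e; // remember the failure and move on to the next node
            }
        }
        throw new IllegalStateException("no endpoint available", last);
    }

    public static void main(String[] args) {
        // Server A fails as if its journal were unreachable; server B answers.
        String reply = callWithFailover(List.of("serverA", "serverB"), endpoint -> {
            if (endpoint.equals("serverA")) {
                throw new RuntimeException("journal unavailable on " + endpoint);
            }
            return "sent via " + endpoint;
        });
        System.out.println(reply); // prints "sent via serverB"
    }
}
```

      In the model, the test passes when the call completes against server B after server A fails; in the real scenario the equivalent check is that the "live" producer and consumer keep working after step 4.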

      Test results:
      After step 4:

      • EAP server B does not take over its role - the backup does not become live
      • The "live" producer and consumer end with exceptions (see attached logs) and do not fail over to EAP server B

      In step 5, the consumer on EAP node B is able to read only half of the messages sent to "testQueue" in step 3 (load balancing).
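
      The "only half" result is consistent with the 1000 messages having been spread evenly over the two nodes' journals. A minimal sketch of that arithmetic follows, assuming simple round-robin distribution across two nodes. The round-robin policy is an assumption made for illustration, and `LoadBalanceModel` is a hypothetical name, not HornetQ code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model, not HornetQ code: one journal per node, filled by
// round-robin distribution of the messages sent to a clustered queue.
class LoadBalanceModel {
    // Distribute messageCount messages round-robin over nodeCount journals.
    static List<List<Integer>> distribute(int messageCount, int nodeCount) {
        List<List<Integer>> journals = new ArrayList<>();
        for (int n = 0; n < nodeCount; n++) {
            journals.add(new ArrayList<>());
        }
        for (int id = 0; id < messageCount; id++) {
            journals.get(id % nodeCount).add(id);
        }
        return journals;
    }

    public static void main(String[] args) {
        List<List<Integer>> journals = distribute(1000, 2);
        // Server A (index 0) lost its journal; only server B (index 1) can deliver.
        System.out.println("readable from B: " + journals.get(1).size()); // prints "readable from B: 500"
    }
}
```

      Under this assumption, the 500 messages held in server A's journal stay unreachable until a backup takes over that journal.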

      Note about testing environment:
      GFS2/SAN uses the "fenced" daemon, which powers off nodes that have failed. Disconnecting the SAN triggers this, but it takes a couple of minutes. In our test scenario, after step 4 new clients can still connect to EAP server A, but they fail to read or send any messages. EAP server B only delivers the messages that are in its own journal when clients connect to it.

      Do we already have a solution for this?

      Thank you,

      Mirek

        1. live_consumer.log
          5 kB
          Miroslav Novak
        2. live_producer.log
          10 kB
          Miroslav Novak
        3. san_consumer_threaddump.txt
          12 kB
          Miroslav Novak
        4. serverA.log
          79 kB
          Miroslav Novak
        5. server-A-threaddump.txt
          157 kB
          Miroslav Novak
        6. serverB.log
          37 kB
          Miroslav Novak
        7. server-B-threaddump.txt
          187 kB
          Miroslav Novak

              csuconic@redhat.com Clebert Suconic
              mnovak1@redhat.com Miroslav Novak
              Russell Dickenson (Inactive)