Uploaded image for project: 'JGroups'
  1. JGroups
  2. JGRP-1376

ENCRYPT race condition hangs cluster channel

XMLWordPrintable

    • Icon: Bug Bug
    • Resolution: Done
    • Icon: Major Major
    • 3.0
    • 3.0
    • None
    • Release Notes
    • High
    • Workaround Exists
    • Hide

      Using a keystore should avoid the problem, since there are no key servers to begin with.

      Show
      Using a keystore should avoid the problem, since there are no key servers to begin with.

      We have two nodes on a channel. When starting both simultaneously (we spawn threads from a third control server), sometimes the JGroups channel hangs on both servers and we get the following logs forever:

      [WARN ] 2011-10-15 17:19:17 [ENCRYPT-Incoming-2,null] - attempting to use stored cipher as message does not uses current encryption version
      [WARN ] 2011-10-15 17:19:17 [ENCRYPT-Incoming-2,null] - Unable to find a matching cipher in previous key map
      [WARN ] 2011-10-15 17:19:17 [ENCRYPT-Incoming-2,null] - Unrecognised cipher discarding message
      [WARN ] 2011-10-15 17:19:17 [ENCRYPT-Incoming-3,null] - attempting to use stored cipher as message does not uses current encryption version
      [WARN ] 2011-10-15 17:19:17 [ENCRYPT-Incoming-3,null] - Unable to find a matching cipher in previous key map
      [WARN ] 2011-10-15 17:19:17 [ENCRYPT-Incoming-3,null] - Unrecognised cipher discarding message

      We drilled down to the code in ENCRYPT, and we believe this is related to issue JGRP-706. That issue introduced a fix for a different problem when using dynamic keys which seems to be causing problems when starting nodes simultaneously. This did not happen to us before we moved to a simulatenous multi-node start scenario, so it seems to indicate there is some kind of race condition which is triggered only when the nodes start at almost the same time (they run the same hardware, OS, JVM, etc. so startup times are almost identical)

      This is the ENCRYPT config:

      <ENCRYPT
      sym_init="128"
      sym_algorithm="AES"
      asym_init="512"
      asym_algorithm="RSA"
      />

      The tipoff that something is wrong here, and what led us to study the patch introduced by JGRP-706, is that we have two servers / nodes: one is named gargamel, and the other azrael. The odd thing is that we got these logs in gargamel:

      [INFO ] 2011-10-15 17:19:13 [ENCRYPT-Incoming-3,null] - handling tmp-view up: MergeView::[azrael-8126|2] [azrael-8126, gargamel-29063], subgroups=[[azrael-8126|0] [azrael-8126], [gargamel-29063|1] [gargamel-29063]]
      [INFO ] 2011-10-15 17:19:13 [ENCRYPT-Incoming-3,null] - I have become key server azrael-8126

      There some kind of identity-crisis going on here - how can gargamel-29063 have become azrael-8126? The preceding log line indicates that the patch introduced by JGRP-706 is at play here. A TMP_VIEW event triggers the node into becoming key server. handleViewChange() is invoked with parameter makeServer=true which triggers this line:

      becomeKeyServer(tmpKeyServer);

      where tmpKeyServer is azrael-8126

      We are not really knowledgeable of what should actually happen here, in order for everything to work properly, including the situation described by JGRP-706. However, it seems that in any case, if the local node is to become key server, then the method argument should probably be invoked becomeKeyServer(local_addr) - although I'm not sure whether this is correct, either.

      After this happens, we immediately start to get the log errors described at the beginning of this report.

              rhn-engineering-bban Bela Ban
              madsrenxo Manuel Dominguez Sarmiento (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

                Created:
                Updated:
                Resolved: