Uploaded image for project: 'JGroups'
  1. JGroups
  2. JGRP-2297

Coordinator with ASYM_ENCRYPT in the stack does not leave gracefully

    Details

    • Type: Bug
    • Status: Resolved (View Workflow)
    • Priority: Blocker
    • Resolution: Done
    • Affects Version/s: 4.0.14
    • Fix Version/s: 4.0.19
    • Labels:
      None

      Description

      The ASYM_ENCRYPT_LeaveTest is designed to test graceful leaving coordinator(s) with ASYM_ENCRYPT in the stack. However, the test currently passes due to presence of MERGE3 in the stack. While the intention of the test seems to be testing graceful leaving of coordinator(s), the cluster ends up with inconsistent views later resolved by MERGE3.

      Here is a run of the test with a modification of the test with a single coordinator leaving:
      https://gist.github.com/rhusar/89172882fae60a1f29327c33f2d124db

      The problem seems to be with coordinating of key exchange. In this run, roughly:

      1. node 1 is leaving
      2. node 2 becomes coordinator and key server

      10:55:18.286 [jgroups-3,ASYM_ENCRYPT_LeaveTest,2] DEBUG org.jgroups.protocols.pbcast.GMS - 2: installing view [2|10] (9) [2, 3, 4, 5, 6, 7, 8, 9, 10]
      ...
      10:55:18.299 [jgroups-3,ASYM_ENCRYPT_LeaveTest,2] DEBUG org.jgroups.protocols.ASYM_ENCRYPT - 2: I'm the new key server
      10:55:18.300 [jgroups-3,ASYM_ENCRYPT_LeaveTest,2] DEBUG org.jgroups.protocols.ASYM_ENCRYPT - 2: created new secret key (version: AB1E6F44DE947D792A7D05D2E957AC85)
      ...
      10:55:18.300 [jgroups-3,ASYM_ENCRYPT_LeaveTest,2] DEBUG org.jgroups.protocols.ASYM_ENCRYPT - 2: created new secret key (version: AB1E6F44DE947D792A7D05D2E957AC85)
      

      3. node 9 receives FETCH_SECRET_KEY however receives stale key? looks like it still contacts the leaving coordinator node 1?

      10:55:18.319 [SSL_KEY_EXCHANGE-runner-12,ASYM_ENCRYPT_LeaveTest,1] DEBUG org.jgroups.protocols.SSL_KEY_EXCHANGE - 1: accepted SSL connection from /127.0.0.1:51812; protocol: TLSv1, cipher suite: TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA
      ...
      10:55:18.319 [jgroups-3,ASYM_ENCRYPT_LeaveTest,9] DEBUG org.jgroups.protocols.SSL_KEY_EXCHANGE - 9: created SSL connection to 2 (/127.0.0.1:2157); protocol: TLSv1, cipher suite: TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA
      10:55:18.321 [jgroups-3,ASYM_ENCRYPT_LeaveTest,9] DEBUG org.jgroups.protocols.SSL_KEY_EXCHANGE - 9: sending up secret key (version: AF7916A9394F49B085D4F35C4F5A0A3E)
      10:55:18.321 [jgroups-3,ASYM_ENCRYPT_LeaveTest,9] DEBUG org.jgroups.protocols.ASYM_ENCRYPT - 9: ignoring secret key received from key exchange protocol (version: AF7916A9394F49B085D4F35C4F5A0A3E), as it has already been installed
      

      4. new coordinator fails to collect all acks (since it cannot decipher stale key?)

      10:55:20.307 [jgroups-3,ASYM_ENCRYPT_LeaveTest,2] WARN  org.jgroups.protocols.pbcast.GMS - 2: failed to collect all ACKs (expected=8) for view [2|10] after 2000ms, missing 1 ACKs from (1) 9
      

      5. node 9 eventually obtains the key but since it has stale view and still thinks node 1 is coordinator? and fails to contact it

      10:55:20.307 [jgroups-3,ASYM_ENCRYPT_LeaveTest,9] DEBUG org.jgroups.protocols.ASYM_ENCRYPT - 9: asking key exchange protocol to get secret key from 2
      10:55:20.322 [SSL_KEY_EXCHANGE-runner-26,ASYM_ENCRYPT_LeaveTest,2] DEBUG org.jgroups.protocols.SSL_KEY_EXCHANGE - 2: accepted SSL connection from /127.0.0.1:51829; protocol: TLSv1, cipher suite: TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA
      10:55:20.322 [jgroups-3,ASYM_ENCRYPT_LeaveTest,9] DEBUG org.jgroups.protocols.SSL_KEY_EXCHANGE - 9: created SSL connection to 2 (/127.0.0.1:2158); protocol: TLSv1, cipher suite: TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA
      10:55:20.322 [jgroups-3,ASYM_ENCRYPT_LeaveTest,9] DEBUG org.jgroups.protocols.SSL_KEY_EXCHANGE - 9: sending up secret key (version: AB1E6F44DE947D792A7D05D2E957AC85)
      10:55:20.322 [jgroups-3,ASYM_ENCRYPT_LeaveTest,9] DEBUG org.jgroups.protocols.ASYM_ENCRYPT - 9: installing secret key received from key exchange protocol (version: AB1E6F44DE947D792A7D05D2E957AC85)
      10:55:23.341 [TQ-Bundler-10,ASYM_ENCRYPT_LeaveTest,9] DEBUG org.jgroups.protocols.TCP - JGRP000034: 9: failure sending message to 1: java.net.ConnectException: Connection refused (Connection refused)
      

      6. cluster is later healed with MERGE3

      10:55:27.103 [jgroups-27,ASYM_ENCRYPT_LeaveTest,2] DEBUG org.jgroups.protocols.pbcast.GMS - 2: I will be the merge leader. Starting the merge task. Views: {2=[2|10] (9) [2, 3, 4, 5, 6, 7, 8, 9, 10], 9=[1|9] (10) [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}
      

      Another run with MERGE3 omitted from the stack is here:
      https://gist.github.com/rhusar/b51aeee03485a607041f9669bbc6e707

      Further investigation is ongoing, but this might be related to graceful leaving of coordinator JGRP-2293 exacerbating the problem with key exchange in ASYM_ENCRYPT.

      Scaling down is typical cloud workflow, especially with encryption since ASYM_ENCRYPT is the recommended setup making this problem critical.

        Gliffy Diagrams

          Attachments

            Issue Links

              Activity

                People

                • Assignee:
                  belaban Bela Ban
                  Reporter:
                  rhusar Radoslav Husar
                • Votes:
                  0 Vote for this issue
                  Watchers:
                  4 Start watching this issue

                  Dates

                  • Created:
                    Updated:
                    Resolved: