-
Bug
-
Resolution: Done
-
Blocker
-
4.0.14
-
None
The ASYM_ENCRYPT_LeaveTest is designed to test graceful leaving coordinator(s) with ASYM_ENCRYPT in the stack. However, the test currently passes due to presence of MERGE3 in the stack. While the intention of the test seems to be testing graceful leaving of coordinator(s), the cluster ends up with inconsistent views later resolved by MERGE3.
Here is a run of the test with a modification of the test with a single coordinator leaving:
https://gist.github.com/rhusar/89172882fae60a1f29327c33f2d124db
The problem seems to be with coordinating of key exchange. In this run, roughly:
1. node 1 is leaving
2. node 2 becomes coordinator and key server
10:55:18.286 [jgroups-3,ASYM_ENCRYPT_LeaveTest,2] DEBUG org.jgroups.protocols.pbcast.GMS - 2: installing view [2|10] (9) [2, 3, 4, 5, 6, 7, 8, 9, 10] ... 10:55:18.299 [jgroups-3,ASYM_ENCRYPT_LeaveTest,2] DEBUG org.jgroups.protocols.ASYM_ENCRYPT - 2: I'm the new key server 10:55:18.300 [jgroups-3,ASYM_ENCRYPT_LeaveTest,2] DEBUG org.jgroups.protocols.ASYM_ENCRYPT - 2: created new secret key (version: AB1E6F44DE947D792A7D05D2E957AC85) ... 10:55:18.300 [jgroups-3,ASYM_ENCRYPT_LeaveTest,2] DEBUG org.jgroups.protocols.ASYM_ENCRYPT - 2: created new secret key (version: AB1E6F44DE947D792A7D05D2E957AC85)
3. node 9 receives FETCH_SECRET_KEY however receives stale key? looks like it still contacts the leaving coordinator node 1?
10:55:18.319 [SSL_KEY_EXCHANGE-runner-12,ASYM_ENCRYPT_LeaveTest,1] DEBUG org.jgroups.protocols.SSL_KEY_EXCHANGE - 1: accepted SSL connection from /127.0.0.1:51812; protocol: TLSv1, cipher suite: TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA ... 10:55:18.319 [jgroups-3,ASYM_ENCRYPT_LeaveTest,9] DEBUG org.jgroups.protocols.SSL_KEY_EXCHANGE - 9: created SSL connection to 2 (/127.0.0.1:2157); protocol: TLSv1, cipher suite: TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA 10:55:18.321 [jgroups-3,ASYM_ENCRYPT_LeaveTest,9] DEBUG org.jgroups.protocols.SSL_KEY_EXCHANGE - 9: sending up secret key (version: AF7916A9394F49B085D4F35C4F5A0A3E) 10:55:18.321 [jgroups-3,ASYM_ENCRYPT_LeaveTest,9] DEBUG org.jgroups.protocols.ASYM_ENCRYPT - 9: ignoring secret key received from key exchange protocol (version: AF7916A9394F49B085D4F35C4F5A0A3E), as it has already been installed
4. new coordinator fails to collect all acks (since it cannot decipher stale key?)
10:55:20.307 [jgroups-3,ASYM_ENCRYPT_LeaveTest,2] WARN org.jgroups.protocols.pbcast.GMS - 2: failed to collect all ACKs (expected=8) for view [2|10] after 2000ms, missing 1 ACKs from (1) 9
5. node 9 eventually obtains the key but since it has stale view and still thinks node 1 is coordinator? and fails to contact it
10:55:20.307 [jgroups-3,ASYM_ENCRYPT_LeaveTest,9] DEBUG org.jgroups.protocols.ASYM_ENCRYPT - 9: asking key exchange protocol to get secret key from 2 10:55:20.322 [SSL_KEY_EXCHANGE-runner-26,ASYM_ENCRYPT_LeaveTest,2] DEBUG org.jgroups.protocols.SSL_KEY_EXCHANGE - 2: accepted SSL connection from /127.0.0.1:51829; protocol: TLSv1, cipher suite: TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA 10:55:20.322 [jgroups-3,ASYM_ENCRYPT_LeaveTest,9] DEBUG org.jgroups.protocols.SSL_KEY_EXCHANGE - 9: created SSL connection to 2 (/127.0.0.1:2158); protocol: TLSv1, cipher suite: TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA 10:55:20.322 [jgroups-3,ASYM_ENCRYPT_LeaveTest,9] DEBUG org.jgroups.protocols.SSL_KEY_EXCHANGE - 9: sending up secret key (version: AB1E6F44DE947D792A7D05D2E957AC85) 10:55:20.322 [jgroups-3,ASYM_ENCRYPT_LeaveTest,9] DEBUG org.jgroups.protocols.ASYM_ENCRYPT - 9: installing secret key received from key exchange protocol (version: AB1E6F44DE947D792A7D05D2E957AC85) 10:55:23.341 [TQ-Bundler-10,ASYM_ENCRYPT_LeaveTest,9] DEBUG org.jgroups.protocols.TCP - JGRP000034: 9: failure sending message to 1: java.net.ConnectException: Connection refused (Connection refused)
6. cluster is later healed with MERGE3
10:55:27.103 [jgroups-27,ASYM_ENCRYPT_LeaveTest,2] DEBUG org.jgroups.protocols.pbcast.GMS - 2: I will be the merge leader. Starting the merge task. Views: {2=[2|10] (9) [2, 3, 4, 5, 6, 7, 8, 9, 10], 9=[1|9] (10) [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}
Another run with MERGE3 omitted from the stack is here:
https://gist.github.com/rhusar/b51aeee03485a607041f9669bbc6e707
Further investigation is ongoing, but this might be related to graceful leaving of coordinator JGRP-2293 exacerbating the problem with key exchange in ASYM_ENCRYPT.
Scaling down is typical cloud workflow, especially with encryption since ASYM_ENCRYPT is the recommended setup making this problem critical.
- is related to
-
CLOUD-2789 Problems with cluster scale-down when SYM_ENCRYPT configured
- New
-
JGRP-2293 Graceful concurrent leaving of coordinator(s) leaves the cluster with stale views
- Resolved
-
CLOUD-2417 [EAP CD] Clustering with openshift.KUBE_PING doesn't work correctly
- Closed