JGRP-2296 (JGroups)

DNS_PING is dropping port values with SRV based service discovery


    • Type: Bug
    • Resolution: Done
    • Priority: Blocker
    • Fix version: 4.0.16
    • Affects version: 4.0.11
    • None
    • DNS_PING based service discovery now captures port values from SRV records
    • Steps to reproduce:

      1. Set up a jboss/keycloak HA cluster in AWS ECS with Service Discovery and dynamic port mapping, using the JGroups config below.
      2. Set logging to DEBUG.
      3. The cluster never forms: the ports from the SRV records are ignored and the default transport port (7600) is used instead, which does not work.


      When DNS_PING in JGroups 4.0.11 is used with SRV records, the port from each SRV record is dropped (set to zero) and the default transport port (7600) is used instead.

      I am using this JGroups config:

      <subsystem xmlns="urn:jboss:domain:jgroups:6.0">
          <channels default="ee">
              <channel name="ee" stack="tcp" cluster="ejb"/>
          </channels>
          <stacks>
              <stack name="tcp">
                  <transport type="TCP" socket-binding="jgroups-tcp">
                      <property name="external_addr">${env.EXTERNAL_ADDR}</property>
                  </transport>
                  <protocol type="dns.DNS_PING">
                      <property name="dns_query">jgroups.${env.DNS_NAME}.svc.cluster.local</property>
                      <property name="dns_record_type">SRV</property>
                  </protocol>
                  <protocol type="MERGE3"/>
                  <protocol type="FD_SOCK"/>
                  <protocol type="FD_ALL"/>
                  <protocol type="VERIFY_SUSPECT"/>
                  <protocol type="pbcast.NAKACK2"/>
                  <protocol type="UNICAST3"/>
                  <protocol type="pbcast.STABLE"/>
                  <protocol type="pbcast.GMS"/>
                  <protocol type="MFC"/>
                  <protocol type="FRAG3"/>
              </stack>
          </stacks>
      </subsystem>
      

      I have these service discovery DNS entries:

      $ dig jgroups.dev.auth.example.com.svc.cluster.local SRV
      
      ; <<>> DiG 9.8.2rc1-RedHat-9.8.2-0.68.rc1.58.amzn1 <<>> jgroups.dev.auth.example.com.svc.cluster.local SRV
      ;; global options: +cmd
      ;; Got answer:
      ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 16690
      ;; flags: qr rd ra; QUERY: 1, ANSWER: 4, AUTHORITY: 0, ADDITIONAL: 0
      
      ;; QUESTION SECTION:
      ;jgroups.dev.auth.example.com.svc.cluster.local. IN SRV
      
      ;; ANSWER SECTION:
      jgroups.dev.auth.example.com.svc.cluster.local. 10 IN SRV 1 1 32921 9ec82e3f-3a0e-4e30-b785-17879c63cd7d.jgroups.dev.auth.example.com.svc.cluster.local.
      jgroups.dev.auth.example.com.svc.cluster.local. 10 IN SRV 1 1 32923 60b5a820-9678-4bd2-84c6-00061a52bde0.jgroups.dev.auth.example.com.svc.cluster.local.
      jgroups.dev.auth.example.com.svc.cluster.local. 10 IN SRV 1 1 32915 9d9d78d0-8919-4b91-9df8-2e4e65afedae.jgroups.dev.auth.example.com.svc.cluster.local.
      jgroups.dev.auth.example.com.svc.cluster.local. 10 IN SRV 1 1 32917 161f3d66-f1e3-46f4-a44f-ebda925a25c6.jgroups.dev.auth.example.com.svc.cluster.local.
      
      ;; Query time: 2 msec
      ;; SERVER: 10.42.3.2#53(10.42.3.2)
      ;; WHEN: Fri Sep 21 01:45:44 2018
      ;; MSG SIZE  rcvd: 481
      

      But I get this in the logs when running Keycloak as a standalone cluster:

      17:45:10,121 DEBUG [org.jgroups.protocols.dns.DNS_PING] (thread-3,null,null) Performing initial discovery
      17:45:10,154 DEBUG [org.jgroups.protocols.dns.DNS_PING] (thread-3,null,null) Entries collected from DNS: [10.42.3.56:0, 10.42.3.56:0, 10.42.3.44:0, 10.42.3.44:0]
      17:45:10,155 DEBUG [org.jgroups.protocols.dns.DNS_PING] (thread-3,null,null) Discovered IP Address with port 0 (10.42.3.56:0). Replacing with default Transport port: 7600
      17:45:10,159 DEBUG [org.jgroups.protocols.dns.DNS_PING] (thread-3,null,null) Discovered IP Address with port 0 (10.42.3.56:0). Replacing with default Transport port: 7600
      17:45:10,159 DEBUG [org.jgroups.protocols.dns.DNS_PING] (thread-3,null,null) Discovered IP Address with port 0 (10.42.3.44:0). Replacing with default Transport port: 7600
      17:45:10,159 DEBUG [org.jgroups.protocols.dns.DNS_PING] (thread-3,null,null) Discovered IP Address with port 0 (10.42.3.44:0). Replacing with default Transport port: 7600
      17:45:10,159 DEBUG [org.jgroups.protocols.dns.DNS_PING] (thread-3,null,null) Performing discovery of the following hosts [10.42.3.56:7600, 10.42.3.44:7600, e200a617bf7a]
      17:45:10,159 DEBUG [org.jgroups.protocols.dns.DNS_PING] (thread-3,null,null) e200a617bf7a: sending discovery request to 10.42.3.56:7600
      17:45:10,160 DEBUG [org.jgroups.protocols.dns.DNS_PING] (thread-3,null,null) e200a617bf7a: sending discovery request to 10.42.3.44:7600
      17:45:10,160 DEBUG [org.jgroups.protocols.dns.DNS_PING] (thread-10,ejb,e200a617bf7a) Received discovery from: e200a617bf7a, IP: 10.42.3.44:7600
      17:45:10,161 DEBUG [org.jgroups.protocols.dns.DNS_PING] (thread-3,null,null) e200a617bf7a: sending discovery request to e200a617bf7a
      17:45:10,162 DEBUG [org.jgroups.protocols.dns.DNS_PING] (thread-11,ejb,e200a617bf7a) Received discovery from: e200a617bf7a, IP: 10.42.3.44:7600
      

      As you can see, it resolves the DNS addresses but discards the ports: every entry collected from DNS has port 0.
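The fallback visible in the DEBUG lines above can be sketched as follows. This is only an illustration of the behaviour the log describes, not the actual JGroups code: the class name `PortFallbackSketch` and the method `applyDefaultPort` are hypothetical, and the de-duplication step is an assumption based on the shortened host list in the log.

```java
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.List;

public class PortFallbackSketch {

    // Hypothetical helper: any discovered address with port 0 gets the
    // transport's default port, mirroring the "Replacing with default
    // Transport port" log message. Duplicates are dropped (assumption).
    static List<InetSocketAddress> applyDefaultPort(List<InetSocketAddress> discovered, int defaultPort) {
        List<InetSocketAddress> result = new ArrayList<>();
        for (InetSocketAddress addr : discovered) {
            int port = addr.getPort() == 0 ? defaultPort : addr.getPort();
            InetSocketAddress fixed = new InetSocketAddress(addr.getAddress(), port);
            if (!result.contains(fixed)) {
                result.add(fixed);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<InetSocketAddress> discovered = new ArrayList<>();
        // The four entries from the log, all with port 0.
        discovered.add(new InetSocketAddress("10.42.3.56", 0));
        discovered.add(new InetSocketAddress("10.42.3.56", 0));
        discovered.add(new InetSocketAddress("10.42.3.44", 0));
        discovered.add(new InetSocketAddress("10.42.3.44", 0));
        for (InetSocketAddress addr : applyDefaultPort(discovered, 7600)) {
            System.out.println(addr.getAddress().getHostAddress() + ":" + addr.getPort());
        }
    }
}
```

Once the SRV port has been lost, two containers on the same host become indistinguishable: both collapse to the same IP with port 7600, which is not a port that the host actually exposes under dynamic port mapping.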

      To be clear, in this example 32923 is the port (e.g.
      1 1 32923 60b5a820-9678-4bd2-84c6-00061a52bde0.jgroups.dev.auth.example.com.svc.cluster.local).
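For reference, the value of an SRV record is made up of four whitespace-separated fields: priority, weight, port, and target. A minimal sketch of pulling the port out of such a value is shown here; the class `SrvRecordSketch` and its `parse` method are hypothetical names used for illustration and are not part of JGroups.

```java
public class SrvRecordSketch {
    final int priority;
    final int weight;
    final int port;
    final String target;

    SrvRecordSketch(int priority, int weight, int port, String target) {
        this.priority = priority;
        this.weight = weight;
        this.port = port;
        this.target = target;
    }

    // Parses "priority weight port target", e.g. "1 1 32923 host.example.".
    static SrvRecordSketch parse(String value) {
        String[] fields = value.trim().split("\\s+");
        if (fields.length != 4) {
            throw new IllegalArgumentException("Not an SRV record value: " + value);
        }
        String target = fields[3];
        // Drop the trailing root dot of a fully qualified name, if present.
        if (target.endsWith(".")) {
            target = target.substring(0, target.length() - 1);
        }
        return new SrvRecordSketch(
                Integer.parseInt(fields[0]),
                Integer.parseInt(fields[1]),
                Integer.parseInt(fields[2]),
                target);
    }

    public static void main(String[] args) {
        SrvRecordSketch record = parse(
                "1 1 32923 60b5a820-9678-4bd2-84c6-00061a52bde0.jgroups.dev.auth.example.com.svc.cluster.local.");
        System.out.println(record.target + " -> port " + record.port);
    }
}
```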

      These are dynamic host ports mapped to container port 7600, so that more than one Keycloak container can run on each instance.

      $ docker ps
      CONTAINER ID        IMAGE                                                              COMMAND                  CREATED             STATUS                 PORTS                                              NAMES
      f67e39f8f403        datadog/agent:latest-jmx                                           "/init"                  8 hours ago         Up 8 hours (healthy)   8125/udp, 8126/tcp                                 ecs-auth-service-dev-26-datadog-agent-a2b7f783ddd0ba9cf601
      bbb12f0c43a5        233747045000.dkr.ecr.us-east-2.amazonaws.com/ops/keycloak:latest   "/opt/jboss/tools/do…"   8 hours ago         Up 8 hours             0.0.0.0:32923->7600/tcp, 0.0.0.0:32922->8080/tcp   ecs-auth-service-dev-26-keycloak-f4bd8f8dca9fd4cd4f00
      932cad7c4fb9        datadog/agent:latest-jmx                                           "/init"                  8 hours ago         Up 8 hours (healthy)   8125/udp, 8126/tcp                                 ecs-auth-service-dev-26-datadog-agent-baa38a98ccaddea6f501
      e200a617bf7a        233747045000.dkr.ecr.us-east-2.amazonaws.com/ops/keycloak:latest   "/opt/jboss/tools/do…"   8 hours ago         Up 8 hours             0.0.0.0:32921->7600/tcp, 0.0.0.0:32920->8080/tcp   ecs-auth-service-dev-26-keycloak-e6f398e6cc8db5b5f101
      73bc0b863c73        amazon/amazon-ecs-agent:latest                                     "/agent"                 2 days ago          Up 2 days                                                                 ecs-agent
      

      This seems like it might be where ports are getting lost:
      https://github.com/belaban/JGroups/blob/07060c3ba6e52ad4aad3ac799c2bc95ffd2fe7ff/src/org/jgroups/protocols/dns/DefaultDNSResolver.java#L84

      I don't see the port number being extracted from the SRV entry and appended to the IP returned from resolveAEntries.
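To illustrate what seems to be missing, here is a rough sketch of the behaviour I would expect: read the port from each SRV record and attach it to every address that the record's target resolves to. It is not a patch for DefaultDNSResolver. The class `SrvPortResolverSketch`, its methods, and the `resolveA` callback are hypothetical; the only real API used is the JDK's JNDI DNS provider (`com.sun.jndi.dns.DnsContextFactory`).

```java
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.Hashtable;
import java.util.List;
import java.util.function.Function;
import javax.naming.Context;
import javax.naming.NamingEnumeration;
import javax.naming.NamingException;
import javax.naming.directory.Attribute;
import javax.naming.directory.DirContext;
import javax.naming.directory.InitialDirContext;

public class SrvPortResolverSketch {

    // Fetches raw SRV values ("priority weight port target") for a DNS name
    // using the JNDI DNS provider and the platform's default DNS servers.
    static List<String> querySrv(String dnsQuery) throws NamingException {
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.dns.DnsContextFactory");
        DirContext ctx = new InitialDirContext(env);
        try {
            List<String> records = new ArrayList<>();
            Attribute srv = ctx.getAttributes(dnsQuery, new String[] {"SRV"}).get("SRV");
            if (srv != null) {
                NamingEnumeration<?> values = srv.getAll();
                while (values.hasMore()) {
                    records.add(values.next().toString());
                }
            }
            return records;
        } finally {
            ctx.close();
        }
    }

    // Pairs the port of each SRV record with the IPs its target resolves to.
    // resolveA is a hypothetical callback that returns the A-record IPs of a host name.
    static List<InetSocketAddress> toEndpoints(List<String> srvRecords,
                                               Function<String, List<String>> resolveA) {
        List<InetSocketAddress> endpoints = new ArrayList<>();
        for (String record : srvRecords) {
            String[] fields = record.trim().split("\\s+");
            if (fields.length != 4) {
                continue; // skip anything that is not a well-formed SRV value
            }
            int port = Integer.parseInt(fields[2]); // third field is the port
            for (String ip : resolveA.apply(fields[3])) {
                endpoints.add(new InetSocketAddress(ip, port));
            }
        }
        return endpoints;
    }
}
```

With something along these lines, each discovered member would keep the port from its own SRV record, and the fallback to the default transport port would only be needed for A-record lookups, which carry no port information.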

      Let me know if I am missing any details. This is a major blocker for development.

      Assignee: Bela Ban (rhn-engineering-bban)
      Reporter: Eric Thompson (ethompson_jira, Inactive)