  1. JBoss Enterprise Application Platform
  2. JBEAP-30368

EAP 8.1.0 Beta on OpenShift clustering - RHDG failover + scale down leads to (HTTP 504|HTTP 500) and EAP cache inconsistencies intermittently


    • Bug
    • Resolution: Not a Bug
    • Critical
    • None
    • 8.1.0.GA-CR4, 8.1.0.Beta
    • Clustering, OpenShift
    • False
    • False
    • User Experience

      An EAP 8.1.0 Beta + Red Hat Data Grid 8.5.3.GA interoperability test on OpenShift, which validates EAP behavior during a remote RHDG failover, fails intermittently, signaling cache inconsistencies:

      java.lang.AssertionError: 
      1 expectation failed.
      JSON path value doesn't match.
      Expected: is "10"
        Actual: null
      
      	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
      	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:77)
      	at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
      	at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:499)
      	at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:480)
      	at org.codehaus.groovy.reflection.CachedConstructor.invoke(CachedConstructor.java:73)
      	at org.codehaus.groovy.reflection.CachedConstructor.doConstructorInvoke(CachedConstructor.java:60)
      	at org.codehaus.groovy.runtime.callsite.ConstructorSite$ConstructorSiteNoUnwrap.callConstructor(ConstructorSite.java:86)
      	at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCallConstructor(CallSiteArray.java:57)
      	at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callConstructor(AbstractCallSite.java:263)
      	at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callConstructor(AbstractCallSite.java:277)
      	at io.restassured.internal.ResponseSpecificationImpl$HamcrestAssertionClosure.validate(ResponseSpecificationImpl.groovy:512)
      	at io.restassured.internal.ResponseSpecificationImpl$HamcrestAssertionClosure$validate$1.call(Unknown Source)
      	at io.restassured.internal.ResponseSpecificationImpl.validateResponseIfRequired(ResponseSpecificationImpl.groovy:696)
      	at io.restassured.internal.ResponseSpecificationImpl.this$2$validateResponseIfRequired(ResponseSpecificationImpl.groovy)
      	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
      	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
      	at org.codehaus.groovy.runtime.callsite.PlainObjectMetaMethodSite.doInvoke(PlainObjectMetaMethodSite.java:43)
      	at org.codehaus.groovy.runtime.callsite.PogoMetaMethodSite$PogoCachedMethodSiteNoUnwrapNoCoerce.invoke(PogoMetaMethodSite.java:198)
      	at org.codehaus.groovy.runtime.callsite.PogoMetaMethodSite.callCurrent(PogoMetaMethodSite.java:62)
      	at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callCurrent(AbstractCallSite.java:185)
      	at io.restassured.internal.ResponseSpecificationImpl.body(ResponseSpecificationImpl.groovy:270)
      	at io.restassured.specification.ResponseSpecification$body$1.callCurrent(Unknown Source)
      	at io.restassured.internal.ResponseSpecificationImpl.body(ResponseSpecificationImpl.groovy:117)
      	at io.restassured.internal.ValidatableResponseOptionsImpl.body(ValidatableResponseOptionsImpl.java:244)
      	at org.jboss.qa.appsint.tests.eap.rhdg.eap8.session.offload.Eap8WebCacheOffloadedToOperatorRhdgTests.testValue(Eap8WebCacheOffloadedToOperatorRhdgTests.java:262)
      	at org.jboss.qa.appsint.tests.eap.rhdg.eap8.session.offload.Eap8WebCacheOffloadedToOperatorRhdgTests.rhdgFailover(Eap8WebCacheOffloadedToOperatorRhdgTests.java:188)
      ...
      

      The deployment is built via the EAP Maven plugin with the cloud-default-config layer plus the web-clustering, ejb, and ejb-dist-cache layers, and with the ejb-local-cache layer excluded.
      The infinispan subsystem is configured to connect via HotRod:

      /socket-binding-group=standard-sockets/remote-destination-outbound-socket-binding=rhdg:add(host=${env.JDG_HOST}, port=${env.JDG_PORT})
      /subsystem=infinispan/remote-cache-container=rhdg-container:add(default-remote-cluster=data-grid-cluster)
      /subsystem=infinispan/remote-cache-container=rhdg-container/remote-cluster=data-grid-cluster:add(socket-bindings=[rhdg])
      /subsystem=infinispan/cache-container=web/invalidation-cache=rhdg-cache:add()
      /subsystem=infinispan/cache-container=web/invalidation-cache=rhdg-cache/store=hotrod:add(remote-cache-container=rhdg-container,fetch-state=false,purge=false,passivation=false,shared=true)
      /subsystem=infinispan/cache-container=web:write-attribute(name=default-cache,value=rhdg-cache)
      /subsystem=infinispan/remote-cache-container=rhdg-container:write-attribute(name=properties, value={infinispan.client.hotrod.auth_realm=default,infinispan.client.hotrod.use_auth=true,infinispan.client.hotrod.auth_username=${env.CACHE_USERNAME},infinispan.client.hotrod.auth_password=${env.CACHE_PASSWORD},infinispan.client.hotrod.auth_server_name=rhdg-host,infinispan.client.hotrod.sasl_properties.javax.security.sasl.qop=auth,infinispan.client.hotrod.sasl_mechanism=SCRAM-SHA-512,infinispan.client.hotrod.sni_host_name=rhdg-host,infinispan.client.hotrod.ssl_hostname_validation=false,infinispan.client.hotrod.trust_store_path=/var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt,})
      
      

      The test creates an EAP cluster that offloads its web session cache to an RHDG cluster, then checks that the expected values are still in the cache after one RHDG instance is stopped ungracefully and the 3-replica RHDG cluster is scaled down to 2 immediately afterwards.

      This is similar to JBEAP-29870, but about an RHDG failover scenario, rather than an EAP one.
      The overall configuration (layers + infinispan subsystem) has been validated already by developers, so we're setting this as a blocker for 8.1.0 GA.

      Regarding the test logic, here's a source code fragment, enriched with numbered comments to emphasize the most relevant steps:

      		// 1. start a 2-replica RHDG cluster, then, once it is well-formed, start a 2-replica EAP cluster
      		setInitialClustersReplicas();
      		List<Pod> pods = rhdgOpenShiftProvisioner.getPods();
      		// 2. get a reference to the RHDG pod that will be deleted
      		Pod podToFail = pods.get(0);
      		log.debug("The \"{}\" pod will be terminated ungracefully to simulate Infinispan/RHDG failover",
      				podToFail.getMetadata().getName());
      		// 3. store a web session value, which is persisted to the remote Infinispan cache
      		RequestSpecification session = RestAssured.given().accept(ContentType.JSON)
      				.filter(new SessionFilter());
      		putValue(session, 10);
      
      		// 4. as noted in https://issues.redhat.com/browse/JBEAP-29870, pause before the pod deletion: it is not
      		// guaranteed that data was successfully replicated/persisted prior to an abrupt pod deletion, and
      		// without the pause the test would fail intermittently.
      		Thread.sleep(PAUSE_TO_ALLOW_DATA_REPLICATION_IN_SECONDS * 1000);
      		testValue(session, 10);
      
      		// 5. scale the RHDG cluster up to 3 replicas
      		log.debug("Scaling Infinispan/RHDG cluster up to 3 replicas...");
      		rhdgOpenShiftProvisioner.scale(3, true);
      
      		// 6. delete the first RHDG pod;
      		//	killing the first pod will cause the RHDG Operator to try and redeploy it
      		rhdgOpenShiftProvisioner.getOpenShift().deletePod(podToFail);
      
      		// 7. scale the RHDG cluster down to 2 replicas immediately after the pod deletion;
      		//	since we scale down to 2, the operator should:
      		//	a. react to the #0-pod deletion by spinning it up again
      		//	b. once it's ready, react to the scale down request by deleting the #1-pod
      		log.debug("Scaling Infinispan/RHDG cluster down to 2 replicas...");
      		rhdgOpenShiftProvisioner.scale(2, true);
      
      		// 8. read the value; this is where the test fails intermittently
      		testValue(session, 10);
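
The fragment above relies on a fixed `Thread.sleep` and on single reads of the session value, which makes it hard to tell a value that is briefly unavailable while the RHDG cluster rebalances from one that is actually lost. As a diagnostic aid only, here is a minimal sketch of a bounded poll. `BoundedPoll` and `pollUntil` are hypothetical names, not part of the quoted test suite, REST Assured, or EAP, and the sketch makes no claim about the root cause of the failure.

```java
import java.time.Duration;
import java.util.Optional;
import java.util.function.Predicate;
import java.util.function.Supplier;

/** Hypothetical helper for illustration; not part of the test suite quoted above. */
final class BoundedPoll {

    private BoundedPoll() {
    }

    /**
     * Calls {@code read} repeatedly until {@code accepted} matches a non-null result
     * or {@code timeout} has elapsed. Returns the first accepted value, or an empty
     * Optional if none was observed before the deadline (or the thread was interrupted).
     */
    static <T> Optional<T> pollUntil(Supplier<T> read, Predicate<T> accepted,
            Duration timeout, Duration interval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            T value = read.get();
            if (value != null && accepted.test(value)) {
                return Optional.of(value);
            }
            // Overflow-safe comparison of nanoTime values.
            if (System.nanoTime() - deadline >= 0) {
                return Optional.empty();
            }
            try {
                Thread.sleep(interval.toMillis());
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and give up polling.
                Thread.currentThread().interrupt();
                return Optional.empty();
            }
        }
    }
}
```

In a test like the one above, `read` could wrap whatever call fetches the session value and returns it as a string, with `"10"::equals` as the predicate. A value that appears only after a few attempts points to a timing window, while an empty result after a generous timeout points to the value being gone.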
      

      As a final note, both the EAP pods have clean logs at the end of the test execution, and the same applies to the 2 remaining RHDG pods.
      Feel free to reach out for any additional details.

        1. eap-1-r78qs.log
          42 kB
        2. eap-1-52lnj.log
          35 kB
        3. rhdg-0.log
          29 kB
        4. rhdg-1.log
          22 kB

              pferraro@redhat.com Paul Ferraro
              fburzigo@redhat.com Fabio Burzigotti