  1. JBoss Enterprise Application Platform
  2. JBEAP-24863

OpenShift - High frequency distributed timer fail-over is failing sometimes


      Locally:

      • Clone the EAP QE OpenShift test suite repository
      • Get testable bits, e.g. the productized EAP 8 Maven repository
      • Build the test suite, see the README.md
      • Configure the test suite execution by providing required parameters (feel free to reach out to QE to get values, e.g. the test cluster URL etc.)
      • Execute the test in debug mode, so that you can stop and monitor the cluster pods' logs. E.g.: mvn clean test -B -P80-openjdk17 -Dtest=EjbDistributedTimersTest -Dmaven.surefire.debug="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=localhost:8000"

      On Jenkins:

      • Rebuild the internal job which is linked as part of the first comment (no debug mode here, hence monitoring the cluster becomes a bit harder)

      We have an OpenShift test that fails in about 50% of cases when executed on EAP QE Jenkins pipelines.

      The mentioned test deploys:

      • a PostgreSQL service to store timer expiration metadata
      • an EAP application service that exposes endpoints to handle such persistence operations
      • an EAP application service that exposes two EJB timer beans: one that transactionally calls the persistence APIs to record its expirations, and another that only logs messages without storing any expiration metadata. This application service also exposes endpoints that let the test class create and delete timers and retrieve information about them. These endpoints call the EJB timer beans remotely.

      The timer persistence is delegated to the Infinispan subsystem, as per EAP7-1417.

      After deploying the scenario, several tests are run, e.g. to verify a timer can be created or deleted successfully, and then a couple of fail-over scenario tests are executed.

      The failing test involves a high-frequency (0.5 seconds) distributed timer that is created and executed by a pod which is stopped after some time.
      The timer expirations are recorded by the persistence mechanism, and we are seeing cases where the actual count of recorded expirations in a period of time is less than the expected one (i.e. 95% of expected timeouts are recorded).
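
      To make the expected-versus-recorded comparison concrete, here is a minimal sketch of the arithmetic. It is not the code used by the test suite: the class and method names are hypothetical, the 60-second observation window is an assumed value, and the 95% figure is treated as the minimum acceptable ratio, which is one reading of the statement above.

```java
import java.time.Duration;

/** Hypothetical helper for illustration only; not part of the EAP QE test suite. */
final class TimerExpirationCheck {

    /** Number of expirations a timer firing every {@code interval} should produce within {@code window}. */
    static long expectedExpirations(Duration window, Duration interval) {
        if (interval.isZero() || interval.isNegative()) {
            throw new IllegalArgumentException("interval must be positive");
        }
        return window.toMillis() / interval.toMillis();
    }

    /** True when the recorded count reaches at least {@code minPercent} percent of the expected count. */
    static boolean meetsThreshold(long recorded, long expected, int minPercent) {
        // Integer arithmetic avoids floating-point rounding at the boundary.
        return recorded * 100 >= expected * minPercent;
    }

    public static void main(String[] args) {
        Duration interval = Duration.ofMillis(500); // timer frequency stated in this report
        Duration window = Duration.ofSeconds(60);   // assumed observation window
        long expected = expectedExpirations(window, interval);

        System.out.println("expected = " + expected);                         // expected = 120
        System.out.println("114 ok?  " + meetsThreshold(114, expected, 95));  // true
        System.out.println("113 ok?  " + meetsThreshold(113, expected, 95));  // false
    }
}
```

      With these assumed values, a fail-over that loses more than six expirations in the window would push the recorded count below the threshold.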

      So far, the failure hasn't been observed when running the same test and configuration locally.

      Links to internal resources documenting the test behavior are reported as part of the first comments.

        1. eap-distributed-ejb-timers-app-1-2ttbf.log
          72 kB
          Fabio Burzigotti
        2. eap-distributed-ejb-timers-app-1-dtvh9.log
          29 kB
          Fabio Burzigotti
        3. eap-distributed-ejb-timers-app-1-xmfzg.log
          42 kB
          Fabio Burzigotti
        4. everything.log
          53 kB
          Fabio Burzigotti
        5. high-freq-failure_less-than-expected.log
          14 kB
          Fabio Burzigotti

            pferraro@redhat.com Paul Ferraro
            fburzigo Fabio Burzigotti