WildFly Core / WFCORE-6151

Testsuite service container stability improvements


    • Type: Task
    • Resolution: Unresolved
    • Priority: Major

      This issue covers investigation of, and potential improvements to, how WildFly interacts with MSC, in order to help address the testsuite instability issues that have been occurring increasingly frequently over the past year. I'm referring to instabilities where a test expects services to be in a given state (e.g. UP), but intermittently they are not.

      Per discussion with Richard Opalka, the underlying issues here are:

      1) Using StabilityMonitor to track stability is problematic, as the monitor does not track a service's dependencies unless it was also added to those dependencies, so the graph it observes is not complete.
      2) Newly installed services go into the DOWN state and use async tasks to notify their dependencies that they are needed. Once those tasks complete (i.e. the messages are delivered), the service controller is still in the DOWN state but has no async tasks, which MSC regards as a stable state. If all services associated with a stability monitor are in this state, the monitor will report stability, but soon thereafter other async work will start moving the services toward UP. In other words, from the naive point of view the StabilityMonitor reported a 'false positive'. (A sketch of this pattern follows the list below.)
      3) The MSC state machine has been optimized over the recent past, which speeds things up but unfortunately increases the window for these false positives.
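
      To make point 2 concrete, below is a minimal, hypothetical sketch of the pattern (not code from WildFly Core), using the legacy MSC API; the class name, service name, value and timeout are illustrative assumptions. Depending on timing, awaitStability() can return while the controller is still DOWN, because the controller has delivered its notification messages and has no pending async tasks, even though it has not yet started moving toward UP.

      import java.util.concurrent.TimeUnit;

      import org.jboss.msc.service.ServiceContainer;
      import org.jboss.msc.service.ServiceController;
      import org.jboss.msc.service.ServiceName;
      import org.jboss.msc.service.StabilityMonitor;
      import org.jboss.msc.service.ValueService;
      import org.jboss.msc.value.ImmediateValue;

      public class StabilityFalsePositiveSketch {

          public static void main(String[] args) throws Exception {
              ServiceContainer container = ServiceContainer.Factory.create();
              StabilityMonitor monitor = new StabilityMonitor();
              try {
                  // Install a trivial service with the monitor attached to its controller.
                  ServiceController<?> controller = container
                          .addService(ServiceName.of("sketch", "service"),
                                  new ValueService<>(new ImmediateValue<>("value")))
                          .addMonitor(monitor)
                          .install();

                  // The monitor may report stability before the controller starts,
                  // i.e. while controller.getState() is still DOWN.
                  monitor.awaitStability(10, TimeUnit.SECONDS);
                  System.out.println("State after awaitStability(): " + controller.getState());
              } finally {
                  container.shutdown();
                  container.awaitTermination();
              }
          }
      }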

      Work on this issue can include things like:

      1) Improving how the production code OperationContext impls track service container stability. (If this is achievable it's the best option, as production workloads will benefit as well.)
      2) Analyzing our test framework's uses of MSC to identify and correct incorrect usage.
      3) Analyzing our test framework's uses of MSC to identify areas where potential instability can be worked around in a centralized way, thus eliminating the need for per-test workarounds. Note that this is the least attractive approach, as it amounts to a workaround, not a true fix. (A sketch of such a centralized helper follows this list.)
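
      As an illustration of option 3, a centralized helper along the following lines could be added to the test framework. This is a hypothetical sketch, not existing WildFly Core code; the class name, method name and timeout policy are assumptions. It simply wraps ServiceContainer.awaitStability so individual tests don't add ad-hoc waits before asserting on service state.

      import java.util.concurrent.TimeUnit;
      import java.util.concurrent.TimeoutException;

      import org.jboss.msc.service.ServiceContainer;

      public final class ContainerStabilityUtil {

          private ContainerStabilityUtil() {
          }

          // Blocks until the container reports stability or the timeout elapses,
          // giving in-flight async MSC tasks a chance to finish before a test
          // asserts on service state.
          public static void awaitStability(ServiceContainer container, long timeout, TimeUnit unit)
                  throws InterruptedException, TimeoutException {
              if (!container.awaitStability(timeout, unit)) {
                  throw new TimeoutException("Service container did not stabilize within " + timeout + " " + unit);
              }
          }
      }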

      I suggest that specific actions be tracked via separate issues linked to this one. Use this as a kind of Epic.

      Various workaround fixes to individual tests that are done to avoid intermittent failures should have their own JIRAs, and those JIRAs should be linked to this one so we can track removing the workarounds if better solutions are found. Consider the workarounds as analogous to adding an @Ignore to a test with a reference to this JIRA, except that we keep the test running. (The typical workaround would be having a test call ServiceController.awaitValue() when retrieving a service value, in a case where the expected software behavior should mean a getValue() call would be reliable; a sketch follows below.)
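
      For reference, the awaitValue-style workaround might look something like the following test helper. This is a sketch only; the class name, method name, timeout and generics handling are assumptions.

      import java.util.concurrent.TimeUnit;

      import org.jboss.msc.service.ServiceController;
      import org.jboss.msc.service.ServiceName;
      import org.jboss.msc.service.ServiceRegistry;

      public final class AwaitValueWorkaroundSketch {

          private AwaitValueWorkaroundSketch() {
          }

          // Instead of calling controller.getValue() directly (which throws
          // IllegalStateException if the service is not yet UP), the test waits
          // for the value to become available.
          @SuppressWarnings("unchecked")
          public static <T> T awaitServiceValue(ServiceRegistry registry, ServiceName name,
                                                long timeout, TimeUnit unit) throws Exception {
              ServiceController<?> controller = registry.getRequiredService(name);
              return (T) controller.awaitValue(timeout, unit);
          }
      }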

      By 'test framework' I mean things like the ModelTestController service and its descendants, or the testing framework used in subsystem tests.

              Assignee: Yeray Borges Santana
              Reporter: Brian Stansberry
