Type: Task
Resolution: Unresolved
Priority: Major
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels: None

This issue covers investigations into, and potential improvements to, how WildFly interacts with MSC, in order to help address the test suite instability that has become increasingly frequent over the past year. I'm referring to failures where a test expects services to be in a given state (e.g. UP), but intermittently they are not.

Per discussion with Richard Opalka, the underlying issues here are:

1) Using StabilityMonitor to track stability is problematic because the monitor does not track a service's dependencies unless it was also added to those dependencies. So the graph it sees is not complete.
2) Newly installed services go into the DOWN state and use async tasks to notify their dependencies that they are needed. Once those tasks complete (i.e. the message has been delivered), the service controller is still in the DOWN state but has no async tasks, which MSC regards as a stable state. If all services associated with a stability monitor are in this state, the monitor will report stability, but soon thereafter other async work will start moving the service toward UP. In other words, from a naive point of view the StabilityMonitor has reported a 'false positive'.
3) The MSC state machine has recently been optimized, which speeds things up but unfortunately increases the potential for these false positives.
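
To make points 1) and 2) concrete, here is a minimal toy model of the false positive. It is only a sketch: ToyController, ToyMonitor, ToyExecutor and install are hypothetical names invented for illustration, and none of this is the actual MSC implementation. It assumes just the rule described above, namely that a controller with no in-flight async tasks counts as stable regardless of its state, and that a monitor only sees the controllers explicitly added to it.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Toy model only: every class here is hypothetical and is NOT JBoss MSC code. */
public final class StabilityFalsePositiveSketch {

    enum State { DOWN, UP }

    /** Stand-in for a service controller: a state plus a count of in-flight async tasks. */
    static final class ToyController {
        State state = State.DOWN;
        int asyncTasks;

        /** Assumed rule: no in-flight async tasks means "stable", even while DOWN. */
        boolean isStable() { return asyncTasks == 0; }
    }

    /** Stand-in for a stability monitor: it only sees controllers that were added to it. */
    static final class ToyMonitor {
        private final List<ToyController> tracked = new ArrayList<>();

        void add(ToyController c) { tracked.add(c); }

        boolean isStable() { return tracked.stream().allMatch(ToyController::isStable); }
    }

    /** Stand-in for the executor: tasks are queued and run one at a time on demand. */
    static final class ToyExecutor {
        private final Queue<Runnable> queue = new ArrayDeque<>();

        void submit(ToyController owner, Runnable work) {
            owner.asyncTasks++;
            queue.add(() -> {
                try { work.run(); } finally { owner.asyncTasks--; }
            });
        }

        /** Runs the next queued task; returns false when the queue is empty. */
        boolean runNext() {
            Runnable next = queue.poll();
            if (next == null) return false;
            next.run();
            return true;
        }
    }

    /** Installs service a, which depends on service b. */
    static void install(ToyExecutor ex, ToyController a, ToyController b) {
        // Task 1 (owned by a): tell dependency b that it is needed.
        ex.submit(a, () ->
            // Task 2 (owned by b): b starts, then wakes up its dependent a.
            ex.submit(b, () -> {
                b.state = State.UP;
                // Task 3 (owned by a): a finally moves to UP.
                ex.submit(a, () -> a.state = State.UP);
            }));
    }

    public static void main(String[] args) {
        ToyExecutor ex = new ToyExecutor();
        ToyController a = new ToyController();
        ToyController b = new ToyController();
        ToyMonitor monitor = new ToyMonitor();
        monitor.add(a); // b is never added, so the monitor's view of the graph is incomplete
        install(ex, a, b);
        do {
            System.out.println("monitor stable=" + monitor.isStable()
                    + ", A=" + a.state + ", B=" + b.state);
        } while (ex.runNext());
    }
}
```

In this model, after the first task completes the monitor reports stable while A is still DOWN, because the only outstanding work belongs to B, which the monitor does not track. Two tasks later A reaches UP. A test that read A's state right after the first "stable" report would fail intermittently in exactly the way described above.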

Work on this issue can include things like:

1) Improving how the production code OperationContext impls track service container stability. (If this is achievable, it is the best option, since production workloads would benefit as well.)
2) Analyzing our test framework uses of MSC to identify and correct incorrect use.
3) Analyzing our test framework uses of MSC to identify areas where potential instability can be worked around in a centralized way, thus eliminating the need for per-test workarounds. Note that this is the least attractive approach as it amounts to a workaround, not a true fix.

I suggest that specific actions be tracked via separate issues linked to this one. Use this as a kind of Epic.

Various workaround fixes to individual tests that are done to avoid intermittent failures should have their own JIRAs, and those JIRAs should be linked to this one so we can track removing the workarounds if better solutions are found. Consider the workarounds as analogous to adding an @Ignore to a test with a reference to this JIRA; but instead we keep the test running. (The typical workaround would be having a test call ServiceController.awaitValue() when retrieving a Service value in a case where the expected software behavior should mean a getValue call will be reliable.)
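
As a rough illustration of why that workaround helps, here is a self-contained sketch of the difference between a fail-fast read and a blocking read. ToyController is a hypothetical stand-in written with plain java.util.concurrent; it is not the MSC ServiceController, and the exact signatures and exceptions of the real getValue and awaitValue methods should be checked against the MSC Javadoc.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Toy model only: ToyController is hypothetical and is NOT the MSC ServiceController. */
public final class AwaitValueSketch {

    static final class ToyController<T> {
        private final CountDownLatch up = new CountDownLatch(1);
        private volatile T value;

        /** Simulates the service reaching UP on some other thread. */
        void start(T startedValue) {
            value = startedValue;
            up.countDown();
        }

        /** Fail-fast read: throws if the service has not reached UP yet. */
        T getValue() {
            if (up.getCount() != 0) {
                throw new IllegalStateException("service is not UP");
            }
            return value;
        }

        /** Blocking read: waits for UP, or gives up after the timeout. */
        T awaitValue(long timeout, TimeUnit unit)
                throws InterruptedException, TimeoutException {
            if (!up.await(timeout, unit)) {
                throw new TimeoutException("service did not reach UP in time");
            }
            return value;
        }
    }

    public static void main(String[] args) throws Exception {
        ToyController<String> controller = new ToyController<>();

        // The test reads the value before the async start has happened.
        try {
            controller.getValue();
        } catch (IllegalStateException e) {
            System.out.println("getValue failed: " + e.getMessage());
        }

        // The start arrives slightly later, as it would from async container work.
        Thread starter = new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            controller.start("service value");
        });
        starter.start();

        // The blocking read tolerates the delay instead of failing intermittently.
        System.out.println("awaitValue returned: "
                + controller.awaitValue(5, TimeUnit.SECONDS));
        starter.join();
    }
}
```

The blocking read hides the timing gap rather than removing it, which is why such changes are treated here as workarounds to be revisited.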

By test framework I mean things like ModelTestController service and descendants, or the testing framework used in subsystem tests.

is depended on by

WFCORE-6197 Upgrade JBoss MSC to 1.5.0.Final (Closed)

is related to

WFCORE-6141 OtherServicesSubsystemTestCase.testOtherService and testPath fail intermittently (Closed)

WFCORE-6156 Use ServiceContainer.awaitStability() instead of StabilityMonitor.awaitStability() in ContainerStateMonitor (Closed)

WFCORE-6182 Stabilize subsystem tests (Closed)

WFLY-17325 Fix ServerServiceTestCase transient failure (Closed)

WFLY-17347 UndertowSubsystemXXXTestCase.testRuntime fails frequently (Closed)

relates to

WFCORE-6157 Investigate what's wrong with operations cancellation (Closed)

Assignee: Yeray Borges Santana

Reporter: Brian Stansberry

Votes: 0

Watchers: 3

Created: 2022/12/02 4:43 PM

Updated: 2023/01/11 2:50 PM
