WildFly Core / WFCORE-218

wildfly web management console hangs during deploy from cli

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Affects Version: 1.0.0.Alpha1
    • Component: Management

      We are running WildFly in domain mode with the following configuration:

      host A running the domain controller
      host B running a host controller with one app server
      host C running a host controller with one app server
      host D running a host controller with one app server

      When we deploy a war using jboss-cli, the web console is blocked until the deploy completes. I have run jvisualvm, and it does not appear that the domain controller process is starved for resources (CPU, memory, threads).


            Brian Stansberry added a comment (edited)

            I briefly considered not holding any long lasting topology lock and simply getting the set of hosts under a short lived lock. But that is not reliable:

            1) T1 is doing a domain-wide write, on DC OperationCoordinatorStepHandler gathers the registered servers and creates DomainSlaveHandler to do the HC rollout.
            2) New HC starts, connects, gets exclusive lock, starts registration stuff.
            3) T1 gets to the Stage.MODEL handler that detects a write, tries to get exclusive lock, blocks
            4) New HC reg is completed, exclusive lock released
            5) T1 gets lock, proceeds
            6) T1 gets to DomainSlaveHandler, rolls out the change to the set of slaves provided in 1) above, which does not include New HC.
            7) New HC misses the update.
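
            The interleaving above can be reduced to a single-threaded sketch (hypothetical names, not the actual WildFly classes): a snapshot of the registered hosts taken under a short-lived lock goes stale the moment a new HC registers, so the rollout misses it.

            ```java
            import java.util.LinkedHashSet;
            import java.util.Set;

            // Simulation of the race: T1 snapshots the registered hosts before
            // acquiring the exclusive lock, so a host that registers in between
            // is absent from the rollout target set.
            public class SnapshotRace {

                // Step 1: gather the registered hosts under a short-lived lock.
                static Set<String> snapshotTargets(Set<String> registeredHosts) {
                    return new LinkedHashSet<>(registeredHosts);
                }

                public static void main(String[] args) {
                    Set<String> registeredHosts = new LinkedHashSet<>();
                    registeredHosts.add("hostB");
                    registeredHosts.add("hostC");

                    Set<String> snapshot = snapshotTargets(registeredHosts);

                    // Steps 2-4: New HC registers while T1 waits for the lock.
                    registeredHosts.add("newHC");

                    // Step 6: rollout uses the stale snapshot, not the live set.
                    System.out.println(snapshot.contains("newHC"));        // false
                    System.out.println(registeredHosts.contains("newHC")); // true
                }
            }
            ```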

            The situation with servers I believe is simpler. There the set of host and server proxies is a ref to the complete, dynamically updated set. Which servers get called depends on the rollout plan. The rollout plan is created after Stage.MODEL, so the exclusive lock will be held when it is created. So any "New Server" joining in a race with the change will either a) block in registration acquiring the exclusive lock until after the change is complete or b) cause the change to block in Stage.MODEL until reg is complete, with New Server then being picked up by DomainRolloutStepHandler the same as if it had been registered before the change op even began.

            The way the server case is handled by DomainRolloutStepHandler suggests a possible easy fix for the host case as well. DomainSlaveHandler should be constructed with a ref to the complete, dynamically updated map of host proxies (the way DomainRolloutStepHandler is). It should also be given the set of host names to update, or null if the update is global. If the list of host names is not null, the op only targets particular hosts, with no possibility of that set being added to in the course of execution. So, if the change is global, the write lock in a Stage.MODEL step will ensure that any new host is either registered before DomainSlaveHandler executes, or is blocked waiting for the change op to complete. If the change is not global, the registration of a new slave is irrelevant to DomainSlaveHandler; it just works with the set of hosts it knows about.
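
            A minimal sketch of that fix (hypothetical signatures, not the real DomainSlaveHandler API): hold a reference to the live, concurrently updated proxy map and resolve the target hosts only at execution time, so a global update picks up late registrations.

            ```java
            import java.util.Map;
            import java.util.Set;
            import java.util.concurrent.ConcurrentHashMap;

            // Sketch: constructed with a live ref to the proxy map (not a copy)
            // plus an optional fixed set of target host names (null == global).
            public class SlaveHandlerSketch {
                private final Map<String, Object> liveHostProxies; // live view
                private final Set<String> targetHosts;             // null == global

                SlaveHandlerSketch(Map<String, Object> liveHostProxies, Set<String> targetHosts) {
                    this.liveHostProxies = liveHostProxies;
                    this.targetHosts = targetHosts;
                }

                // Resolved at execute time, under the exclusive lock, so a global
                // update sees any host registered before the lock was acquired.
                Set<String> resolveTargets() {
                    if (targetHosts != null) {
                        return targetHosts; // fixed targets; new slaves irrelevant
                    }
                    return liveHostProxies.keySet(); // live view includes new hosts
                }

                public static void main(String[] args) {
                    Map<String, Object> proxies = new ConcurrentHashMap<>();
                    proxies.put("hostB", new Object());
                    SlaveHandlerSketch global = new SlaveHandlerSketch(proxies, null);
                    proxies.put("newHC", new Object()); // registers after construction
                    System.out.println(global.resolveTargets().contains("newHC")); // true
                }
            }
            ```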

            Reads still need some thought, though. The current behavior of taking the exclusive lock overly aggressively does prevent some failure scenarios, such as a client that periodically reads a batch of metrics getting a failure because a host or server is removed by another op in the middle of the read. This could be a real scenario now that things like multi-process reads and the query op are supported.


            Brian Stansberry added a comment

            A third possibility is to not do 2) above but instead focus on discriminating 2-phase ops that are purely reads from those that may involve writes. Evaluate to what degree topology stability is important in different cases, and see whether the lock acquisition needed to ensure it can be dropped in the important cases.

            The biggest issue is a 2-phase read executing while one of the target processes leaves the domain; that needs to be handled in a reasonable fashion. This requirement already applies in the case of a target process crashing, though.

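
            The read/write discrimination could start as simply as classifying an op before deciding whether to take the lock. A rough sketch with illustrative name prefixes only; the real classification would come from the operation's registered flags, not string matching:

            ```java
            import java.util.Set;

            // Illustrative-only discrimination of pure reads from possible
            // writes by operation-name prefix.
            public class ReadWriteDiscriminator {
                private static final Set<String> READ_PREFIXES =
                        Set.of("read-", "query", "list-");

                static boolean isPureRead(String opName) {
                    return READ_PREFIXES.stream().anyMatch(opName::startsWith);
                }

                public static void main(String[] args) {
                    System.out.println(isPureRead("read-attribute")); // true
                    System.out.println(isPureRead("deploy"));         // false
                }
            }
            ```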

            Brian Stansberry added a comment

            The thread dump shows 2 console request threads blocked waiting for the exclusive controller lock, which is held by the thread executing the deployment op.

            The console request threads are asking for the exclusive lock so they can execute a 2-phase domain wide operation. All such ops get the exclusive lock so they can prevent domain topology changes during execution.

            There are two ways I want to improve this:

            1) I believe when the console sends a composite operation, it is getting routed into the 2-phase domain wide operation path more often than is necessary. Improving this needs to be done carefully, but should be a relatively straightforward change and probably will have the biggest impact, as most of the CRUD screens in the console don't need to invoke ops that involve more than the domain controller.

            2) Look into having a separate lock for the domain topology, and not using the exclusive controller lock to guard it. That requires care though, as now there will be two separate locks involved in operation execution. We need to be certain that all code paths always acquire them in the same order or we'll be vulnerable to deadlocks. I believe the correct order should be 1) topology lock 2) controller lock. There are relatively few points where a topology lock would be needed, and I believe they are all at the outer edge of operation execution. So it's much simpler to control those points and ensure they always get topology before doing anything that could need the controller lock.

            This second step will be more important once the feature discussed at http://lists.jboss.org/pipermail/wildfly-dev/2014-October/003241.html comes in, as that will result in numerous read operations that truly need the domain topology lock.

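
            The two-lock ordering described in 2) can be sketched as follows (hypothetical lock names): every path that needs both locks takes the topology lock first and the controller lock second, so no two threads can ever hold them in opposite order and deadlock.

            ```java
            import java.util.concurrent.locks.ReentrantLock;

            // Minimal sketch of the proposed lock ordering:
            // 1) topology lock, 2) controller lock, always in that order.
            public class LockOrdering {
                static final ReentrantLock topologyLock = new ReentrantLock();
                static final ReentrantLock controllerLock = new ReentrantLock();

                static String runOperation(Runnable op) {
                    topologyLock.lock();          // 1) topology lock first, always
                    try {
                        controllerLock.lock();    // 2) controller lock second, always
                        try {
                            op.run();
                            return "ok";
                        } finally {
                            controllerLock.unlock();
                        }
                    } finally {
                        topologyLock.unlock();
                    }
                }

                public static void main(String[] args) {
                    System.out.println(runOperation(() -> {})); // ok
                }
            }
            ```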

            Ian Kent (Inactive) added a comment

            I am using WildFly 8.1.0.Final.

            When I am deploying a war via jboss-cli, the WildFly management console is not available; I just get a spinning icon.
            I used jvisualvm to connect to the JMX endpoint of the domain controller Java process:
            service:jmx:http-remoting-jmx://[host]:9990

            I did a thread dump using jvisualvm; it is attached.


            Brian Stansberry added a comment

            Please provide version information, and also thread dumps from each of the hosts (all processes on each host; e.g. from killall -3 java).


              Assignee: Unassigned
              Reporter: Ian Kent (Inactive)