Type: Bug
Resolution: Done
Priority: Minor
If a managed server becomes unresponsive (for example, after an OutOfMemoryError), parts of the domain console, such as the Topology and Server Groups views, also hang and become non-functional. This is easy to reproduce by pausing one of the managed server instances:
$ kill -STOP $MANAGED_SERVER_PID
Then, in the domain console, navigate to Runtime->Topology; the view stays stuck in a loading state. Thread dumps of the domain controller show request threads stuck as below, waiting for a response from the hung managed server:
"External Management Request Threads -- 4" #85 prio=5 os_prio=0 tid=0x000055ae6a589800 nid=0x4d53 waiting on condition [0x00007efd0c452000]
   java.lang.Thread.State: TIMED_WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x00000000f8917138> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
    at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
    at org.jboss.as.controller.remote.RemoteProxyController.execute(RemoteProxyController.java:170)
    at org.jboss.as.controller.TransformingProxyController$Factory$TransformingProxyControllerImpl.execute(TransformingProxyController.java:203)
    at org.jboss.as.controller.ProxyStepHandler.execute(ProxyStepHandler.java:180)
    at org.jboss.as.controller.operations.global.GlobalOperationHandlers$FilterableRemoteOperationStepHandler.execute(GlobalOperationHandlers.java:860)
    at org.jboss.as.controller.AbstractOperationContext.executeStep(AbstractOperationContext.java:1047)
    at org.jboss.as.controller.AbstractOperationContext.processStages(AbstractOperationContext.java:779)
    at org.jboss.as.controller.AbstractOperationContext.executeOperation(AbstractOperationContext.java:468)
    at org.jboss.as.controller.OperationContextImpl.executeOperation(OperationContextImpl.java:1425)
    at org.jboss.as.controller.ModelControllerImpl.internalExecute(ModelControllerImpl.java:449)
    at org.jboss.as.controller.ModelControllerImpl.lambda$executeForResponse$0(ModelControllerImpl.java:260)
    at org.jboss.as.controller.ModelControllerImpl$$Lambda$581/1350522046.run(Unknown Source)
    at org.wildfly.security.auth.server.SecurityIdentity$$Lambda$582/668496349.run(Unknown Source)
    at org.wildfly.security.auth.server.SecurityIdentity.runAs(SecurityIdentity.java:304)
    at org.wildfly.security.auth.server.SecurityIdentity.runAs(SecurityIdentity.java:270)
    at org.jboss.as.controller.ModelControllerImpl.executeForResponse(ModelControllerImpl.java:260)
    at org.jboss.as.controller.ModelControllerImpl.executeOperation(ModelControllerImpl.java:254)
    at org.jboss.as.controller.ModelControllerImpl.execute(ModelControllerImpl.java:237)
    at org.jboss.as.domain.http.server.DomainApiHandler.handleRequest(DomainApiHandler.java:212)
    at io.undertow.server.handlers.encoding.EncodingHandler.handleRequest(EncodingHandler.java:72)
    at org.jboss.as.domain.http.server.DomainApiCheckHandler.handleRequest(DomainApiCheckHandler.java:91)
    at org.jboss.as.domain.http.server.security.ElytronIdentityHandler.lambda$handleRequest$0(ElytronIdentityHandler.java:62)
    at org.jboss.as.domain.http.server.security.ElytronIdentityHandler$$Lambda$643/2000590814.run(Unknown Source)
    at org.wildfly.security.auth.server.SecurityIdentity$$Lambda$644/2094067269.run(Unknown Source)
    at org.wildfly.security.auth.server.SecurityIdentity.runAs(SecurityIdentity.java:328)
    at org.wildfly.security.auth.server.SecurityIdentity.runAs(SecurityIdentity.java:285)
    at org.jboss.as.controller.AccessAuditContext.doAs(AccessAuditContext.java:254)
    at org.jboss.as.controller.AccessAuditContext.doAs(AccessAuditContext.java:225)
    at org.jboss.as.domain.http.server.security.ElytronIdentityHandler.handleRequest(ElytronIdentityHandler.java:61)
    at io.undertow.server.handlers.BlockingHandler.handleRequest(BlockingHandler.java:56)
    at io.undertow.server.Connectors.executeRootHandler(Connectors.java:387)
The request finally times out after about 300 seconds (305000 ms in the log), with messages like the following:
[Host Controller] 11:34:35,893 INFO [org.jboss.as.controller.management-operation] (External Management Request Threads -- 1) WFLYCTL0409: Execution of operation 'query' on remote process at address '[
[Host Controller]     ("host" => "master"),
[Host Controller]     ("server" => "server-three")
[Host Controller] ]' timed out after 305000 ms while awaiting initial response; remote process has been notified to terminate operation
The threads are then released from the above state, but the Topology request fails and displays no information after that long wait. Testing with -Djboss.as.management.blocking.timeout=10000, the request times out much sooner, honoring the setting as expected, but the response still contains no information:
[Host Controller] 11:43:08,256 INFO [org.jboss.as.controller.management-operation] (External Management Request Threads -- 1) WFLYCTL0409: Execution of operation 'read-resource' on remote process at address '[
[Host Controller]     ("host" => "master"),
[Host Controller]     ("server" => "server-three")
[Host Controller] ]' timed out after 15000 ms while awaiting initial response; remote process has been notified to terminate operation
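For readers unfamiliar with the TIMED_WAITING frames in the thread dump, the following is a minimal sketch of the general pattern they suggest: a caller doing a bounded poll on a queue that the remote side never fills. It is not the actual RemoteProxyController code; the class name BoundedWaitSketch, the method awaitResponse and the 200 ms timeout are hypothetical and chosen only for illustration.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; not taken from the WildFly code base.
public class BoundedWaitSketch {

    /**
     * Waits up to timeoutMs for a response to arrive on the queue.
     * Returns null if nothing arrives in time, which a caller would
     * then report as a timeout.
     */
    static String awaitResponse(BlockingQueue<String> responses, long timeoutMs)
            throws InterruptedException {
        // While parked in poll(), the thread appears as TIMED_WAITING (parking)
        // in a thread dump.
        return responses.poll(timeoutMs, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> responses = new ArrayBlockingQueue<>(1);
        // Nothing ever offers a response, as with a paused managed server.
        long start = System.nanoTime();
        String result = awaitResponse(responses, 200);
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        System.out.println("result=" + result + " after ~" + elapsedMs + " ms");
    }
}
```

The caller is blocked for the full timeout and gets nothing back, which matches the behavior described above: a long wait followed by an empty result.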
I know we may not be investing much more into domain mode so perhaps these wouldn't be pursued at all, but a few concerns/questions here:
1. A large jboss.as.management.blocking.timeout may legitimately be needed for some long-running application deployment operations, but a timeout that long does not make sense for something a user expects to be responsive, like the topology page in the console. Would it be possible to have separate timeouts, so that long operations are still allowed but simple topology requests do not suffer an excessive delay?
2. The topology operation also seems to be all or nothing, so if one bad server times out we get no valid display even for all the other good servers. Could this be improved so that we at least get a response for the good servers?
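The two requests above could be combined roughly as in the following sketch: each server is queried in parallel with its own short timeout, and a server that does not answer is reported with a placeholder state instead of failing the whole view. This is only an illustration under stated assumptions, not how HAL or the domain controller implements it. PartialTopologySketch, readStates, the UNKNOWN placeholder, the fake reader and the names server-one and server-two are all hypothetical (only server-three appears in the logs above). It relies on CompletableFuture.completeOnTimeout, which requires Java 9 or later.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

// Hypothetical illustration only; not the HAL or WildFly implementation.
public class PartialTopologySketch {

    /** Placeholder state for a server that failed or did not answer in time. */
    static final String UNKNOWN = "UNKNOWN";

    /**
     * Queries every server in parallel. Each query gets its own timeout,
     * so one hung server cannot delay or discard the other results.
     */
    static Map<String, String> readStates(List<String> servers,
                                          Function<String, String> readState,
                                          long perServerTimeoutMs) {
        ExecutorService pool = Executors.newCachedThreadPool(r -> {
            Thread t = new Thread(r);
            t.setDaemon(true); // a stuck query must not keep the JVM alive
            return t;
        });
        try {
            Map<String, CompletableFuture<String>> pending = new LinkedHashMap<>();
            for (String server : servers) {
                pending.put(server,
                        CompletableFuture.supplyAsync(() -> readState.apply(server), pool)
                                // give up on this server only, after a short wait
                                .completeOnTimeout(UNKNOWN, perServerTimeoutMs, TimeUnit.MILLISECONDS)
                                // a failed query also degrades to the placeholder
                                .exceptionally(e -> UNKNOWN));
            }
            Map<String, String> result = new LinkedHashMap<>();
            pending.forEach((server, future) -> result.put(server, future.join()));
            return result;
        } finally {
            pool.shutdownNow(); // interrupt queries that are still blocked
        }
    }

    public static void main(String[] args) {
        // Fake reader: server-three behaves like the paused server in this report.
        Function<String, String> fakeRead = server -> {
            if (server.equals("server-three")) {
                try {
                    Thread.sleep(5_000);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
            return "RUNNING";
        };
        System.out.println(
                readStates(List.of("server-one", "server-two", "server-three"), fakeRead, 500));
    }
}
```

Because the queries run concurrently, the total wait in this sketch is bounded by roughly one per-server timeout rather than the sum of all of them, and the short timeout is independent of any longer timeout used for deployment operations.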
- causes: HAL-1822 TopologyTasks.RunningServers uses invalid resource addresses (Resolved)
- is cloned by: JBEAP-24008 [GSS](7.4.z) HAL-1795 - Domain console is not resilient to unresponsive managed server (Closed)
- is incorporated by: WFLY-16529 Upgrade HAL to 3.6.1.Final (Closed)
- is related to: HAL-1797 Make item display asynchronous (Open)
- relates to: HAL-1417 Test console with slow, brittle or broken network (Open)