JBoss Core Services / JBCS-1888

mod_proxy_cluster still does not fully recover http member if websocket is enabled

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Affects Version/s: httpd 2.4.57.SP6 GA
    • Component/s: mod_proxy_cluster
    • Workaround: Workaround Exists

      To limit how long error states may linger in child processes, disable keepalive in the mod_cluster VirtualHost and configure MPM settings to use fewer processes with a higher thread count rather than more processes with a lower thread count.
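
      As a rough sketch of that workaround (the management port and the MPM numbers are illustrative assumptions, not recommended values), the relevant pieces could look like:

      # Disable keepalive on the mod_cluster management VirtualHost so
      # recovery is not pinned to whichever child holds a persistent
      # MCMP connection.
      <VirtualHost *:6666>
      EnableMCPMReceive
      KeepAlive Off
      </VirtualHost>

      # Prefer fewer child processes with more threads each, so a single
      # STATUS update reaches a larger share of the server's workers.
      <IfModule worker.c>
      ServerLimit                  4
      StartServers                 2
      ThreadsPerChild             64
      MaxRequestWorkers          256
      </IfModule>
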
    • Steps to Reproduce:

      1. Set up Apache httpd with mod_proxy_cluster and websocket tunneling enabled (EnableWsTunnel); a minimal VirtualHost sketch follows these steps.
      2. Configure httpd with a long KeepAliveTimeout and an MPM config using many processes with few threads (this worsens the problem because more child processes go without serving a STATUS update themselves), for example:

      KeepAliveTimeout 60
      
      LoadModule mpm_worker_module modules/mod_mpm_worker.so
      
      <IfModule worker.c>
      StartServers                 2
      ServerLimit                 50
      MaxRequestWorkers          100
      MinSpareThreads             25
      MaxSpareThreads            100
      ThreadsPerChild              2
      MaxConnectionsPerChild       0
      </IfModule>
      

      3. Start JBoss/Tomcat connected to httpd via HTTP with an app deployed. An app that serves a POST with a sleep delay helps for testing, so you can use the attached helloworld.war for that.
      4. Use the following command sequence to run the test (post.txt is an arbitrary POST body file for ab). The small sleep delay on each request helps guarantee process saturation:

      ab -p post.txt -c 200 -n 200 "127.0.0.1/helloworld/HelloWorld?sleep=1000"; killall -STOP java
      ab -p post.txt -c 200 -n 200 "127.0.0.1/helloworld/HelloWorld?sleep=1001"; sleep 1; killall -CONT java; sleep 1
      ab -p post.txt -c 200 -n 200 "127.0.0.1/helloworld/HelloWorld?sleep=1002"; sleep 15
      ab -p post.txt -c 200 -n 200 "127.0.0.1/helloworld/HelloWorld?sleep=1003"; sleep 15
      ab -p post.txt -c 200 -n 200 "127.0.0.1/helloworld/HelloWorld?sleep=1004"; sleep 15
      ab -p post.txt -c 200 -n 200 "127.0.0.1/helloworld/HelloWorld?sleep=1005"; sleep 15
      ab -p post.txt -c 200 -n 200 "127.0.0.1/helloworld/HelloWorld?sleep=1006"; sleep 61
      ab -p post.txt -c 200 -n 200 "127.0.0.1/helloworld/HelloWorld?sleep=1007"
      

      After the killall -STOP, the sleep=1001 requests saturate the child processes with error states. After the java process is continued, note that requests only succeed in child processes that have themselves served a STATUS MCMP. After a 60+ second wait, you can see the sleep=1007 requests succeed across all the processes.
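
      For step 1, a minimal mod_cluster VirtualHost along these lines (the 6666 listen port is a conventional assumption, not the exact reproducer config) is enough to receive MCMP traffic with websocket tunneling on:

      Listen 6666
      <VirtualHost *:6666>
      # Receives MCMP messages (CONFIG, ENABLE-APP, STATUS, ...)
      EnableMCPMReceive
      # Enables the websocket tunnel workers involved in this issue
      EnableWsTunnel
      </VirtualHost>

      With step 2's long KeepAliveTimeout, the backend's MCMP connection tends to stay pinned to one child process, which is what keeps the other children from ever seeing a STATUS.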


      After JBCS-1582 / MODCLUSTER-785, some recovery issues remain for the http worker when websocket is enabled. The STATUS MCMP removes the error state only on the child process that handles it. So in a config running multiple child processes, 503s can continue on the other child processes, possibly indefinitely, or until some point in the future when each of those child processes happens to serve a STATUS MCMP itself.

      In my reproducer run without a long keepalive timeout, error states begin here:

      127.0.0.1 - - [13/Nov/2024:17:50:13 -0500] "POST /helloworld/HelloWorld?sleep=1001 HTTP/1.0" 503 283 http://127.0.0.1:8080 2399166 140548664985280
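
      The access log lines here end with the routed worker URL, the child PID, and the thread ID. A LogFormat like the following (an assumption about the test setup, not a confirmed format) would produce that layout:

      LogFormat "%h %l %u %t \"%r\" %>s %b %{BALANCER_WORKER_NAME}e %P %{tid}P" mpc_debug
      CustomLog logs/access_log mpc_debug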
      

      After the process is resumed, this is the next STATUS update, served by process 2398422; 503s continue across all processes except 2398422:

      127.0.0.1 - - [13/Nov/2024:17:50:35 -0500] "STATUS / HTTP/1.1" 200 70 - 2398422 140548675471040
      ...
      127.0.0.1 - - [13/Nov/2024:17:50:36 -0500] "POST /helloworld/HelloWorld?sleep=1002 HTTP/1.0" 503 299 http://127.0.0.1:8080 2399142 140548664985280
      127.0.0.1 - - [13/Nov/2024:17:50:36 -0500] "POST /helloworld/HelloWorld?sleep=1002 HTTP/1.0" 503 299 http://127.0.0.1:8080 2399165 140548664985280
      127.0.0.1 - - [13/Nov/2024:17:50:36 -0500] "POST /helloworld/HelloWorld?sleep=1002 HTTP/1.0" 503 299 http://127.0.0.1:8080 2398913 140548664985280
      127.0.0.1 - - [13/Nov/2024:17:50:36 -0500] "POST /helloworld/HelloWorld?sleep=1002 HTTP/1.0" 503 299 http://127.0.0.1:8080 2398834 140548664985280
      127.0.0.1 - - [13/Nov/2024:17:50:36 -0500] "POST /helloworld/HelloWorld?sleep=1002 HTTP/1.0" 503 299 http://127.0.0.1:8080 2398942 140548664985280
      127.0.0.1 - - [13/Nov/2024:17:50:36 -0500] "POST /helloworld/HelloWorld?sleep=1002 HTTP/1.0" 503 299 http://127.0.0.1:8080 2398901 140548675471040
      127.0.0.1 - - [13/Nov/2024:17:50:36 -0500] "POST /helloworld/HelloWorld?sleep=1002 HTTP/1.0" 503 299 http://127.0.0.1:8080 2398198 140548664985280
      127.0.0.1 - - [13/Nov/2024:17:50:36 -0500] "POST /helloworld/HelloWorld?sleep=1002 HTTP/1.0" 503 299 http://127.0.0.1:8080 2398381 140548675471040
      127.0.0.1 - - [13/Nov/2024:17:50:36 -0500] "POST /helloworld/HelloWorld?sleep=1002 HTTP/1.0" 503 299 http://127.0.0.1:8080 2399166 140548664985280
      127.0.0.1 - - [13/Nov/2024:17:50:36 -0500] "POST /helloworld/HelloWorld?sleep=1002 HTTP/1.0" 503 299 http://127.0.0.1:8080 2398218 140548664985280
      127.0.0.1 - - [13/Nov/2024:17:50:36 -0500] "POST /helloworld/HelloWorld?sleep=1002 HTTP/1.0" 200 470 http://127.0.0.1:8080 2398422 140548664985280
      

      The next STATUS is served by process 2398199, so now 2398199 and 2398422 are the only processes serving 200s:

      127.0.0.1 - - [13/Nov/2024:17:50:45 -0500] "STATUS / HTTP/1.1" 200 70 - 2398199 140548664985280
      ...
      127.0.0.1 - - [13/Nov/2024:17:50:53 -0500] "POST /helloworld/HelloWorld?sleep=1003 HTTP/1.0" 503 299 http://127.0.0.1:8080 2399166 140548664985280
      127.0.0.1 - - [13/Nov/2024:17:50:53 -0500] "POST /helloworld/HelloWorld?sleep=1003 HTTP/1.0" 503 299 http://127.0.0.1:8080 2398377 140548664985280
      127.0.0.1 - - [13/Nov/2024:17:50:53 -0500] "POST /helloworld/HelloWorld?sleep=1003 HTTP/1.0" 503 299 http://127.0.0.1:8080 2398218 140548664985280
      127.0.0.1 - - [13/Nov/2024:17:50:53 -0500] "POST /helloworld/HelloWorld?sleep=1003 HTTP/1.0" 503 299 http://127.0.0.1:8080 2398216 140548664985280
      127.0.0.1 - - [13/Nov/2024:17:50:53 -0500] "POST /helloworld/HelloWorld?sleep=1003 HTTP/1.0" 503 299 http://127.0.0.1:8080 2398189 140548664985280
      127.0.0.1 - - [13/Nov/2024:17:50:53 -0500] "POST /helloworld/HelloWorld?sleep=1003 HTTP/1.0" 503 299 http://127.0.0.1:8080 2398955 140548675471040
      127.0.0.1 - - [13/Nov/2024:17:50:53 -0500] "POST /helloworld/HelloWorld?sleep=1003 HTTP/1.0" 503 299 http://127.0.0.1:8080 2398378 140548675471040
      127.0.0.1 - - [13/Nov/2024:17:50:53 -0500] "POST /helloworld/HelloWorld?sleep=1003 HTTP/1.0" 503 299 http://127.0.0.1:8080 2398381 140548675471040
      127.0.0.1 - - [13/Nov/2024:17:50:53 -0500] "POST /helloworld/HelloWorld?sleep=1003 HTTP/1.0" 503 299 http://127.0.0.1:8080 2398198 140548664985280
      127.0.0.1 - - [13/Nov/2024:17:50:53 -0500] "POST /helloworld/HelloWorld?sleep=1003 HTTP/1.0" 503 299 http://127.0.0.1:8080 2398901 140548675471040
      127.0.0.1 - - [13/Nov/2024:17:50:53 -0500] "POST /helloworld/HelloWorld?sleep=1003 HTTP/1.0" 200 470 http://127.0.0.1:8080 2398422 140548664985280
      127.0.0.1 - - [13/Nov/2024:17:50:53 -0500] "POST /helloworld/HelloWorld?sleep=1003 HTTP/1.0" 200 470 http://127.0.0.1:8080 2398199 140548675471040
      127.0.0.1 - - [13/Nov/2024:17:50:53 -0500] "POST /helloworld/HelloWorld?sleep=1003 HTTP/1.0" 200 470 http://127.0.0.1:8080 2398422 140548675471040
      127.0.0.1 - - [13/Nov/2024:17:50:53 -0500] "POST /helloworld/HelloWorld?sleep=1003 HTTP/1.0" 200 470 http://127.0.0.1:8080 2398199 140548664985280
      

      That trend continues, with each child process recovering only after it has served a STATUS request itself. If you test with a long KeepAliveTimeout, all STATUS requests likely hit the same child process, so the other child processes are never prompted to recover. Processes that haven't received a STATUS update keep the http worker disabled and prolong its 503s:

      [Wed Nov 13 17:51:44.197196 2024] [proxy:debug] [pid 2398955:tid 2398974] proxy_util.c(2301): AH00933: http: too soon to retry worker for (127.0.0.1:8080)
      [Wed Nov 13 17:51:44.197201 2024] [proxy:debug] [pid 2398218:tid 2398241] proxy_util.c(2301): AH00933: http: too soon to retry worker for (127.0.0.1:8080)
      [Wed Nov 13 17:51:44.197202 2024] [proxy:debug] [pid 2398221:tid 2398249] proxy_util.c(2301): AH00933: http: too soon to retry worker for (127.0.0.1:8080)
      [Wed Nov 13 17:51:44.197204 2024] [proxy:error] [pid 2398955:tid 2398974] AH00940: http: disabled connection for (127.0.0.1:8080)
      [Wed Nov 13 17:51:44.197207 2024] [proxy:error] [pid 2398218:tid 2398241] AH00940: http: disabled connection for (127.0.0.1:8080)
      [Wed Nov 13 17:51:44.197210 2024] [proxy:error] [pid 2398221:tid 2398249] AH00940: http: disabled connection for (127.0.0.1:8080)
      [Wed Nov 13 17:51:44.197214 2024] [:debug] [pid 2398955:tid 2398974] mod_proxy_cluster.c(3826): proxy_cluster_post_request 503 for (balancer://mycluster)
      [Wed Nov 13 17:51:44.197219 2024] [:debug] [pid 2398218:tid 2398241] mod_proxy_cluster.c(3826): proxy_cluster_post_request 503 for (balancer://mycluster)
      [Wed Nov 13 17:51:44.197222 2024] [:debug] [pid 2398221:tid 2398249] mod_proxy_cluster.c(3826): proxy_cluster_post_request 503 for (balancer://mycluster)
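
      These proxy:debug lines only appear with debug logging enabled; a global LogLevel debug (or per-module levels such as proxy:debug, assuming the relevant modules register for them) surfaces them:

      LogLevel debug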
      

      But if you wait 60+ seconds, as in the last reproducer iteration, all the child processes lift the disabled state and the 503s stop. That window is consistent with mod_proxy's default worker retry interval of 60 seconds.

      Ideally, the http worker should recover across all child processes in a timely manner with the first STATUS update, not only in the child process that happens to handle it.

              Assignee: rhn-engineering-jclere (Jean-Frederic Clere)
              Reporter: rhn-support-aogburn (Aaron Ogburn)
              Votes: 0
              Watchers: 3