-
Bug
-
Resolution: Unresolved
-
Critical
-
None
-
httpd 2.4.57.SP6 GA
-
None
-
False
-
None
-
False
-
Workaround Exists
-
-
-
-
After JBCS-1582 / MODCLUSTER-785, there are still some recovery issues for the http worker if websocket is enabled. The STATUS MCMP is only removing the error state on the child process that handles it. So in a config running multiple other child processes, 503s can continue on the other child processes possibly indefinitely or until some random point in the future when the other child processes each also receive a STATUS MCMP themselves.
In my reproduced test without a long keepalive timeout, you can see error states begin here:
127.0.0.1 - - [13/Nov/2024:17:50:13 -0500] "POST /helloworld/HelloWorld?sleep=1001 HTTP/1.0" 503 283 http://127.0.0.1:8080 2399166 140548664985280
After the process is resumed, this is the next status update served by process 2398422 and 503s continue across all processes except 2398422
127.0.0.1 - - [13/Nov/2024:17:50:35 -0500] "STATUS / HTTP/1.1" 200 70 - 2398422 140548675471040 ... 127.0.0.1 - - [13/Nov/2024:17:50:36 -0500] "POST /helloworld/HelloWorld?sleep=1002 HTTP/1.0" 503 299 http://127.0.0.1:8080 2399142 140548664985280 127.0.0.1 - - [13/Nov/2024:17:50:36 -0500] "POST /helloworld/HelloWorld?sleep=1002 HTTP/1.0" 503 299 http://127.0.0.1:8080 2399165 140548664985280 127.0.0.1 - - [13/Nov/2024:17:50:36 -0500] "POST /helloworld/HelloWorld?sleep=1002 HTTP/1.0" 503 299 http://127.0.0.1:8080 2398913 140548664985280 127.0.0.1 - - [13/Nov/2024:17:50:36 -0500] "POST /helloworld/HelloWorld?sleep=1002 HTTP/1.0" 503 299 http://127.0.0.1:8080 2398834 140548664985280 127.0.0.1 - - [13/Nov/2024:17:50:36 -0500] "POST /helloworld/HelloWorld?sleep=1002 HTTP/1.0" 503 299 http://127.0.0.1:8080 2398942 140548664985280 127.0.0.1 - - [13/Nov/2024:17:50:36 -0500] "POST /helloworld/HelloWorld?sleep=1002 HTTP/1.0" 503 299 http://127.0.0.1:8080 2398901 140548675471040 127.0.0.1 - - [13/Nov/2024:17:50:36 -0500] "POST /helloworld/HelloWorld?sleep=1002 HTTP/1.0" 503 299 http://127.0.0.1:8080 2398198 140548664985280 127.0.0.1 - - [13/Nov/2024:17:50:36 -0500] "POST /helloworld/HelloWorld?sleep=1002 HTTP/1.0" 503 299 http://127.0.0.1:8080 2398381 140548675471040 127.0.0.1 - - [13/Nov/2024:17:50:36 -0500] "POST /helloworld/HelloWorld?sleep=1002 HTTP/1.0" 503 299 http://127.0.0.1:8080 2399166 140548664985280 127.0.0.1 - - [13/Nov/2024:17:50:36 -0500] "POST /helloworld/HelloWorld?sleep=1002 HTTP/1.0" 503 299 http://127.0.0.1:8080 2398218 140548664985280 127.0.0.1 - - [13/Nov/2024:17:50:36 -0500] "POST /helloworld/HelloWorld?sleep=1002 HTTP/1.0" 200 470 http://127.0.0.1:8080 2398422 140548664985280
The next STATUS is served by process 2398199 so now 2398199 and 2398422 are the only ones serving 200s:
127.0.0.1 - - [13/Nov/2024:17:50:45 -0500] "STATUS / HTTP/1.1" 200 70 - 2398199 140548664985280 ... 127.0.0.1 - - [13/Nov/2024:17:50:53 -0500] "POST /helloworld/HelloWorld?sleep=1003 HTTP/1.0" 503 299 http://127.0.0.1:8080 2399166 140548664985280 127.0.0.1 - - [13/Nov/2024:17:50:53 -0500] "POST /helloworld/HelloWorld?sleep=1003 HTTP/1.0" 503 299 http://127.0.0.1:8080 2398377 140548664985280 127.0.0.1 - - [13/Nov/2024:17:50:53 -0500] "POST /helloworld/HelloWorld?sleep=1003 HTTP/1.0" 503 299 http://127.0.0.1:8080 2398218 140548664985280 127.0.0.1 - - [13/Nov/2024:17:50:53 -0500] "POST /helloworld/HelloWorld?sleep=1003 HTTP/1.0" 503 299 http://127.0.0.1:8080 2398216 140548664985280 127.0.0.1 - - [13/Nov/2024:17:50:53 -0500] "POST /helloworld/HelloWorld?sleep=1003 HTTP/1.0" 503 299 http://127.0.0.1:8080 2398189 140548664985280 127.0.0.1 - - [13/Nov/2024:17:50:53 -0500] "POST /helloworld/HelloWorld?sleep=1003 HTTP/1.0" 503 299 http://127.0.0.1:8080 2398955 140548675471040 127.0.0.1 - - [13/Nov/2024:17:50:53 -0500] "POST /helloworld/HelloWorld?sleep=1003 HTTP/1.0" 503 299 http://127.0.0.1:8080 2398378 140548675471040 127.0.0.1 - - [13/Nov/2024:17:50:53 -0500] "POST /helloworld/HelloWorld?sleep=1003 HTTP/1.0" 503 299 http://127.0.0.1:8080 2398381 140548675471040 127.0.0.1 - - [13/Nov/2024:17:50:53 -0500] "POST /helloworld/HelloWorld?sleep=1003 HTTP/1.0" 503 299 http://127.0.0.1:8080 2398198 140548664985280 127.0.0.1 - - [13/Nov/2024:17:50:53 -0500] "POST /helloworld/HelloWorld?sleep=1003 HTTP/1.0" 503 299 http://127.0.0.1:8080 2398901 140548675471040 127.0.0.1 - - [13/Nov/2024:17:50:53 -0500] "POST /helloworld/HelloWorld?sleep=1003 HTTP/1.0" 200 470 http://127.0.0.1:8080 2398422 140548664985280 127.0.0.1 - - [13/Nov/2024:17:50:53 -0500] "POST /helloworld/HelloWorld?sleep=1003 HTTP/1.0" 200 470 http://127.0.0.1:8080 2398199 140548675471040 127.0.0.1 - - [13/Nov/2024:17:50:53 -0500] "POST /helloworld/HelloWorld?sleep=1003 HTTP/1.0" 200 470 http://127.0.0.1:8080 2398422 140548675471040 127.0.0.1 - - [13/Nov/2024:17:50:53 -0500] "POST /helloworld/HelloWorld?sleep=1003 HTTP/1.0" 200 470 http://127.0.0.1:8080 2398199 140548664985280
That trend continues with only new child processes working again after they've served a STATUS request. If you test with a long KeepAliveTimeout, all STATUS requests likely hit the same child process so that other child processes aren't prompted to recover. For the processes that haven't received a STATUS update, they continue to disable the http worker and prolong its 503s on that process:
[Wed Nov 13 17:51:44.197196 2024] [proxy:debug] [pid 2398955:tid 2398974] proxy_util.c(2301): AH00933: http: too soon to retry worker for (127.0.0.1:8080) [Wed Nov 13 17:51:44.197201 2024] [proxy:debug] [pid 2398218:tid 2398241] proxy_util.c(2301): AH00933: http: too soon to retry worker for (127.0.0.1:8080) [Wed Nov 13 17:51:44.197202 2024] [proxy:debug] [pid 2398221:tid 2398249] proxy_util.c(2301): AH00933: http: too soon to retry worker for (127.0.0.1:8080) [Wed Nov 13 17:51:44.197204 2024] [proxy:error] [pid 2398955:tid 2398974] AH00940: http: disabled connection for (127.0.0.1:8080) [Wed Nov 13 17:51:44.197207 2024] [proxy:error] [pid 2398218:tid 2398241] AH00940: http: disabled connection for (127.0.0.1:8080) [Wed Nov 13 17:51:44.197210 2024] [proxy:error] [pid 2398221:tid 2398249] AH00940: http: disabled connection for (127.0.0.1:8080) [Wed Nov 13 17:51:44.197214 2024] [:debug] [pid 2398955:tid 2398974] mod_proxy_cluster.c(3826): proxy_cluster_post_request 503 for (balancer://mycluster) [Wed Nov 13 17:51:44.197219 2024] [:debug] [pid 2398218:tid 2398241] mod_proxy_cluster.c(3826): proxy_cluster_post_request 503 for (balancer://mycluster) [Wed Nov 13 17:51:44.197222 2024] [:debug] [pid 2398221:tid 2398249] mod_proxy_cluster.c(3826): proxy_cluster_post_request 503 for (balancer://mycluster)
But if you wait 60+ seconds as in the last reproducer iteration, then all the child processes lift the disabled state so 503s stop.
Ideally, we need to ensure the http worker recovers across all processes in a timely manner with the first status update.
- clones
-
MODCLUSTER-841 mod_proxy_cluster still does not fully recover http member if websocket is enabled
- Resolved
- is related to
-
MODCLUSTER-785 mod_proxy_cluster does not recover http member if websocket is enabled
- Resolved
-
JBCS-1582 mod_proxy_cluster does not recover http member if websocket is enabled
- Closed