JBoss Core Services / JBCS-1888

mod_proxy_cluster still does not fully recover http member if websocket is enabled

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Affects Version/s: httpd 2.4.57.SP6 GA
    • Component/s: mod_proxy_cluster
    • Workaround: Workaround Exists

      To limit how long error states may linger in child processes, disable keepalive in the mod_cluster VirtualHost and configure MPM settings to use fewer processes with a higher thread count rather than more processes with a lower thread count.
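
      As a rough sketch of that workaround (the management port and the MPM numbers are illustrative assumptions, not recommended values), the relevant pieces could look like:

      # Disable keepalive on the mod_cluster management VirtualHost so
      # recovery is not pinned to whichever child holds a persistent
      # MCMP connection.
      <VirtualHost *:6666>
      EnableMCPMReceive
      KeepAlive Off
      </VirtualHost>

      # Prefer fewer child processes with more threads each, so a single
      # STATUS update reaches a larger share of the server's workers.
      <IfModule worker.c>
      ServerLimit                  4
      StartServers                 2
      ThreadsPerChild             64
      MaxRequestWorkers          256
      </IfModule>
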
    • Steps to Reproduce:

      1. Set up Apache httpd with mod_proxy_cluster and websocket tunneling enabled (EnableWsTunnel); a minimal VirtualHost sketch follows these steps.
      2. Configure httpd with a long KeepAliveTimeout and an MPM config using many processes with few threads (this worsens the problem because more child processes go without serving a STATUS update themselves), for example:

      KeepAliveTimeout 60
      
      LoadModule mpm_worker_module modules/mod_mpm_worker.so
      
      <IfModule worker.c>
      StartServers                 2
      ServerLimit                 50
      MaxRequestWorkers          100
      MinSpareThreads             25
      MaxSpareThreads            100
      ThreadsPerChild              2
      MaxConnectionsPerChild       0
      </IfModule>
      

      3. Start JBoss/Tomcat connected to httpd via HTTP with an app deployed. An app that serves a POST with a sleep delay helps for testing, so you can use the attached helloworld.war for that.
      4. Use the following command sequence to run the test (post.txt is an arbitrary POST body file for ab). The small sleep delay on each request helps guarantee process saturation:

      ab -p post.txt -c 200 -n 200 "127.0.0.1/helloworld/HelloWorld?sleep=1000"; killall -STOP java
      ab -p post.txt -c 200 -n 200 "127.0.0.1/helloworld/HelloWorld?sleep=1001"; sleep 1; killall -CONT java; sleep 1
      ab -p post.txt -c 200 -n 200 "127.0.0.1/helloworld/HelloWorld?sleep=1002"; sleep 15
      ab -p post.txt -c 200 -n 200 "127.0.0.1/helloworld/HelloWorld?sleep=1003"; sleep 15
      ab -p post.txt -c 200 -n 200 "127.0.0.1/helloworld/HelloWorld?sleep=1004"; sleep 15
      ab -p post.txt -c 200 -n 200 "127.0.0.1/helloworld/HelloWorld?sleep=1005"; sleep 15
      ab -p post.txt -c 200 -n 200 "127.0.0.1/helloworld/HelloWorld?sleep=1006"; sleep 61
      ab -p post.txt -c 200 -n 200 "127.0.0.1/helloworld/HelloWorld?sleep=1007"
      

      After the killall -STOP, the sleep=1001 requests saturate the child processes with error states. After the java process is continued, note that requests only succeed in child processes that have themselves served a STATUS MCMP. After a 60+ second wait, you can see the sleep=1007 requests succeed across all the processes.
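
      For step 1, a minimal mod_cluster VirtualHost along these lines (the 6666 listen port is a conventional assumption, not the exact reproducer config) is enough to receive MCMP traffic with websocket tunneling on:

      Listen 6666
      <VirtualHost *:6666>
      # Receives MCMP messages (CONFIG, ENABLE-APP, STATUS, ...)
      EnableMCPMReceive
      # Enables the websocket tunnel workers involved in this issue
      EnableWsTunnel
      </VirtualHost>

      With step 2's long KeepAliveTimeout, the backend's MCMP connection tends to stay pinned to one child process, which is what keeps the other children from ever seeing a STATUS.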


      After JBCS-1582 / MODCLUSTER-785, some recovery issues remain for the http worker when websocket is enabled. The STATUS MCMP removes the error state only on the child process that handles it. So in a config running multiple child processes, 503s can continue on the other child processes, possibly indefinitely, or until some point in the future when each of those child processes happens to serve a STATUS MCMP itself.

      In my reproducer run without a long keepalive timeout, error states begin here:

      127.0.0.1 - - [13/Nov/2024:17:50:13 -0500] "POST /helloworld/HelloWorld?sleep=1001 HTTP/1.0" 503 283 http://127.0.0.1:8080 2399166 140548664985280
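
      The access log lines here end with the routed worker URL, the child PID, and the thread ID. A LogFormat like the following (an assumption about the test setup, not a confirmed format) would produce that layout:

      LogFormat "%h %l %u %t \"%r\" %>s %b %{BALANCER_WORKER_NAME}e %P %{tid}P" mpc_debug
      CustomLog logs/access_log mpc_debug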
      

      After the process is resumed, this is the next STATUS update, served by process 2398422; 503s continue across all processes except 2398422:

      127.0.0.1 - - [13/Nov/2024:17:50:35 -0500] "STATUS / HTTP/1.1" 200 70 - 2398422 140548675471040
      ...
      127.0.0.1 - - [13/Nov/2024:17:50:36 -0500] "POST /helloworld/HelloWorld?sleep=1002 HTTP/1.0" 503 299 http://127.0.0.1:8080 2399142 140548664985280
      127.0.0.1 - - [13/Nov/2024:17:50:36 -0500] "POST /helloworld/HelloWorld?sleep=1002 HTTP/1.0" 503 299 http://127.0.0.1:8080 2399165 140548664985280
      127.0.0.1 - - [13/Nov/2024:17:50:36 -0500] "POST /helloworld/HelloWorld?sleep=1002 HTTP/1.0" 503 299 http://127.0.0.1:8080 2398913 140548664985280
      127.0.0.1 - - [13/Nov/2024:17:50:36 -0500] "POST /helloworld/HelloWorld?sleep=1002 HTTP/1.0" 503 299 http://127.0.0.1:8080 2398834 140548664985280
      127.0.0.1 - - [13/Nov/2024:17:50:36 -0500] "POST /helloworld/HelloWorld?sleep=1002 HTTP/1.0" 503 299 http://127.0.0.1:8080 2398942 140548664985280
      127.0.0.1 - - [13/Nov/2024:17:50:36 -0500] "POST /helloworld/HelloWorld?sleep=1002 HTTP/1.0" 503 299 http://127.0.0.1:8080 2398901 140548675471040
      127.0.0.1 - - [13/Nov/2024:17:50:36 -0500] "POST /helloworld/HelloWorld?sleep=1002 HTTP/1.0" 503 299 http://127.0.0.1:8080 2398198 140548664985280
      127.0.0.1 - - [13/Nov/2024:17:50:36 -0500] "POST /helloworld/HelloWorld?sleep=1002 HTTP/1.0" 503 299 http://127.0.0.1:8080 2398381 140548675471040
      127.0.0.1 - - [13/Nov/2024:17:50:36 -0500] "POST /helloworld/HelloWorld?sleep=1002 HTTP/1.0" 503 299 http://127.0.0.1:8080 2399166 140548664985280
      127.0.0.1 - - [13/Nov/2024:17:50:36 -0500] "POST /helloworld/HelloWorld?sleep=1002 HTTP/1.0" 503 299 http://127.0.0.1:8080 2398218 140548664985280
      127.0.0.1 - - [13/Nov/2024:17:50:36 -0500] "POST /helloworld/HelloWorld?sleep=1002 HTTP/1.0" 200 470 http://127.0.0.1:8080 2398422 140548664985280
      

      The next STATUS is served by process 2398199, so now 2398199 and 2398422 are the only processes serving 200s:

      127.0.0.1 - - [13/Nov/2024:17:50:45 -0500] "STATUS / HTTP/1.1" 200 70 - 2398199 140548664985280
      ...
      127.0.0.1 - - [13/Nov/2024:17:50:53 -0500] "POST /helloworld/HelloWorld?sleep=1003 HTTP/1.0" 503 299 http://127.0.0.1:8080 2399166 140548664985280
      127.0.0.1 - - [13/Nov/2024:17:50:53 -0500] "POST /helloworld/HelloWorld?sleep=1003 HTTP/1.0" 503 299 http://127.0.0.1:8080 2398377 140548664985280
      127.0.0.1 - - [13/Nov/2024:17:50:53 -0500] "POST /helloworld/HelloWorld?sleep=1003 HTTP/1.0" 503 299 http://127.0.0.1:8080 2398218 140548664985280
      127.0.0.1 - - [13/Nov/2024:17:50:53 -0500] "POST /helloworld/HelloWorld?sleep=1003 HTTP/1.0" 503 299 http://127.0.0.1:8080 2398216 140548664985280
      127.0.0.1 - - [13/Nov/2024:17:50:53 -0500] "POST /helloworld/HelloWorld?sleep=1003 HTTP/1.0" 503 299 http://127.0.0.1:8080 2398189 140548664985280
      127.0.0.1 - - [13/Nov/2024:17:50:53 -0500] "POST /helloworld/HelloWorld?sleep=1003 HTTP/1.0" 503 299 http://127.0.0.1:8080 2398955 140548675471040
      127.0.0.1 - - [13/Nov/2024:17:50:53 -0500] "POST /helloworld/HelloWorld?sleep=1003 HTTP/1.0" 503 299 http://127.0.0.1:8080 2398378 140548675471040
      127.0.0.1 - - [13/Nov/2024:17:50:53 -0500] "POST /helloworld/HelloWorld?sleep=1003 HTTP/1.0" 503 299 http://127.0.0.1:8080 2398381 140548675471040
      127.0.0.1 - - [13/Nov/2024:17:50:53 -0500] "POST /helloworld/HelloWorld?sleep=1003 HTTP/1.0" 503 299 http://127.0.0.1:8080 2398198 140548664985280
      127.0.0.1 - - [13/Nov/2024:17:50:53 -0500] "POST /helloworld/HelloWorld?sleep=1003 HTTP/1.0" 503 299 http://127.0.0.1:8080 2398901 140548675471040
      127.0.0.1 - - [13/Nov/2024:17:50:53 -0500] "POST /helloworld/HelloWorld?sleep=1003 HTTP/1.0" 200 470 http://127.0.0.1:8080 2398422 140548664985280
      127.0.0.1 - - [13/Nov/2024:17:50:53 -0500] "POST /helloworld/HelloWorld?sleep=1003 HTTP/1.0" 200 470 http://127.0.0.1:8080 2398199 140548675471040
      127.0.0.1 - - [13/Nov/2024:17:50:53 -0500] "POST /helloworld/HelloWorld?sleep=1003 HTTP/1.0" 200 470 http://127.0.0.1:8080 2398422 140548675471040
      127.0.0.1 - - [13/Nov/2024:17:50:53 -0500] "POST /helloworld/HelloWorld?sleep=1003 HTTP/1.0" 200 470 http://127.0.0.1:8080 2398199 140548664985280
      

      That trend continues, with each child process recovering only after it has served a STATUS request itself. If you test with a long KeepAliveTimeout, all STATUS requests likely hit the same child process, so the other child processes are never prompted to recover. Processes that haven't received a STATUS update keep the http worker disabled and prolong its 503s:

      [Wed Nov 13 17:51:44.197196 2024] [proxy:debug] [pid 2398955:tid 2398974] proxy_util.c(2301): AH00933: http: too soon to retry worker for (127.0.0.1:8080)
      [Wed Nov 13 17:51:44.197201 2024] [proxy:debug] [pid 2398218:tid 2398241] proxy_util.c(2301): AH00933: http: too soon to retry worker for (127.0.0.1:8080)
      [Wed Nov 13 17:51:44.197202 2024] [proxy:debug] [pid 2398221:tid 2398249] proxy_util.c(2301): AH00933: http: too soon to retry worker for (127.0.0.1:8080)
      [Wed Nov 13 17:51:44.197204 2024] [proxy:error] [pid 2398955:tid 2398974] AH00940: http: disabled connection for (127.0.0.1:8080)
      [Wed Nov 13 17:51:44.197207 2024] [proxy:error] [pid 2398218:tid 2398241] AH00940: http: disabled connection for (127.0.0.1:8080)
      [Wed Nov 13 17:51:44.197210 2024] [proxy:error] [pid 2398221:tid 2398249] AH00940: http: disabled connection for (127.0.0.1:8080)
      [Wed Nov 13 17:51:44.197214 2024] [:debug] [pid 2398955:tid 2398974] mod_proxy_cluster.c(3826): proxy_cluster_post_request 503 for (balancer://mycluster)
      [Wed Nov 13 17:51:44.197219 2024] [:debug] [pid 2398218:tid 2398241] mod_proxy_cluster.c(3826): proxy_cluster_post_request 503 for (balancer://mycluster)
      [Wed Nov 13 17:51:44.197222 2024] [:debug] [pid 2398221:tid 2398249] mod_proxy_cluster.c(3826): proxy_cluster_post_request 503 for (balancer://mycluster)
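
      These proxy:debug lines only appear with debug logging enabled; a global LogLevel debug (or per-module levels such as proxy:debug, assuming the relevant modules register for them) surfaces them:

      LogLevel debug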
      

      But if you wait 60+ seconds, as in the last reproducer iteration, all the child processes lift the disabled state and the 503s stop. That window is consistent with mod_proxy's default worker retry interval of 60 seconds.

      Ideally, the http worker should recover across all child processes in a timely manner with the first STATUS update, not only in the child process that happens to handle it.

              Assignee: rhn-engineering-jclere (Jean-Frederic Clere)
              Reporter: rhn-support-aogburn (Aaron Ogburn)
              Votes: 0
              Watchers: 3