mod_cluster / MODCLUSTER-732

mod_cluster never removes hung JVM that has requests routed to it


      An easy way to reproduce a hung JVM that is never removed is to suspend its process:

      kill -STOP $PID
      

      If a backend JVM is entirely hung (its socket is still listening, but it never processes requests and never sends STATUS MCMP messages), mod_cluster currently handles it poorly: traffic is never routed away from the bad instance, and the instance is never removed from the balancer.

      In this state, every request times out, but the timeouts do not put the balancer member into an error state, so requests keep being routed to it. Periodic pings may be attempted and will fail, but that does not stop requests to the problem instance. After 60 ping failures the node could be removed, but the logic here is flawed: any attempted request (which still times out) resets the failure count:

                  if (elected == oldelected) {
      ...
                  } else
                      ou->mess.num_failure_idle = 0;
      

      At a minimum, continually failing request attempts should not reset the ping failure count and thereby prevent node removal. We may also consider blocking all requests to a JVM while its pings are failing.
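
      The proposed behavior could be sketched roughly as follows. This is not the mod_cluster source: the struct, the function names, and the routing check are hypothetical and exist only for illustration. Only the field name num_failure_idle and the threshold of 60 ping failures come from the description above. The idea is that a failed ping increments the counter, only a successful ping or a successful request resets it, and a failed request leaves it untouched.

```c
#include <stdbool.h>

/* Threshold taken from the "60 ping failures" mentioned in the report. */
#define MAX_IDLE_PING_FAILURES 60

/* Hypothetical per-node state; only num_failure_idle mirrors a name
 * from the snippet quoted in the report. */
struct node_health {
    int  num_failure_idle; /* consecutive failed idle pings */
    bool removed;          /* node has been dropped from the balancer */
};

/* Record the outcome of a periodic ping. A success clears the counter;
 * a failure increments it and removes the node at the threshold. */
static void record_ping(struct node_health *n, bool ok)
{
    if (n->removed)
        return;
    if (ok) {
        n->num_failure_idle = 0;
        return;
    }
    if (++n->num_failure_idle >= MAX_IDLE_PING_FAILURES)
        n->removed = true;
}

/* Record the outcome of a routed request. Only a successful request
 * resets the ping failure count; a failed (timed out) request does not. */
static void record_request(struct node_health *n, bool ok)
{
    if (ok && !n->removed)
        n->num_failure_idle = 0;
}

/* Optional stricter policy from the report: do not route requests to a
 * node whose pings are currently failing. */
static bool node_is_routable(const struct node_health *n)
{
    return !n->removed && n->num_failure_idle == 0;
}
```

      With this sketch, a node that fails every ping and every request reaches the threshold after 60 pings and is removed, whereas resetting the counter on every attempted request would keep it in the balancer indefinitely.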

              rhn-engineering-jclere Jean-Frederic Clere
              rhn-support-aogburn Aaron Ogburn
              Paul Lodge Paul Lodge