- Feature Request
- Resolution: Unresolved
- Major
- None
- 2.4
- False
- False
This concerns a very common scenario in which the first instance in a multi-node instance group is the hybrid node currently dispatching jobs out to the cluster.
When this first hybrid node becomes very slow, e.g. due to an out-of-memory situation or CPU exhaustion, it often fails to handle its multiple tasks in due time: it executes jobs slowly, processes the job queue slowly, and dispatches jobs slowly. Moreover, receptor may be killed from time to time by the oom-killer when memory-hungry jobs are at play, only to be restarted shortly afterwards. As a result, the execution part of this hybrid node flip-flops into and out of the cluster: it is added back to the cluster and receives jobs to execute, only to fall off the cluster a few minutes later due to slow responses.
The end result is messy: a single slow first node can effectively prevent the whole cluster from processing most, if not all, jobs.
This RFE proposes a handful of methods to better handle such slow instances:
- On the dispatcher side, each time a task takes X seconds to ack/pub (which already gets logged as a slow response), add those seconds to a point system that each cluster member checks. The more points a node has, the lower its priority in the cluster. Once a node accumulates Y points within T seconds, it is marked offline, as it is then considered a recurring offender that could slow down the whole cluster.
- Then, at a higher level, the affected cluster member performs a series of health checks and measures them against predetermined values. Once those health checks pass, the member re-adds itself to the cluster.
- As an alternative to the self-health-check suggestion, the "offender" node could simply be put back into the cluster at "the last position" and at a forcibly lower fork count than it originally had. This would give the offender node some slack in terms of job load. After a given interval, a "rehabilitated" cluster member (i.e. one that is back in the cluster and has shown no failures since re-joining) might regain its former position as the first instance in the group.
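The point system in the first proposal could be sketched as a sliding-window penalty tracker. All names and thresholds below are hypothetical illustrations, not part of any existing dispatcher API; X, Y and T from the proposal map to `slow_threshold_s`, `point_budget` and `window_s`:

```python
import time
from collections import deque


class SlownessTracker:
    """Hypothetical sliding-window point system for one cluster node.

    Each slow ack/pub adds points equal to the seconds the response took.
    Once the points accumulated within the window exceed the budget, the
    node is considered a recurring offender and should be marked offline.
    """

    def __init__(self, slow_threshold_s=5, point_budget=60, window_s=300):
        self.slow_threshold_s = slow_threshold_s  # X: seconds before a response counts as slow
        self.point_budget = point_budget          # Y: points tolerated within the window
        self.window_s = window_s                  # T: sliding window length in seconds
        self.events = deque()                     # (timestamp, points) pairs, oldest first

    def record_response(self, duration_s, now=None):
        """Record one ack/pub duration; return True if the node should go offline."""
        now = time.monotonic() if now is None else now
        if duration_s >= self.slow_threshold_s:
            self.events.append((now, duration_s))
        # Expire events that fell out of the window.
        while self.events and self.events[0][0] < now - self.window_s:
            self.events.popleft()
        return self.points(now) > self.point_budget

    def points(self, now=None):
        """Sum of points currently inside the window."""
        now = time.monotonic() if now is None else now
        return sum(p for t, p in self.events if t >= now - self.window_s)
```

A dispatcher-side loop would call `record_response()` with each measured ack/pub duration and take the node offline when it returns True; because old events expire, a node that stops being slow naturally drops back under the budget.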
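The self-health-check in the second proposal might look like the following sketch. The specific checks and thresholds are illustrative assumptions; a real implementation would measure whatever actually made the node slow (memory pressure, CPU load, disk):

```python
import os
import shutil


def health_checks():
    """Hypothetical local checks a member runs before re-adding itself.

    Returns a dict of named boolean results measured against
    predetermined (here: illustrative) thresholds.
    """
    load1, _, _ = os.getloadavg()
    disk = shutil.disk_usage("/")
    return {
        "load_ok": load1 < os.cpu_count(),          # 1-min load below core count
        "disk_ok": disk.free / disk.total > 0.10,   # more than 10% disk free
    }


def may_rejoin(checks):
    """The member re-adds itself only once every check passes."""
    return all(checks.values())
```

The member would run `health_checks()` periodically while offline and re-register with the cluster only when `may_rejoin()` returns True.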
As is hopefully clear from the proposals above, the overarching goal of this RFE is to keep clusters operational even when a subset of their members is faulty or slow. Detecting such faults and slowness, and handling them appropriately, is what this RFE is about.
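The reduced-capacity rejoin and later rehabilitation from the third proposal could be sketched as follows. The cluster is modeled as an ordered list where earlier members receive jobs first; the class and function names are hypothetical:

```python
class ClusterMember:
    """Minimal stand-in for one instance in an ordered instance group."""

    def __init__(self, name, forks):
        self.name = name
        self.forks = forks


def rejoin_reduced(cluster, offender, original_forks, reduction=0.5):
    """Re-add an offender at the last position with a reduced fork count.

    The last position and the lower fork count together give the node
    slack in terms of job load while it proves itself again.
    """
    offender.forks = max(1, int(original_forks * reduction))
    cluster.append(offender)  # last position = lowest dispatch priority
    return cluster


def rehabilitate(cluster, member, original_forks, failures_since_rejoin):
    """Restore forks and first position once the member is failure-free."""
    if failures_since_rejoin == 0:
        member.forks = original_forks
        cluster.remove(member)
        cluster.insert(0, member)  # regain former position as first instance
    return cluster
```

After the configured interval, the scheduler would call `rehabilitate()` with the failure count observed since the rejoin; any failure leaves the member in its demoted position.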