Ansible Automation Platform RFEs / AAPRFE-1013

Improve cluster resiliency when individual nodes are slow



      This RFE addresses a very common scenario: the first instance in a multi-node instance group is a hybrid node that is currently dispatching jobs out to the cluster.

      When this first hybrid node becomes very slow, e.g. due to an out-of-memory situation or CPU exhaustion, it often fails to handle its multiple tasks in time: it executes jobs slowly, processes the job queue slowly, and dispatches jobs slowly. Moreover, receptor may occasionally be killed by the oom-killer when memory-hungry jobs are at play, only to be restarted immediately afterwards. As a result, the execution part of this hybrid node flip-flops into and out of the cluster: it is added back to the cluster and receives jobs to execute, only to fall off the cluster a few minutes later because of its slow responses.

       

      The end result is messy: a single slow first node effectively causes the whole cluster to fail to process most, if not all, jobs.

       

      This RFE proposes a handful of methods to better handle such slow instances:

      • On the dispatcher side, each time a task takes X seconds to ack/pub (which already gets logged because it is considered a slow response), add those seconds as points to a per-node score that every cluster member checks. The higher a node's score, the lower its priority in the cluster. Once a node accumulates Y points within T seconds, it is marked offline, since it is then considered a recurring offender that could slow down the whole cluster (a sketch of this point system follows after this list).
      • At a higher level, the affected cluster member then performs a series of health checks and measures the results against predetermined thresholds. Once those health checks pass, the member re-adds itself to the cluster.
      • Alternatively to the self-health-check suggestion, the "offender" node could simply be put back into the cluster in the last position and with a forcibly lower fork count than it originally had, giving it some slack in terms of job load. After a given interval, a "rehabilitated" cluster member (i.e. one that is back in the cluster and has not shown any failures since rejoining) might regain its former position as the first instance in the group (see the second sketch below).
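
      As a rough illustration of the first proposal, below is a minimal Python sketch of the point system. The names and default values (SlowAckTracker, offline_threshold, window_seconds, the 2-second slow-ack cutoff) are hypothetical and not part of any existing AWX or dispatcher API; the sketch only shows how slow ack/pub durations could feed a rolling per-node score.

        import time
        from collections import deque


        class SlowAckTracker:
            """Accumulate penalty points for slow ack/pub events and decide
            when a node is a recurring offender that should be marked offline."""

            def __init__(self, offline_threshold=300, window_seconds=900):
                # "Y points within T seconds" from the proposal above.
                self.offline_threshold = offline_threshold
                self.window_seconds = window_seconds
                self.events = deque()  # (timestamp, penalty_points) per slow event

            def record_ack(self, ack_seconds, slow_after=2.0):
                # Only acks slower than slow_after seconds add points; the points
                # added equal the number of seconds the ack took ("X seconds").
                if ack_seconds >= slow_after:
                    self.events.append((time.monotonic(), ack_seconds))

            def current_points(self):
                # Expire events that fell out of the rolling window, then sum.
                cutoff = time.monotonic() - self.window_seconds
                while self.events and self.events[0][0] < cutoff:
                    self.events.popleft()
                return sum(points for _, points in self.events)

            def should_mark_offline(self):
                return self.current_points() >= self.offline_threshold

      A scheduler could then sort instances by current_points() in ascending order, so that a higher score translates into lower priority, and use should_mark_offline() to decide when to drop a node from the cluster entirely.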
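
      Similarly, a minimal sketch of the demotion-and-rehabilitation idea from the last two bullets follows. Again, all names (InstanceState, readmit_demoted, maybe_rehabilitate) and the default penalty and interval values are hypothetical assumptions, not existing Ansible Automation Platform behavior.

        import time
        from dataclasses import dataclass


        @dataclass
        class InstanceState:
            hostname: str
            original_forks: int             # capacity before any demotion
            forks: int                      # capacity currently granted
            rejoined_at: float | None = None
            failures_since_rejoin: int = 0


        def readmit_demoted(group, node, fork_penalty=0.5):
            """Re-add an offender node at the last position with reduced forks."""
            node.forks = max(1, int(node.original_forks * fork_penalty))
            node.rejoined_at = time.monotonic()
            node.failures_since_rejoin = 0
            group.append(node)              # last position = lowest priority


        def maybe_rehabilitate(group, node, clean_interval=1800.0):
            """Restore full capacity and the original first position once the
            node has run cleanly for clean_interval seconds since rejoining."""
            if node.rejoined_at is None or node.failures_since_rejoin:
                return
            if time.monotonic() - node.rejoined_at >= clean_interval:
                node.forks = node.original_forks
                group.remove(node)
                group.insert(0, node)       # regain the former first position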

       

      As is hopefully clear from the proposals above, the overarching goal of this RFE is to keep clusters operational even when a subset of their members is faulty or slow. Detecting such faults and slowness and handling them appropriately is what this RFE is about.
