Description of problem: When a remote node connection resource with a non-zero reconnect-interval attribute is configured to prefer a certain cluster node and that preferred node becomes unable to host it, the cluster attempts to migrate the connection resource back to its preferred node once the reconnect-interval expires. If the preferred node is still unable to start the connection resource, the migration fails and the remote node gets fenced.
If migration of a remote node fails, but the connection can be maintained at the old location, no fencing should be scheduled.
Moreover, after the migration failure the remote connection resource currently remains stopped until the next cluster-recheck-interval, which might be a scheduler bug. Once the failure is recorded, the next scheduler run should schedule all appropriate actions.
Version-Release number of selected component (if applicable): pacemaker-2.0.3-4.el8
Steps to Reproduce:
- Configure a 2-node cluster with a third Pacemaker Remote node
- Set "reconnect-interval=120s" and "meta migration-threshold=1" attributes on the remote connection resource
- Make sure cluster-recheck-interval is much longer than the above reconnect-interval (e.g. 15min, the default)
- Create a location constraint for the remote connection resource to prefer running on a particular cluster node
- Verify the remote connection resource is started on its preferred node
- Block the network connection between the preferred node and the remote node (e.g. block the outgoing connection in the preferred node's firewall)
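The steps above can be sketched as a shell script. This is only an illustration, not the reporter's exact commands: the names node1 and remote1, the address 192.0.2.10 and the use of iptables are assumptions, and the location constraint is created without an explicit score, so pcs applies its default. The cluster itself is assumed to be already set up, and the firewall rule is meant to be run on the preferred node. The script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Hypothetical names and address; adjust to the cluster under test.
PREFERRED_NODE="${PREFERRED_NODE:-node1}"
REMOTE_NODE="${REMOTE_NODE:-remote1}"
REMOTE_ADDR="${REMOTE_ADDR:-192.0.2.10}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

reproduce_steps() {
    # Add the Pacemaker Remote node; the connection resource gets the same ID.
    run pcs cluster node add-remote "$REMOTE_NODE" "$REMOTE_ADDR"
    # Set reconnect-interval and migration-threshold on the connection resource.
    run pcs resource update "$REMOTE_NODE" reconnect-interval=120s meta migration-threshold=1
    # Keep cluster-recheck-interval much longer than reconnect-interval.
    run pcs property set cluster-recheck-interval=15min
    # Make the connection resource prefer one cluster node.
    run pcs constraint location "$REMOTE_NODE" prefers "$PREFERRED_NODE"
    # Check where the connection resource is running.
    run pcs status
    # On the preferred node: block outgoing traffic to the remote node
    # (3121 is the default Pacemaker Remote port).
    run iptables -A OUTPUT -d "$REMOTE_ADDR" -p tcp --dport 3121 -j DROP
}

reproduce_steps
```

After the last command, watching the cluster status for longer than reconnect-interval should show the behaviour described under "Actual results".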
Actual results: The remote connection resource first moves to the less-preferred cluster node. After reconnect-interval (2 minutes) expires, it tries to migrate back to its preferred cluster node and fails, which results in the remote node being fenced and the remote connection resource remaining stopped until the next cluster-recheck-interval fires.
Expected results: A migration failure should not be fatal. The remote node should never be fenced or marked unclean, i.e. it should keep providing uninterrupted service.