There are 3 phases in a backup RPC:
1. Sender -> Local site master: failures here are caused by the site master shutting down or crashing, or by a network split.
2. Local site master -> Remote site master:
2.1. Local site master is no longer a site master, e.g. because it's shutting down or because it's no longer coordinator after a merge.
2.2. Remote site master is no longer a site master.
2.3. Link between local site and remote site is down.
3. Remote site master -> Backup targets
Replication failures in phase 3 are handled by retrying (except for TimeoutExceptions), because BaseBackupReceiver uses regular cache methods to perform the updates.
Replication failures in phases 1 and 2, however, are not handled at all; their only effect is that the remote site is taken offline after a certain number of failures (if backup is synchronous). We should instead retry backup RPCs when we get a SuspectException or UnreachableException, and perhaps even when we get no response (2.2?), and only stop when the timeout expires or when the backup is taken offline.
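The proposed retry behaviour could look roughly like the sketch below. It is only an illustration under assumptions, not the actual implementation: `BackupRetrySketch`, `invokeWithRetry`, the `retryDelayMillis` parameter and the `siteOffline` check are hypothetical names, and the two nested exception classes are local stand-ins for the real SuspectException and UnreachableException so that the example compiles on its own.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

public class BackupRetrySketch {
    // Hypothetical stand-ins for the real exception types, kept local so the sketch is self-contained.
    public static class SuspectException extends RuntimeException {}
    public static class UnreachableException extends RuntimeException {}

    /**
     * Invokes the backup RPC and retries on SuspectException/UnreachableException
     * until it succeeds, the timeout expires, or the backup site is taken offline.
     */
    public static <T> T invokeWithRetry(Callable<T> backupRpc,
                                        long timeoutMillis,
                                        long retryDelayMillis,
                                        BooleanSupplier siteOffline) throws Exception {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (true) {
            // Stop immediately if the backup site has been taken offline.
            if (siteOffline.getAsBoolean()) {
                throw new IllegalStateException("Backup site is offline");
            }
            try {
                return backupRpc.call();
            } catch (SuspectException | UnreachableException e) {
                long remainingNanos = deadline - System.nanoTime();
                if (remainingNanos <= 0) {
                    // Timeout expired: give up and surface the last failure.
                    throw e;
                }
                // Wait before the next attempt, but never past the deadline.
                long remainingMillis = TimeUnit.NANOSECONDS.toMillis(remainingNanos);
                Thread.sleep(Math.min(retryDelayMillis, remainingMillis));
            }
        }
    }
}
```

Any other exception is propagated without a retry, so only the failures from phases 1 and 2 are affected. Whether a missing response (2.2) should also trigger a retry is left open here, as it is above.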
Async backup probably needs retrying as well, and perhaps even a more sophisticated approach like I-RAC (