Type: Bug
Resolution: Obsolete
Priority: Major
Fix Version/s: None
Affects Version/s: 7.2.3.Final
Component/s: Core, Cross-Site Replication
Labels: None

There are 3 phases in a backup RPC:

1. Sender -> Local site master: fails when the site master is shutting down or crashing, or when there is a network split.
2. Local site master -> Remote site master:
2.1. Local site master is no longer a site master, e.g. because it's shutting down or because it's no longer coordinator after a merge.
2.2. Remote site master is no longer a site master.
2.3. Link between local site and remote site is down.
3. Remote site master -> Backup targets

Replication failures in phase 3 are handled by retrying (except for TimeoutExceptions), because BaseBackupReceiver uses regular cache methods to perform the updates.

But replication failures in phases 1 and 2 are not handled in any way, except for causing the remote site to be taken offline after a certain number of replication failures (if backup is synchronous). We should instead retry backup RPCs when we get a SuspectException or UnreachableException, and perhaps even when we get no response (2.2?), and only stop when the timeout expires or when the backup is taken offline.
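The retry behaviour proposed above could look roughly like the following sketch. It is not Infinispan code: `BackupRetrySketch`, `invokeWithRetry`, `BackupFailedException`, the `siteOffline` check and the pause between attempts are all hypothetical, and the two nested exception classes are local stand-ins for the real SuspectException and UnreachableException so that the example compiles on its own.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;
import java.util.function.Supplier;

class BackupRetrySketch {
    // Hypothetical stand-ins for the real exception types, defined here only
    // so that the sketch is self-contained.
    static class SuspectException extends RuntimeException {
        SuspectException(String msg) { super(msg); }
    }
    static class UnreachableException extends RuntimeException {
        UnreachableException(String msg) { super(msg); }
    }
    // Hypothetical exception reporting that the backup was given up on.
    static class BackupFailedException extends RuntimeException {
        BackupFailedException(String msg, Throwable cause) { super(msg, cause); }
    }

    /**
     * Invokes backupRpc, retrying on SuspectException/UnreachableException
     * until it succeeds, the timeout expires, or the site is taken offline.
     * Any other exception is propagated to the caller without a retry.
     */
    static <T> T invokeWithRetry(Supplier<T> backupRpc, long timeoutMillis,
                                 long pauseMillis, BooleanSupplier siteOffline) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        RuntimeException last = null;
        while (true) {
            if (siteOffline.getAsBoolean()) {
                throw new BackupFailedException("backup site was taken offline", last);
            }
            try {
                return backupRpc.get();
            } catch (SuspectException | UnreachableException e) {
                last = e; // retryable failure (phases 1 and 2)
            }
            long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0) {
                throw new BackupFailedException("backup RPC timed out", last);
            }
            // Sleep for the pause or the remaining time, whichever is shorter.
            long sleepMillis = Math.max(1,
                    Math.min(pauseMillis, TimeUnit.NANOSECONDS.toMillis(remainingNanos)));
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                throw new BackupFailedException("interrupted while retrying", ie);
            }
        }
    }
}
```

A caller would wrap the actual backup invocation in the supplier, e.g. `invokeWithRetry(() -> sendBackup(command), timeout, 100, () -> isOffline(site))`, where `sendBackup` and `isOffline` are again placeholders. The "no response" case (2.2?) is not covered here; it would need its own retryable signal.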

Async backup probably needs retrying as well, and perhaps even a more sophisticated approach like I-RAC (ISPN-2634).

is related to: ISPN-2634 Implement cross-site replication based on I-RAC (Reliably Asynchronous Clustering) (Closed)

relates to: JGRP-1927 RELAY2: Delays during shutdown (Resolved)

Assignee: Pedro Ruivo
Reporter: Dan Berindei (Inactive)
Archiver: Amol Dongare
Created: 2015/06/22 6:49 AM
Updated: 2023/05/25 1:41 PM
Resolved: 2023/05/25 1:41 PM
Archived: 2024/11/28 6:21 AM
