Bug
Resolution: Done
Critical
Description of problem:
We have three clusters: a hub and two managed clusters. We use VolSync to replicate content from the source to the destination cluster. Initially, we create a VolSync ReplicationSource on Cluster1 and a VolSync ReplicationDestination on Cluster2. The sync from Cluster1 to Cluster2 is successful and everything works as expected.
Next, we switch roles by deleting the ReplicationSource on Cluster1 along with its associated resources and recreating it on Cluster2. Similarly, the ReplicationDestination is moved from Cluster2 to Cluster1. We are using ClusterIP, and the ReplicationDestination creates the ServiceExport and owns it along with the Service. When we delete the ReplicationDestination to move it to the other cluster, the ServiceExport and the Service are automatically garbage collected.
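As a quick check on the old destination cluster, that garbage collection can be verified with a command like the one below. The namespace and Service name are taken from the outputs later in this report, and per the MCS API the ServiceExport shares the name of the Service it exports:
oc -n fb-helm-mixed-2 get service,serviceexport volsync-rsync-tls-dst-filebrowser-rootdir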
The issue starts when we move the ReplicationDestination from one cluster to the other. The sync stops and the ReplicationSource (on the source cluster) starts failing with a connection error.
Through trial and error, we were able to find the cause. When inspecting the Endpoints and EndpointSlices, we noticed suspicious output, as shown below. After deleting all Endpoints and EndpointSlices in the namespace, everything started working again.
oc get endpointslices
NAME                                              ADDRESSTYPE   PORTS   ENDPOINTS       AGE
filebrowser-vnbbq                                 IPv4          8080    10.254.24.122   53m
volsync-rsync-tls-dst-filebrowser-rootdir-b2w6r   IPv4          8000    242.1.255.243   52m
volsync-rsync-tls-dst-filebrowser-rootdir-wxpm6   IPv4          8000    242.0.255.244   51m
What is suspicious about the output above is the extra EndpointSlice for the same VolSync service in the namespace. Here is the YAML output for them:
oc get endpointslices -o yaml
apiVersion: v1
items:
- addressType: IPv4
  apiVersion: discovery.k8s.io/v1
  endpoints:
  - addresses:
    - 10.254.24.122
    conditions:
      ready: true
      serving: true
      terminating: false
    nodeName: worker1.rdr-blue-site-svl-2.cp.fyre.ibm.com
    targetRef:
      kind: Pod
      name: filebrowser-5bbd4769bd-pjn7r
      namespace: fb-helm-mixed-2
      uid: 5888ea29-cd7b-4b61-a542-baa4369930e9
  kind: EndpointSlice
  metadata:
    annotations:
      endpoints.kubernetes.io/last-change-trigger-time: "2025-02-14T16:35:21Z"
    creationTimestamp: "2025-02-14T16:34:35Z"
    generateName: filebrowser-
    generation: 3
    labels:
      app.kubernetes.io/instance: filebrowser
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/name: filebrowser
      app.kubernetes.io/version: v2.23.0
      endpointslice.kubernetes.io/managed-by: endpointslice-controller.k8s.io
      helm.sh/chart: filebrowser-1.0.0
      kubernetes.io/service-name: filebrowser
      velero.io/backup-name: openshift-dr-ops--filebrowser-fb-helm-mixed-2--1--capturec9a355
      velero.io/restore-name: openshift-dr-ops--filebrowser-fb-helm-mixed-2--0
    name: filebrowser-vnbbq
    namespace: fb-helm-mixed-2
    ownerReferences:
    - apiVersion: v1
      blockOwnerDeletion: true
      controller: true
      kind: Service
      name: filebrowser
      uid: 23588b39-5628-4e96-9e9f-3a9f6c3c682f
    resourceVersion: "80215354"
    uid: e94a2740-8c8c-45f7-a434-6a510dadab72
  ports:
  - name: http
    port: 8080
    protocol: TCP
- addressType: IPv4
  apiVersion: discovery.k8s.io/v1
  endpoints:
  - addresses:
    - 242.1.255.243
    conditions:
      ready: true
  kind: EndpointSlice
  metadata:
    annotations:
      lighthouse.submariner.io/globalnet-enabled: "false"
      lighthouse.submariner.io/publish-not-ready-addresses: "false"
    creationTimestamp: "2025-02-14T16:35:10Z"
    generation: 1
    labels:
      endpointslice.kubernetes.io/managed-by: lighthouse-agent.submariner.io
      lighthouse.submariner.io/is-headless: "false"
      lighthouse.submariner.io/sourceNamespace: fb-helm-mixed-2
      multicluster.kubernetes.io/service-name: volsync-rsync-tls-dst-filebrowser-rootdir
      multicluster.kubernetes.io/source-cluster: rdr-blue-site-svl-2
      submariner-io/clusterID: rdr-blue-site-svl-2
      submariner-io/originatingNamespace: fb-helm-mixed-2
      velero.io/backup-name: openshift-dr-ops--filebrowser-fb-helm-mixed-2--1--capturec9a355
      velero.io/restore-name: openshift-dr-ops--filebrowser-fb-helm-mixed-2--0
    name: volsync-rsync-tls-dst-filebrowser-rootdir-b2w6r
    namespace: fb-helm-mixed-2
    resourceVersion: "80215119"
    uid: 87fd435a-15e3-41f2-9ac4-558b37508e45
  ports:
  - name: rsync-tls
    port: 8000
    protocol: TCP
- addressType: IPv4
  apiVersion: discovery.k8s.io/v1
  endpoints:
  - addresses:
    - 242.0.255.244
    conditions:
      ready: true
  kind: EndpointSlice
  metadata:
    annotations:
      lighthouse.submariner.io/globalnet-enabled: "false"
      lighthouse.submariner.io/publish-not-ready-addresses: "false"
    creationTimestamp: "2025-02-14T16:35:38Z"
    generation: 2
    labels:
      endpointslice.kubernetes.io/managed-by: lighthouse-agent.submariner.io
      lighthouse.submariner.io/is-headless: "false"
      lighthouse.submariner.io/sourceNamespace: fb-helm-mixed-2
      multicluster.kubernetes.io/service-name: volsync-rsync-tls-dst-filebrowser-rootdir
      multicluster.kubernetes.io/source-cluster: rdr-blue-site-svl-1
      submariner-io/clusterID: rdr-blue-site-svl-1
      submariner-io/originatingNamespace: fb-helm-mixed-2
    name: volsync-rsync-tls-dst-filebrowser-rootdir-wxpm6
    namespace: fb-helm-mixed-2
    resourceVersion: "80215610"
    uid: 972c6cd2-88ff-461e-a9a6-73826de42a14
  ports:
  - name: rsync-tls
    port: 8000
    protocol: TCP
kind: List
metadata:
  resourceVersion: ""
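The duplication stands out more when the listing is narrowed to the Lighthouse-managed slices for the VolSync Service, using the labels visible above (namespace and service name are the ones from this report):
oc -n fb-helm-mixed-2 get endpointslices \
  -l endpointslice.kubernetes.io/managed-by=lighthouse-agent.submariner.io,multicluster.kubernetes.io/service-name=volsync-rsync-tls-dst-filebrowser-rootdir \
  -L multicluster.kubernetes.io/source-cluster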
Here is the Endpoints resource:
oc get endpoints -o yaml
apiVersion: v1
items:
- apiVersion: v1
  kind: Endpoints
  metadata:
    annotations:
      endpoints.kubernetes.io/last-change-trigger-time: "2025-02-14T16:35:21Z"
    creationTimestamp: "2025-02-14T16:34:35Z"
    labels:
      app.kubernetes.io/instance: filebrowser
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/name: filebrowser
      app.kubernetes.io/version: v2.23.0
      helm.sh/chart: filebrowser-1.0.0
      velero.io/backup-name: openshift-dr-ops--filebrowser-fb-helm-mixed-2--1--capturec9a355
      velero.io/restore-name: openshift-dr-ops--filebrowser-fb-helm-mixed-2--0
    name: filebrowser
    namespace: fb-helm-mixed-2
    resourceVersion: "80215353"
    uid: 7ed10262-1821-4538-8bee-1f4348ad5e80
  subsets:
  - addresses:
    - ip: 10.254.24.122
      nodeName: worker1.rdr-blue-site-svl-2.cp.fyre.ibm.com
      targetRef:
        kind: Pod
        name: filebrowser-5bbd4769bd-pjn7r
        namespace: fb-helm-mixed-2
        uid: 5888ea29-cd7b-4b61-a542-baa4369930e9
    ports:
    - name: http
      port: 8080
      protocol: TCP
kind: List
metadata:
  resourceVersion: ""
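For comparison, listing the Services and ServiceExports in the namespace (a sketch; the serviceexport resource is only available where Submariner has installed the MCS CRDs) shows whether a local volsync-rsync-tls-dst-filebrowser-rootdir Service still exists to back the Lighthouse slices above:
oc -n fb-helm-mixed-2 get services,serviceexports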
More info on Slack: https://redhat-internal.slack.com/archives/C0134E73VH6/p1739540978916319
Version-Release number of selected component (if applicable):
Showing versions
COMPONENT                       REPOSITORY           CONFIGURED   RUNNING                     ARCH
submariner-gateway              quay.io/submariner   0.19.0       release-0.19-63bfdce6ad6e   amd64
submariner-routeagent           quay.io/submariner   0.19.0       release-0.19-63bfdce6ad6e   amd64
submariner-globalnet            quay.io/submariner   0.19.0       release-0.19-63bfdce6ad6e   amd64
submariner-metrics-proxy        quay.io/submariner   0.19.0       release-0.19-35f346829412   amd64
submariner-operator             quay.io/submariner   0.19.0       release-0.19-9640c944b134   amd64
submariner-lighthouse-agent     quay.io/submariner   0.19.0       release-0.19-4cd926616639   amd64
submariner-lighthouse-coredns   quay.io/submariner   0.19.0       release-0.19-4cd926616639   amd64
How reproducible:
Steps to Reproduce:
- Set up RDR
- Deploy an application that uses CephFS volumes
- Failover/Relocate the application to the other cluster
Actual results:
After failover, the sync stops.
Expected results:
After failover, the sync should start again from the new source to the new destination.
Additional info:
To work around the issue, delete the EndpointSlices in the affected namespace.
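A minimal sketch of that workaround, using the namespace from this report (substitute the affected application's namespace); the description above also deleted the Endpoints objects, so that step is included as well:
# Namespace of the affected application (example value from this report).
NS=fb-helm-mixed-2
# Delete the discovery objects; the endpointslice controller and, for exported
# services, the Lighthouse agent are expected to recreate the entries that are
# still backed by live Services, so only the stale ones stay gone.
oc -n "$NS" delete endpointslices --all
oc -n "$NS" delete endpoints --all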