Bug
Resolution: Done
Critical
Description of problem:
We have 3 clusters: a hub and 2 managed clusters. We use VolSync to replicate content from the source cluster to the destination cluster. Initially, we create a VolSync ReplicationSource on Cluster1 and a VolSync ReplicationDestination on Cluster2. The sync from Cluster1 to Cluster2 is successful and everything works as expected.
Next, we switch roles by deleting the ReplicationSource on Cluster1 along with its associated resources and recreating it on Cluster2. Similarly, the ReplicationDestination is moved from Cluster2 to Cluster1. We are using ClusterIP, and the ReplicationDestination is responsible for creating the ServiceExport and owns it along with the Service. When we delete the ReplicationDestination to move it to the other cluster, the ServiceExport and the Service are automatically garbage collected.
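For reference, a rough sketch of the kind of ReplicationDestination involved (a minimal example only; the name, capacity, and copyMethod are illustrative, and in this setup the real resources are created by the DR orchestration). The volsync-rsync-tls-dst-* Service/ServiceExport seen later in the EndpointSlice output follow from this resource:
apiVersion: volsync.backube/v1alpha1
kind: ReplicationDestination
metadata:
  name: filebrowser-rootdir        # illustrative; inferred from the service name volsync-rsync-tls-dst-filebrowser-rootdir
  namespace: fb-helm-mixed-2
spec:
  rsyncTLS:
    copyMethod: Snapshot           # illustrative
    capacity: 10Gi                 # illustrative
    accessModes: ["ReadWriteMany"]
    serviceType: ClusterIP         # the resulting Service is exported to the other cluster via a ServiceExport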
The issue starts when we move the ReplicationDestination from one cluster to the other. The sync stops, and the ReplicationSource (on the source cluster) starts failing with a connection error.
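The failure can be observed on the source cluster, for example (the ReplicationSource name filebrowser-rootdir is inferred from the service name in the output below and may differ):
oc -n fb-helm-mixed-2 get replicationsource filebrowser-rootdir -o yaml   # check .status.conditions for the failing sync
oc -n fb-helm-mixed-2 get pods                                            # locate the rsync-tls mover pod and check its logs for the connection error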
Through trial and error, we were able to find the reason for the issue. When inspecting the Endpoints and EndpointSlices, we noticed suspicious output, as shown below. After deleting all Endpoints and EndpointSlices in the namespace, everything started working again.
oc get endpointslices
NAME                                              ADDRESSTYPE   PORTS   ENDPOINTS       AGE
filebrowser-vnbbq                                 IPv4          8080    10.254.24.122   53m
volsync-rsync-tls-dst-filebrowser-rootdir-b2w6r   IPv4          8000    242.1.255.243   52m
volsync-rsync-tls-dst-filebrowser-rootdir-wxpm6   IPv4          8000    242.0.255.244   51m
What is suspicious about the output above is the extra EndpointSlice for the same VolSync service in the namespace: there are two volsync-rsync-tls-dst-filebrowser-rootdir slices, one originating from each cluster. Here is the YAML output for them:
oc get endpointslices -o yaml
apiVersion: v1
items:
- addressType: IPv4
  apiVersion: discovery.k8s.io/v1
  endpoints:
  - addresses:
    - 10.254.24.122
    conditions:
      ready: true
      serving: true
      terminating: false
    nodeName: worker1.rdr-blue-site-svl-2.cp.fyre.ibm.com
    targetRef:
      kind: Pod
      name: filebrowser-5bbd4769bd-pjn7r
      namespace: fb-helm-mixed-2
      uid: 5888ea29-cd7b-4b61-a542-baa4369930e9
  kind: EndpointSlice
  metadata:
    annotations:
      endpoints.kubernetes.io/last-change-trigger-time: "2025-02-14T16:35:21Z"
    creationTimestamp: "2025-02-14T16:34:35Z"
    generateName: filebrowser-
    generation: 3
    labels:
      app.kubernetes.io/instance: filebrowser
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/name: filebrowser
      app.kubernetes.io/version: v2.23.0
      endpointslice.kubernetes.io/managed-by: endpointslice-controller.k8s.io
      helm.sh/chart: filebrowser-1.0.0
      kubernetes.io/service-name: filebrowser
      velero.io/backup-name: openshift-dr-ops--filebrowser-fb-helm-mixed-2--1--capturec9a355
      velero.io/restore-name: openshift-dr-ops--filebrowser-fb-helm-mixed-2--0
    name: filebrowser-vnbbq
    namespace: fb-helm-mixed-2
    ownerReferences:
    - apiVersion: v1
      blockOwnerDeletion: true
      controller: true
      kind: Service
      name: filebrowser
      uid: 23588b39-5628-4e96-9e9f-3a9f6c3c682f
    resourceVersion: "80215354"
    uid: e94a2740-8c8c-45f7-a434-6a510dadab72
  ports:
  - name: http
    port: 8080
    protocol: TCP
- addressType: IPv4
  apiVersion: discovery.k8s.io/v1
  endpoints:
  - addresses:
    - 242.1.255.243
    conditions:
      ready: true
  kind: EndpointSlice
  metadata:
    annotations:
      lighthouse.submariner.io/globalnet-enabled: "false"
      lighthouse.submariner.io/publish-not-ready-addresses: "false"
    creationTimestamp: "2025-02-14T16:35:10Z"
    generation: 1
    labels:
      endpointslice.kubernetes.io/managed-by: lighthouse-agent.submariner.io
      lighthouse.submariner.io/is-headless: "false"
      lighthouse.submariner.io/sourceNamespace: fb-helm-mixed-2
      multicluster.kubernetes.io/service-name: volsync-rsync-tls-dst-filebrowser-rootdir
      multicluster.kubernetes.io/source-cluster: rdr-blue-site-svl-2
      submariner-io/clusterID: rdr-blue-site-svl-2
      submariner-io/originatingNamespace: fb-helm-mixed-2
      velero.io/backup-name: openshift-dr-ops--filebrowser-fb-helm-mixed-2--1--capturec9a355
      velero.io/restore-name: openshift-dr-ops--filebrowser-fb-helm-mixed-2--0
    name: volsync-rsync-tls-dst-filebrowser-rootdir-b2w6r
    namespace: fb-helm-mixed-2
    resourceVersion: "80215119"
    uid: 87fd435a-15e3-41f2-9ac4-558b37508e45
  ports:
  - name: rsync-tls
    port: 8000
    protocol: TCP
- addressType: IPv4
  apiVersion: discovery.k8s.io/v1
  endpoints:
  - addresses:
    - 242.0.255.244
    conditions:
      ready: true
  kind: EndpointSlice
  metadata:
    annotations:
      lighthouse.submariner.io/globalnet-enabled: "false"
      lighthouse.submariner.io/publish-not-ready-addresses: "false"
    creationTimestamp: "2025-02-14T16:35:38Z"
    generation: 2
    labels:
      endpointslice.kubernetes.io/managed-by: lighthouse-agent.submariner.io
      lighthouse.submariner.io/is-headless: "false"
      lighthouse.submariner.io/sourceNamespace: fb-helm-mixed-2
      multicluster.kubernetes.io/service-name: volsync-rsync-tls-dst-filebrowser-rootdir
      multicluster.kubernetes.io/source-cluster: rdr-blue-site-svl-1
      submariner-io/clusterID: rdr-blue-site-svl-1
      submariner-io/originatingNamespace: fb-helm-mixed-2
    name: volsync-rsync-tls-dst-filebrowser-rootdir-wxpm6
    namespace: fb-helm-mixed-2
    resourceVersion: "80215610"
    uid: 972c6cd2-88ff-461e-a9a6-73826de42a14
  ports:
  - name: rsync-tls
    port: 8000
    protocol: TCP
kind: List
metadata:
  resourceVersion: ""
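The two Lighthouse-managed slices point at the same exported service but come from different clusters; they can be isolated with the labels shown above, e.g.:
oc -n fb-helm-mixed-2 get endpointslices \
  -l multicluster.kubernetes.io/service-name=volsync-rsync-tls-dst-filebrowser-rootdir \
  -L multicluster.kubernetes.io/source-cluster
Only one slice per exported service is expected here; the second one is left over from the cluster that previously hosted the ReplicationDestination.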
Here is the Endpoints resource:
oc get endpoints -o yaml
apiVersion: v1
items:
- apiVersion: v1
  kind: Endpoints
  metadata:
    annotations:
      endpoints.kubernetes.io/last-change-trigger-time: "2025-02-14T16:35:21Z"
    creationTimestamp: "2025-02-14T16:34:35Z"
    labels:
      app.kubernetes.io/instance: filebrowser
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/name: filebrowser
      app.kubernetes.io/version: v2.23.0
      helm.sh/chart: filebrowser-1.0.0
      velero.io/backup-name: openshift-dr-ops--filebrowser-fb-helm-mixed-2--1--capturec9a355
      velero.io/restore-name: openshift-dr-ops--filebrowser-fb-helm-mixed-2--0
    name: filebrowser
    namespace: fb-helm-mixed-2
    resourceVersion: "80215353"
    uid: 7ed10262-1821-4538-8bee-1f4348ad5e80
  subsets:
  - addresses:
    - ip: 10.254.24.122
      nodeName: worker1.rdr-blue-site-svl-2.cp.fyre.ibm.com
      targetRef:
        kind: Pod
        name: filebrowser-5bbd4769bd-pjn7r
        namespace: fb-helm-mixed-2
        uid: 5888ea29-cd7b-4b61-a542-baa4369930e9
    ports:
    - name: http
      port: 8080
      protocol: TCP
kind: List
metadata:
  resourceVersion: ""
More info on Slack: https://redhat-internal.slack.com/archives/C0134E73VH6/p1739540978916319
Version-Release number of selected component (if applicable):
Showing versions
COMPONENT                       REPOSITORY           CONFIGURED   RUNNING                     ARCH
submariner-gateway              quay.io/submariner   0.19.0       release-0.19-63bfdce6ad6e   amd64
submariner-routeagent           quay.io/submariner   0.19.0       release-0.19-63bfdce6ad6e   amd64
submariner-globalnet            quay.io/submariner   0.19.0       release-0.19-63bfdce6ad6e   amd64
submariner-metrics-proxy        quay.io/submariner   0.19.0       release-0.19-35f346829412   amd64
submariner-operator             quay.io/submariner   0.19.0       release-0.19-9640c944b134   amd64
submariner-lighthouse-agent     quay.io/submariner   0.19.0       release-0.19-4cd926616639   amd64
submariner-lighthouse-coredns   quay.io/submariner   0.19.0       release-0.19-4cd926616639   amd64
How reproducible:
Steps to Reproduce:
- Set up RDR
- Deploy an application that uses CephFS volumes
- Fail over/relocate the application to the other cluster (see the sketch below)
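A rough sketch of the failover step, driven from the hub, assuming the application is protected by a DRPlacementControl named filebrowser-drpc in the application namespace (both names are assumptions; the target cluster name is taken from the output above):
oc -n fb-helm-mixed-2 patch drpc filebrowser-drpc --type merge \
  -p '{"spec": {"action": "Failover", "failoverCluster": "rdr-blue-site-svl-1"}}'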
Actual results:
After failover, the sync stops.
Expected results:
After failover, the sync should start again from the new source to the new destination.
Additional info:
To work around the issue, delete the stale EndpointSlices in the affected namespace.
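For example, to remove only the stale Lighthouse-managed slice left over from the cluster that no longer hosts the ReplicationDestination (label values taken from the output above; the cluster name is a placeholder):
oc -n fb-helm-mixed-2 delete endpointslice \
  -l endpointslice.kubernetes.io/managed-by=lighthouse-agent.submariner.io,multicluster.kubernetes.io/source-cluster=<old-destination-cluster>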