ACM-18010: VolSync Sync Fails After Failover Due to Stale EndpointSlices in Submariner Environment


      Description of problem:

      We have three clusters: a hub and two managed clusters. We use VolSync to replicate content from the source cluster to the destination cluster. Initially, we create a VolSync ReplicationSource on Cluster1 and a VolSync ReplicationDestination on Cluster2. The sync from Cluster1 to Cluster2 is successful and everything works as expected.
      Next, we switch roles: the ReplicationSource and its associated resources are deleted on Cluster1 and recreated on Cluster2, and the ReplicationDestination is likewise moved from Cluster2 to Cluster1. We are using ClusterIP, and the ReplicationDestination is responsible for creating the ServiceExport and owns it along with the Service. When we delete the ReplicationDestination to move it to the other cluster, the ServiceExport and the Service are automatically garbage collected.
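
      For reference, the two CRs involved look roughly like the sketch below. The namespace and the derived service name (volsync-rsync-tls-dst-filebrowser-rootdir) are taken from this report; all other field values, the keySecret name, and the clusterset.local address are illustrative assumptions rather than copies of the actual resources.

      # Destination side: VolSync derives the volsync-rsync-tls-dst-filebrowser-rootdir
      # Service from this CR and, in this setup, also creates and owns its ServiceExport.
      apiVersion: volsync.backube/v1alpha1
      kind: ReplicationDestination
      metadata:
        name: filebrowser-rootdir
        namespace: fb-helm-mixed-2
      spec:
        rsyncTLS:
          copyMethod: Snapshot                          # illustrative
          capacity: 10Gi                                # illustrative
          accessModes: [ReadWriteMany]
          storageClassName: ocs-storagecluster-cephfs   # illustrative
          serviceType: ClusterIP
      ---
      # Source side: connects to the destination across clusters via the
      # Submariner-exported service name (assumed clusterset.local form).
      apiVersion: volsync.backube/v1alpha1
      kind: ReplicationSource
      metadata:
        name: filebrowser-rootdir
        namespace: fb-helm-mixed-2
      spec:
        sourcePVC: filebrowser-rootdir                  # assumed PVC name
        trigger:
          schedule: "*/10 * * * *"                      # illustrative
        rsyncTLS:
          copyMethod: Snapshot                          # illustrative
          keySecret: volsync-rsync-tls-filebrowser-rootdir   # assumed secret name
          address: volsync-rsync-tls-dst-filebrowser-rootdir.fb-helm-mixed-2.svc.clusterset.local
          port: 8000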

      The issue starts when we move the ReplicationDestination from one cluster to the other. The sync stops, and the ReplicationSource (on the source cluster) starts failing with a connection error.

      Through trial and error, we were able to find the reason for the issue. When inspecting the Endpoints and EndpointSlices, we noticed suspicious output, as shown below. After deleting all Endpoints and EndpointSlices in the namespace, everything started working again.

      oc get endpointslices        
      NAME                                              ADDRESSTYPE   PORTS   ENDPOINTS       AGE
      filebrowser-vnbbq                                 IPv4          8080    10.254.24.122   53m
      volsync-rsync-tls-dst-filebrowser-rootdir-b2w6r   IPv4          8000    242.1.255.243   52m
      volsync-rsync-tls-dst-filebrowser-rootdir-wxpm6   IPv4          8000    242.0.255.244   51m
      

      What is suspicious about the output above is the duplicate entry for the VolSync destination service: there are two EndpointSlices for volsync-rsync-tls-dst-filebrowser-rootdir in the same namespace. Here is the YAML output for them:

      oc get endpointslices -o yaml
      apiVersion: v1
      items:
      - addressType: IPv4
        apiVersion: discovery.k8s.io/v1
        endpoints:
        - addresses:
          - 10.254.24.122
          conditions:
            ready: true
            serving: true
            terminating: false
          nodeName: worker1.rdr-blue-site-svl-2.cp.fyre.ibm.com
          targetRef:
            kind: Pod
            name: filebrowser-5bbd4769bd-pjn7r
            namespace: fb-helm-mixed-2
            uid: 5888ea29-cd7b-4b61-a542-baa4369930e9
        kind: EndpointSlice
        metadata:
          annotations:
            endpoints.kubernetes.io/last-change-trigger-time: "2025-02-14T16:35:21Z"
          creationTimestamp: "2025-02-14T16:34:35Z"
          generateName: filebrowser-
          generation: 3
          labels:
            app.kubernetes.io/instance: filebrowser
            app.kubernetes.io/managed-by: Helm
            app.kubernetes.io/name: filebrowser
            app.kubernetes.io/version: v2.23.0
            endpointslice.kubernetes.io/managed-by: endpointslice-controller.k8s.io
            helm.sh/chart: filebrowser-1.0.0
            kubernetes.io/service-name: filebrowser
            velero.io/backup-name: openshift-dr-ops--filebrowser-fb-helm-mixed-2--1--capturec9a355
            velero.io/restore-name: openshift-dr-ops--filebrowser-fb-helm-mixed-2--0
          name: filebrowser-vnbbq
          namespace: fb-helm-mixed-2
          ownerReferences:
          - apiVersion: v1
            blockOwnerDeletion: true
            controller: true
            kind: Service
            name: filebrowser
            uid: 23588b39-5628-4e96-9e9f-3a9f6c3c682f
          resourceVersion: "80215354"
          uid: e94a2740-8c8c-45f7-a434-6a510dadab72
        ports:
        - name: http
          port: 8080
          protocol: TCP
      - addressType: IPv4
        apiVersion: discovery.k8s.io/v1
        endpoints:
        - addresses:
          - 242.1.255.243
          conditions:
            ready: true
        kind: EndpointSlice
        metadata:
          annotations:
            lighthouse.submariner.io/globalnet-enabled: "false"
            lighthouse.submariner.io/publish-not-ready-addresses: "false"
          creationTimestamp: "2025-02-14T16:35:10Z"
          generation: 1
          labels:
            endpointslice.kubernetes.io/managed-by: lighthouse-agent.submariner.io
            lighthouse.submariner.io/is-headless: "false"
            lighthouse.submariner.io/sourceNamespace: fb-helm-mixed-2
            multicluster.kubernetes.io/service-name: volsync-rsync-tls-dst-filebrowser-rootdir
            multicluster.kubernetes.io/source-cluster: rdr-blue-site-svl-2
            submariner-io/clusterID: rdr-blue-site-svl-2
            submariner-io/originatingNamespace: fb-helm-mixed-2
            velero.io/backup-name: openshift-dr-ops--filebrowser-fb-helm-mixed-2--1--capturec9a355
            velero.io/restore-name: openshift-dr-ops--filebrowser-fb-helm-mixed-2--0
          name: volsync-rsync-tls-dst-filebrowser-rootdir-b2w6r
          namespace: fb-helm-mixed-2
          resourceVersion: "80215119"
          uid: 87fd435a-15e3-41f2-9ac4-558b37508e45
        ports:
        - name: rsync-tls
          port: 8000
          protocol: TCP
      - addressType: IPv4
        apiVersion: discovery.k8s.io/v1
        endpoints:
        - addresses:
          - 242.0.255.244
          conditions:
            ready: true
        kind: EndpointSlice
        metadata:
          annotations:
            lighthouse.submariner.io/globalnet-enabled: "false"
            lighthouse.submariner.io/publish-not-ready-addresses: "false"
          creationTimestamp: "2025-02-14T16:35:38Z"
          generation: 2
          labels:
            endpointslice.kubernetes.io/managed-by: lighthouse-agent.submariner.io
            lighthouse.submariner.io/is-headless: "false"
            lighthouse.submariner.io/sourceNamespace: fb-helm-mixed-2
            multicluster.kubernetes.io/service-name: volsync-rsync-tls-dst-filebrowser-rootdir
            multicluster.kubernetes.io/source-cluster: rdr-blue-site-svl-1
            submariner-io/clusterID: rdr-blue-site-svl-1
            submariner-io/originatingNamespace: fb-helm-mixed-2
          name: volsync-rsync-tls-dst-filebrowser-rootdir-wxpm6
          namespace: fb-helm-mixed-2
          resourceVersion: "80215610"
          uid: 972c6cd2-88ff-461e-a9a6-73826de42a14
        ports:
        - name: rsync-tls
          port: 8000
          protocol: TCP
      kind: List
      metadata:
        resourceVersion: ""
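
      Note that both Lighthouse-managed EndpointSlices above belong to the same exported service (multicluster.kubernetes.io/service-name: volsync-rsync-tls-dst-filebrowser-rootdir) but carry different multicluster.kubernetes.io/source-cluster labels (rdr-blue-site-svl-2 and rdr-blue-site-svl-1), so one of them is left over from before the failover. A quick way to surface such leftovers (a sketch, using the namespace from this report):

      oc -n fb-helm-mixed-2 get endpointslices \
        -l endpointslice.kubernetes.io/managed-by=lighthouse-agent.submariner.io \
        -L multicluster.kubernetes.io/source-cluster,multicluster.kubernetes.io/service-name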
      

      Here is the corresponding Endpoints resource:

      oc get endpoints -o yaml
      apiVersion: v1
      items:
      - apiVersion: v1
        kind: Endpoints
        metadata:
          annotations:
            endpoints.kubernetes.io/last-change-trigger-time: "2025-02-14T16:35:21Z"
          creationTimestamp: "2025-02-14T16:34:35Z"
          labels:
            app.kubernetes.io/instance: filebrowser
            app.kubernetes.io/managed-by: Helm
            app.kubernetes.io/name: filebrowser
            app.kubernetes.io/version: v2.23.0
            helm.sh/chart: filebrowser-1.0.0
            velero.io/backup-name: openshift-dr-ops--filebrowser-fb-helm-mixed-2--1--capturec9a355
            velero.io/restore-name: openshift-dr-ops--filebrowser-fb-helm-mixed-2--0
          name: filebrowser
          namespace: fb-helm-mixed-2
          resourceVersion: "80215353"
          uid: 7ed10262-1821-4538-8bee-1f4348ad5e80
        subsets:
        - addresses:
          - ip: 10.254.24.122
            nodeName: worker1.rdr-blue-site-svl-2.cp.fyre.ibm.com
            targetRef:
              kind: Pod
              name: filebrowser-5bbd4769bd-pjn7r
              namespace: fb-helm-mixed-2
              uid: 5888ea29-cd7b-4b61-a542-baa4369930e9
          ports:
          - name: http
            port: 8080
            protocol: TCP
      kind: List
      metadata:
        resourceVersion: ""
      

      More info on Slack: https://redhat-internal.slack.com/archives/C0134E73VH6/p1739540978916319

      Version-Release number of selected component (if applicable):

       Showing versions
      COMPONENT                       REPOSITORY           CONFIGURED   RUNNING                     ARCH
      submariner-gateway              quay.io/submariner   0.19.0       release-0.19-63bfdce6ad6e   amd64
      submariner-routeagent           quay.io/submariner   0.19.0       release-0.19-63bfdce6ad6e   amd64
      submariner-globalnet            quay.io/submariner   0.19.0       release-0.19-63bfdce6ad6e   amd64
      submariner-metrics-proxy        quay.io/submariner   0.19.0       release-0.19-35f346829412   amd64
      submariner-operator             quay.io/submariner   0.19.0       release-0.19-9640c944b134   amd64
      submariner-lighthouse-agent     quay.io/submariner   0.19.0       release-0.19-4cd926616639   amd64
      submariner-lighthouse-coredns   quay.io/submariner   0.19.0       release-0.19-4cd926616639   amd64

      How reproducible:

      Steps to Reproduce:

      1. Set up Regional DR (RDR) with a hub and two managed clusters
      2. Deploy an application that uses CephFS volumes
      3. Fail over/relocate the application to the other cluster (see the sketch after this list)
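
      For step 3, failover is typically initiated through the application's DRPlacementControl on the hub. A hypothetical sketch (resource and cluster names are placeholders, not taken from this report):

      oc patch drplacementcontrol <app-drpc> -n <app-namespace> --type merge \
        -p '{"spec":{"action":"Failover","failoverCluster":"<target-cluster>"}}'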

      Actual results:

      After failover, the sync stops.

      Expected results:

      After failover, the sync should start again from the new source to the new destination.

      Additional info: 

      To work around the issue, delete the EndpointSlices in the affected namespace.
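
      A minimal sketch of the workaround, using the namespace from this report (deleting all EndpointSlices is what was done here; limiting the deletion to the Lighthouse-managed slices is an untested, narrower variant):

      # Broad form: delete every EndpointSlice in the namespace; the endpoint slice
      # controller and the Lighthouse agent recreate the ones that are still valid.
      oc -n fb-helm-mixed-2 delete endpointslices --all

      # Narrower variant (assumption): only the Submariner/Lighthouse-managed slices.
      oc -n fb-helm-mixed-2 delete endpointslices \
        -l endpointslice.kubernetes.io/managed-by=lighthouse-agent.submariner.io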
