Data Foundation Bugs / DFBUGS-429

[2321994] [RDR] Noobaa S3 becomes unreachable after a few days, hence backup stops for discovered apps



      Description of problem (please be as detailed as possible and provide log
      snippets):

      Version of all relevant components (if applicable):
      OCP 4.17.0-0.nightly-2024-10-20-231827
      ODF 4.17.0-126
      ACM 2.12.0-DOWNSTREAM-2024-10-18-21-57-41
      OpenShift Virtualization 4.17.1-19
      Submariner 0.19 unreleased downstream image 846949
      ceph version 18.2.1-229.el9cp (ef652b206f2487adfc86613646a4cac946f6b4e0) reef (stable)
      OADP 1.4.1
      OpenShift GitOps 1.14.0
      VolSync 0.10.1

      Does this issue impact your ability to continue to work with the product
      (please explain in detail what is the user impact)?

      Is there any workaround available to the best of your knowledge?
      Yes. After restarting all Noobaa pods on both the C1 and C2 ODF clusters, backups resumed (a restart sketch follows).
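      A minimal restart sketch, assuming ODF is installed in the default
      openshift-storage namespace (pod names on a given cluster may differ):

        # Restart every Noobaa pod on a managed cluster (repeat on both C1 and C2).
        oc get pods -n openshift-storage --no-headers | grep noobaa | awk '{print $1}' \
          | xargs oc delete pod -n openshift-storage
        # After the pods are recreated and Running, confirm S3 responds again:
        oc get pods -n openshift-storage | grep noobaa
        s3cmd ls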

      Rate from 1 - 5 the complexity of the scenario you performed that caused this
      bug (1 - very simple, 5 - very complex)?

      Is this issue reproducible?

      Can this issue be reproduced from the UI?

      If this is a regression, please provide more details to justify this:

      Steps to Reproduce:
      1. On an ODF Regional DR setup, deploy a CNV workload as a discovered app using the data volume template workload from https://github.com/RamenDR/ocm-ramen-samples/tree/main/workloads/kubevirt/vm-dvt/odr-regional.
      2. Create a snapshot of the PVC and restore it as a new PVC (a YAML sketch follows these steps).
      3. Delete the snapshot and the workload, except the DataVolume and the PVC.
      4. Recreate the workload so that it consumes the existing snapshot-restored PVC; the VM should use this PVC rather than create a new one.
      5. Repeat the above steps in another namespace for another CNV workload, using a PVC clone instead of a snapshot.
      6. DR-protect these workloads by applying a unique label to the required resources (VM, DataVolume, PVC, and Secret) so they are backed up to the odrbucket created by Ramen on the primary managed cluster.
      7. During DR protection, verify that backups are taken every 5 minutes and that Noobaa S3 is accessible.
      8. Run I/Os for a few days (4-5 days in this case) and check whether regular backups are still being taken and Noobaa S3 remains accessible.
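      A minimal YAML sketch for the snapshot/restore in steps 2-4 is below. All object
      names, the namespace, the snapshot class, the storage class, and the size are
      illustrative assumptions and must be adjusted to the actual vm-dvt workload:

        # VolumeSnapshot: snapshot the workload PVC
        apiVersion: snapshot.storage.k8s.io/v1
        kind: VolumeSnapshot
        metadata:
          name: vm-dvt-snap
          namespace: vm-dvt
        spec:
          volumeSnapshotClassName: ocs-storagecluster-rbdplugin-snapclass
          source:
            persistentVolumeClaimName: vm-dvt-pvc
        ---
        # Restored PVC: the recreated VM consumes this PVC instead of provisioning a new one
        apiVersion: v1
        kind: PersistentVolumeClaim
        metadata:
          name: vm-dvt-pvc-restore
          namespace: vm-dvt
        spec:
          storageClassName: ocs-storagecluster-ceph-rbd
          accessModes:
            - ReadWriteMany
          resources:
            requests:
              storage: 30Gi
          dataSource:
            apiGroup: snapshot.storage.k8s.io
            kind: VolumeSnapshot
            name: vm-dvt-snap

      For step 6, the unique label can then be applied with something like
      "oc label pvc vm-dvt-pvc-restore appname=vm-dvt-app -n vm-dvt" (the key/value is
      hypothetical), repeating for the VM, DataVolume, and Secret.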

      Actual results: [RDR] Noobaa S3 becomes unreachable after a few days, hence backup stops for discovered apps

      Backups stopped around 25 Oct 2024; S3 was accessible before that.
      The cluster was idle during this time; no node-related or other operation was performed by me (or at least none that I am aware of).

      Must-gather logs from the setup: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/26oct24/
      where C1 and C2 are the ODF clusters
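      For reference, an ODF must-gather is typically collected with a command along
      these lines; the exact image and tag are an assumption for ODF 4.17:

        oc adm must-gather --image=registry.redhat.io/odf4/odf-must-gather-rhel9:v4.17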

      Hub/RHACM: where RHACM is installed but ODF is not

      From C1-

      s3cmd ls
      ERROR: Error parsing xml: Malformed error XML returned from remote server.. ErrorXML: b"<html><body><h1>504 Gateway Time-out</h1>\nThe server didn't respond in time.\n</body></html>\n"
      WARNING: Retrying failed request: / (504 (Gateway Time-out))
      WARNING: Waiting 3 sec...
      ^CSee ya!

      However, all the Noobaa pods were up and running on both the managed clusters.
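      A sketch of the checks behind this observation (pod status vs. S3 reachability);
      the s3 route name and the openshift-storage namespace are the ODF defaults and
      are assumptions here:

        # Noobaa pods report Running on both managed clusters...
        oc get pods -n openshift-storage | grep noobaa
        # ...yet the S3 endpoint behind the route still times out (504 in the failed state):
        curl -sk -o /dev/null -w '%{http_code}\n' \
          https://$(oc get route s3 -n openshift-storage -o jsonpath='{.spec.host}')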

      Expected results: Noobaa S3 should remain accessible even on long-running setups.

      Additional info:
