OpenShift API for Data Protection / OADP-533

OADP Backup via Ceph CSI snapshot hangs indefinitely on OpenShift v4.10

    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • Fix Version/s: OADP 1.1.0
    • Component/s: velero

      This is a mirror of BZ https://bugzilla.redhat.com/show_bug.cgi?id=2083353; the content of the BZ is copied here for reference.

       
      Description of problem (please be as detailed as possible and provide log snippets):

      We have encountered consistent hanging of CSI backups via Ceph on OpenShift v4.10.3. We have successfully taken 5 backups with Ceph CSI snapshots for 34+ PVCs via OADP v1.0.2, but the next backup always hangs indefinitely in the velero-plugin-for-csi plugin. There are no errors in the velero log, and the backup remains stuck in the InProgress state. The problem is consistent and repeatable on another OpenShift v4.10 cluster. (Kubernetes Version: v1.23.3+e419edf)

      The last log messages from that plugin in the velero log were:

      time="2022-04-16T00:15:21Z" level=info msg="Executing custom action" backup=oadp/ckpt-backup-1f logSource="pkg/backup/item_backupper.go:327" name=velero-zookeeper-data-zookeeper-0-2pjzv namespace=wkc resource=volumesnapshots.snapshot.storage.k8s.io
      time="2022-04-16T00:15:21Z" level=info msg="Executing VolumeSnapshotBackupItemAction" backup=oadp/ckpt-backup-1f cmd=/plugins/velero-plugin-for-csi logSource="/remote-source/app/internal/backup/volumesnapshot_action.go:58" pluginName=velero-plugin-for-csi
      time="2022-04-16T00:15:21Z" level=info msg="Getting VolumesnapshotContent for Volumesnapshot wkc/velero-zookeeper-data-zookeeper-0-2pjzv" backup=oadp/ckpt-backup-1f cmd=/plugins/velero-plugin-for-csi logSource="/remote-source/app/internal/backup/volumesnapshot_action.go:91" pluginName=velero-plugin-for-csi

      Version of all relevant components (if applicable):
      OADP version: v1.0.2
      Velero version: v1.7.1
      velero-plugin-for-csi version: 0.2.0

      oc version
      Client Version: 4.10.3
      Server Version: 4.10.3
      Kubernetes Version: v1.23.3+e419edf

      Does this issue impact your ability to continue to work with the product
      (please explain in detail what the user impact is)?
      Unable to perform further backups

      Is there any workaround available to the best of your knowledge?
      See explanation below:

      I debugged this further and got a stack dump of the velero-plugin-for-csi gRPC server, and it appears that the goroutine (shown below) was in an IO wait state. It was writing deprecation warnings from client-go via klog.Warning(), which normally writes to stderr. The warning messages were related to VolumeSnapshot and VolumeSnapshotContent being deprecated. (These warning messages were going into a pipe; see the explanation below.)

      W0422 22:59:43.573612 66770 warnings.go:70] snapshot.storage.k8s.io/v1beta1 VolumeSnapshot is deprecated; use snapshot.storage.k8s.io/v1 VolumeSnapshot
      W0422 22:55:37.607548 66737 warnings.go:70] snapshot.storage.k8s.io/v1beta1 VolumeSnapshotContent is deprecated; use snapshot.storage.k8s.io/v1 VolumeSnapshotContent

      At first it seems strange that the goroutine is blocked writing to stderr. After digging deeper into the go-plugin code (my understanding is that Velero uses go-plugin v1.0.1 as of this writing), it turns out that the go-plugin framework redirects both os.Stdout and os.Stderr to a pipe in its Serve() function:
      https://github.com/hashicorp/go-plugin/blob/9e3e1c37db188a1acb66561ee0ed4bf4d5e77554/server.go#L350
      However, in go-plugin v1.0.1 I don't see any code that consumes from those pipes, which explains why the plugin eventually blocked: the accumulated warning messages filled the pipe to its maximum capacity. I went through the go-plugin git issues and found this thread comment, which confirms my understanding:

      hashicorp/go-plugin#147 (comment)

      It appears this is a known issue that was addressed in go-plugin releases after v1.0.1.
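
      For illustration, here is a minimal standalone sketch (not OADP or go-plugin code, just a demo under the assumptions above) of the failure mode: a writer blocks forever once a pipe's kernel buffer fills and nothing drains the read end, which is what happens to the klog warnings accumulating in go-plugin's unconsumed stderr pipe.

      package main

      import (
          "fmt"
          "os"
          "time"
      )

      func main() {
          // A pipe whose read end is deliberately never drained, mimicking
          // go-plugin v1.0.1 redirecting os.Stderr to a pipe it never consumes.
          r, w, err := os.Pipe()
          if err != nil {
              panic(err)
          }
          defer r.Close()

          go func() {
              // Keep writing klog-style warning lines, as client-go does for
              // every call against the deprecated v1beta1 snapshot APIs.
              msg := []byte("W0422 warnings.go:70] snapshot.storage.k8s.io/v1beta1 VolumeSnapshot is deprecated\n")
              for {
                  if _, err := w.Write(msg); err != nil {
                      return
                  }
              }
          }()

          // Once the kernel pipe buffer (~64 KiB on Linux) is full, the writer
          // goroutine sits in IO wait, matching the goroutine dump in Additional info.
          time.Sleep(2 * time.Second)
          fmt.Println("writer goroutine is now blocked on the full pipe")
      }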

      This issue can easily occur on newer Kubernetes versions when multiple CSI backups are created over time: eventually a backup gets stuck and no further backups can be performed. Updating the CSI plugin to use snapshot.storage.k8s.io/v1, or installing a custom rest.WarningHandler, may be workarounds (see the sketch below). Nevertheless, this looks like a general Velero plugin issue: plugin processing can potentially hang because of the above findings, so it needs to be addressed at the plugin framework level.
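
      As a rough sketch of the custom rest.WarningHandler workaround (this assumes the plugin builds its snapshot clientset from a rest.Config; the package and helper names below are illustrative, not the plugin's actual code), the deprecation warnings can be discarded per client so they never reach go-plugin's stderr pipe:

      package clientutil

      import (
          snapshotclient "github.com/kubernetes-csi/external-snapshotter/client/v4/clientset/versioned"
          "k8s.io/client-go/rest"
      )

      // newSnapshotClient is a hypothetical helper showing where a custom
      // warning handler could be wired in.
      func newSnapshotClient(config *rest.Config) (*snapshotclient.Clientset, error) {
          cfg := rest.CopyConfig(config)
          // Discard deprecation warnings instead of logging them to stderr,
          // so they cannot pile up in the unconsumed pipe.
          cfg.WarningHandler = rest.NoWarnings{}
          return snapshotclient.NewForConfig(cfg)
      }

      Alternatively, rest.SetDefaultWarningHandler(rest.NoWarnings{}) suppresses the warnings process-wide. Either way, this only works around the symptom rather than the underlying go-plugin issue.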

      I have reported this to Velero at
      https://github.com/vmware-tanzu/velero/issues/4863
      and they have targeted a fix for v1.9.0. What is the plan for OADP?

      Rate from 1 - 5 the complexity of the scenario you performed that caused this
      bug (1 - very simple, 5 - very complex)?
      2

      Is this issue reproducible?
      Yes

      Can this issue be reproduced from the UI?

      If this is a regression, please provide more details to justify this:

      Steps to Reproduce:
      1. Perform multiple CSI snapshot backups using OADP on OpenShift 4.10
      2. Eventually the backup will hang

      Actual results:
      Backup hangs indefinitely

      Expected results:
      The backup should complete or return an error without hanging.

      Additional info:

      goroutine 1154 [IO wait, 34 minutes]:
      internal/poll.runtime_pollWait(0x7fc4a63fc2a0, 0x77, 0xffffffffffffffff)
      /usr/local/go/src/runtime/netpoll.go:227 +0x55
      internal/poll.(*pollDesc).wait(0xc000246cd8, 0x77, 0x1, 0xf5, 0xffffffffffffffff)
      /usr/local/go/src/internal/poll/fd_poll_runtime.go:87 +0x45
      internal/poll.(*pollDesc).waitWrite(...)
      /usr/local/go/src/internal/poll/fd_poll_runtime.go:96
      internal/poll.(*FD).Write(0xc000246cc0, 0xc000878000, 0xa8, 0xf5, 0x0, 0x0, 0x0)
      /usr/local/go/src/internal/poll/fd_unix.go:282 +0x325
      os.(*File).write(...)
      /usr/local/go/src/os/file_posix.go:48
      os.(*File).Write(0xc000342038, 0xc000878000, 0xa8, 0xf5, 0xc000936a00, 0x4, 0xc0009beb90)
      /usr/local/go/src/os/file.go:174 +0x8e
      k8s.io/klog/v2.(*loggingT).output(0x266eaa0, 0xc000000001, 0x0, 0x0, 0xc000876000, 0x1, 0x1f1a7d4, 0xb, 0x46, 0x40e000)
      /go/pkg/mod/k8s.io/klog/v2@v2.9.0/klog.go:924 +0x905
      k8s.io/klog/v2.(*loggingT).printDepth(0x266eaa0, 0xc000000001, 0x0, 0x0, 0x0, 0x0, 0x1, 0xc0003ec050, 0x1, 0x1)
      /go/pkg/mod/k8s.io/klog/v2@v2.9.0/klog.go:735 +0x185
      k8s.io/klog/v2.(*loggingT).print(...)
      /go/pkg/mod/k8s.io/klog/v2@v2.9.0/klog.go:717
      k8s.io/klog/v2.Warning(...)
      /go/pkg/mod/k8s.io/klog/v2@v2.9.0/klog.go:1426
      k8s.io/client-go/rest.WarningLogger.HandleWarningHeader(0x12b, 0xc000392d84, 0x1, 0xc000990280, 0x79)
      /go/pkg/mod/k8s.io/client-go@v0.22.2/rest/warnings.go:70 +0xfe
      k8s.io/client-go/rest.handleWarnings(0xc00017e060, 0x1bd71c0, 0x269de80, 0xc00017e0c0, 0x1bd7120, 0xc0003ec030)
      /go/pkg/mod/k8s.io/client-go@v0.22.2/rest/warnings.go:144 +0x16d
      k8s.io/client-go/rest.(*Request).transformResponse(0xc000510500, 0xc00080afc0, 0xc000510700, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
      /go/pkg/mod/k8s.io/client-go@v0.22.2/rest/request.go:1148 +0x178
      k8s.io/client-go/rest.(*Request).Do.func1(0xc000510700, 0xc00080afc0)
      /go/pkg/mod/k8s.io/client-go@v0.22.2/rest/request.go:1039 +0x65
      k8s.io/client-go/rest.(*Request).request.func2.1(...)
      /go/pkg/mod/k8s.io/client-go@v0.22.2/rest/request.go:996
      k8s.io/client-go/rest.(*Request).request.func2(0xc00080afc0, 0xc0009bf388, 0xc000510500, 0xc000510700, 0x0, 0x0, 0xc0009bf290, 0xc0009bf360, 0x0)
      /go/pkg/mod/k8s.io/client-go@v0.22.2/rest/request.go:1021 +0x175
      k8s.io/client-go/rest.(*Request).request(0xc000510500, 0x1c07a70, 0xc000040058, 0xc0009bf388, 0x0, 0x0)
      /go/pkg/mod/k8s.io/client-go@v0.22.2/rest/request.go:1023 +0x55d
      k8s.io/client-go/rest.(*Request).Do(0xc000510500, 0x1c07a70, 0xc000040058, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
      /go/pkg/mod/k8s.io/client-go@v0.22.2/rest/request.go:1038 +0xf1
      github.com/kubernetes-csi/external-snapshotter/client/v4/clientset/versioned/typed/volumesnapshot/v1beta1.(*volumeSnapshotContents).Get(0xc000c63f20, 0x1c07a70, 0xc000040058, 0xc00128a720, 0x30, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
      /go/pkg/mod/github.com/kubernetes-csi/external-snapshotter/client/v4@v4.0.0/clientset/versioned/typed/volumesnapshot/v1beta1/volumesnapshotcontent.go:72 +0x193
      github.com/vmware-tanzu/velero-plugin-for-csi/internal/util.GetVolumeSnapshotContentForVolumeSnapshot(0xc00022c3c0, 0x1c07c30, 0xc00032b6b0, 0x1c23f70, 0xc0001e9d50, 0x0, 0x199, 0x0, 0x0)
      /go/src/velero-plugin-for-csi/internal/util/util.go:165 +0x47b
      github.com/vmware-tanzu/velero-plugin-for-csi/internal/backup.(*VolumeSnapshotBackupItemAction).Execute(0xc0001580a0, 0x1c15448, 0xc0003420b8, 0xc0009ce280, 0xc0009ce280, 0x0, 0x0, 0xc000a3e2a0, 0xc000a3e2a8, 0x0, ...)
      /go/src/velero-plugin-for-csi/internal/backup/volumesnapshot_action.go:92 +0x519
      github.com/vmware-tanzu/velero/pkg/plugin/framework.(*BackupItemActionGRPCServer).Execute(0xc00000e068, 0x1c07ae0, 0xc00078a930, 0xc0007c8780, 0x0, 0x0, 0x0)
      /go/pkg/mod/github.com/vmware-tanzu/velero@v1.7.0/pkg/plugin/framework/backup_item_action_server.go:101 +0x425
      github.com/vmware-tanzu/velero/pkg/plugin/generated._BackupItemAction_Execute_Handler(0x17d6820, 0xc00000e068, 0x1c07ae0, 0xc00078a930, 0xc000edc4e0, 0x0, 0x1c07ae0, 0xc00078a930, 0xc0009d6000, 0x1065)
      /go/pkg/mod/github.com/vmware-tanzu/velero@v1.7.0/pkg/plugin/generated/BackupItemAction.pb.go:258 +0x214
      google.golang.org/grpc.(*Server).processUnaryRPC(0xc000398380, 0x1c18798, 0xc000582a80, 0xc000510400, 0xc00038ca20, 0x264f198, 0x0, 0x0, 0x0)
      /go/pkg/mod/google.golang.org/grpc@v1.31.0/server.go:1180 +0x52b
      google.golang.org/grpc.(*Server).handleStream(0xc000398380, 0x1c18798, 0xc000582a80, 0xc000510400, 0x0)
      /go/pkg/mod/google.golang.org/grpc@v1.31.0/server.go:1503 +0xd0c
      google.golang.org/grpc.(*Server).serveStreams.func1.2(0xc000790070, 0xc000398380, 0x1c18798, 0xc000582a80, 0xc000510400)
      /go/pkg/mod/google.golang.org/grpc@v1.31.0/server.go:843 +0xab
      created by google.golang.org/grpc.(*Server).serveStreams.func1
      /go/pkg/mod/google.golang.org/grpc@v1.31.0/server.go:841 +0x1fd
