Bug
Resolution: Done
Major
None
This is a mirror of Bugzilla bug https://bugzilla.redhat.com/show_bug.cgi?id=2083353; the BZ content is copied here for reference.
Description of problem (please be as detailed as possible and provide log snippets):
We have encountered consistent hanging of CSI backups via Ceph on OpenShift v4.10.3. We successfully took 5 backups with Ceph CSI snapshots for 34+ PVCs via OADP v1.0.2, but the next backup always hangs indefinitely in the velero-plugin-for-csi plugin. There are no errors in the velero log and the backup stays stuck in the InProgress state. The problem is consistent and reproducible on another OpenShift v4.10 cluster. (Kubernetes version: v1.23.3+e419edf)
The last log messages from that plugin in the velero log were:
time="2022-04-16T00:15:21Z" level=info msg="Executing custom action" backup=oadp/ckpt-backup-1f logSource="pkg/backup/item_backupper.go:327" name=velero-zookeeper-data-zookeeper-0-2pjzv namespace=wkc resource=volumesnapshots.snapshot.storage.k8s.io
time="2022-04-16T00:15:21Z" level=info msg="Executing VolumeSnapshotBackupItemAction" backup=oadp/ckpt-backup-1f cmd=/plugins/velero-plugin-for-csi logSource="/remote-source/app/internal/backup/volumesnapshot_action.go:58" pluginName=velero-plugin-for-csi
time="2022-04-16T00:15:21Z" level=info msg="Getting VolumesnapshotContent for Volumesnapshot wkc/velero-zookeeper-data-zookeeper-0-2pjzv" backup=oadp/ckpt-backup-1f cmd=/plugins/velero-plugin-for-csi logSource="/remote-source/app/internal/backup/volumesnapshot_action.go:91" pluginName=velero-plugin-for-csi
Version of all relevant components (if applicable):
OADP version: v1.0.2
Velero version: v1.7.1
velero-plugin-for-csi version: 0.2.0
oc version
Client Version: 4.10.3
Server Version: 4.10.3
Kubernetes Version: v1.23.3+e419edf
Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?
Unable to perform further backups
Is there any workaround available to the best of your knowledge?
See explanation below:
I debugged this further and got a stack dump of the velero-plugin-for-csi gRPC server. It appears the goroutine was stuck in an IO wait state (see the goroutine dump under Additional info below). It was writing deprecation warnings from client-go via klog.Warning(), which normally writes to stderr. The warnings were about VolumeSnapshot and VolumeSnapshotContent being deprecated. (These warning messages were actually going into a pipe; see the explanation below.)
W0422 22:59:43.573612 66770 warnings.go:70] snapshot.storage.k8s.io/v1beta1 VolumeSnapshot is deprecated; use snapshot.storage.k8s.io/v1 VolumeSnapshot
W0422 22:55:37.607548 66737 warnings.go:70] snapshot.storage.k8s.io/v1beta1 VolumeSnapshotContent is deprecated; use snapshot.storage.k8s.io/v1 VolumeSnapshotContent
It seems strange at first that a goroutine could be blocked writing to stderr. After digging deeper into the go-plugin code (my understanding is that Velero uses go-plugin v1.0.1 as of this writing), it turns out that the go-plugin framework redirects both os.Stdout and os.Stderr to a pipe in its Serve() function:
https://github.com/hashicorp/go-plugin/blob/9e3e1c37db188a1acb66561ee0ed4bf4d5e77554/server.go#L350
However, in go-plugin v1.0.1 I don't see any code that consumes from those pipes, which explains why the plugin eventually blocked: the accumulated warning messages filled the pipe to its maximum capacity. I went through the go-plugin GitHub issues and found this comment, which confirms my understanding:
hashicorp/go-plugin#147 (comment)
It appears this is a known issue that was addressed in go-plugin releases after v1.0.1.
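To make the failure mode concrete, here is a minimal standalone sketch (my own illustration, not code from go-plugin or velero-plugin-for-csi) of what happens once os.Stderr is swapped for the write end of a pipe that nothing reads: writes succeed until the kernel pipe buffer (typically 64 KiB on Linux) fills, and then the writing goroutine blocks forever.

package main

import (
	"fmt"
	"log"
	"os"
)

func main() {
	// Simulate what go-plugin v1.0.1 does in Serve(): replace stderr with
	// the write end of a pipe. In this sketch nothing ever reads from r.
	r, w, err := os.Pipe()
	if err != nil {
		log.Fatal(err)
	}
	_ = r // no reader: the pipe buffer is never drained
	os.Stderr = w

	// Each deprecation warning from client-go/klog lands in the pipe.
	// Once the pipe buffer fills up, the next write blocks indefinitely,
	// which is the same IO wait seen in the goroutine dump under
	// Additional info below.
	msg := "W0422 22:59:43 warnings.go:70] v1beta1 VolumeSnapshot is deprecated\n"
	for i := 0; ; i++ {
		if _, err := os.Stderr.WriteString(msg); err != nil {
			log.Fatal(err)
		}
		fmt.Fprintf(os.Stdout, "wrote warning %d\n", i)
	}
}

Run standalone, this program prints several hundred progress lines and then hangs on the write, without ever returning an error.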
This issue can easily occur on newer Kubernetes versions when multiple CSI backups are created over time: eventually a backup gets stuck and no further backups can be performed. Updating the CSI plugin to use snapshot.storage.k8s.io/v1, or installing a custom rest.WarningHandler (see the sketch after this paragraph), could work around it, but this looks like a general Velero plugin issue: any plugin's processing can hang this way, so it needs to be addressed at the plugin framework level.
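As one possible mitigation inside the plugin (until a framework-level fix is available), client-go lets callers replace the warning handler so deprecation warnings never go through klog to stderr at all. The sketch below is only illustrative and assumes client-go v0.22.x; quietWarningHandler is a hypothetical name, not actual velero-plugin-for-csi code.

package main

import (
	"fmt"

	"k8s.io/client-go/rest"
)

// quietWarningHandler drops client-go deprecation warnings instead of letting
// the default handler send them through klog to stderr, which inside a
// go-plugin v1.0.1 process is a pipe that nothing reads. Hypothetical type,
// shown only for illustration.
type quietWarningHandler struct{}

func (quietWarningHandler) HandleWarningHeader(code int, agent string, message string) {
	// Intentionally ignore the warning (or forward it to the plugin's
	// own logger); never write to os.Stderr.
}

func main() {
	// Option 1: process-wide -- every client-go client built in this
	// process stops logging warnings via klog.
	rest.SetDefaultWarningHandler(rest.NoWarnings{})

	// Option 2: per-config -- attach the handler only to the config used
	// to build the snapshot clientset. The Host value here is a placeholder.
	cfg := &rest.Config{Host: "https://kubernetes.default.svc"}
	cfg.WarningHandler = quietWarningHandler{}

	fmt.Println("warning handler installed:", cfg.WarningHandler != nil)
}

Either approach only keeps the pipe from filling up; the underlying go-plugin behavior would still need the upstream fix.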
I have reported this for Velero at
https://github.com/vmware-tanzu/velero/issues/4863
and they have targeted a fix for v1.9.0. What is the plan for OADP?
Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2
Is this issue reproducible?
Yes
Can this issue be reproduced from the UI?
If this is a regression, please provide more details to justify this:
Steps to Reproduce:
1. Perform multiple CSI snapshot backups using OADP on OpenShift 4.10
2. Eventually the backup will hang
Actual results:
Backup hangs indefinitely
Expected results:
The backup should complete or return an error instead of hanging.
Additional info:
goroutine 1154 [IO wait, 34 minutes]:
internal/poll.runtime_pollWait(0x7fc4a63fc2a0, 0x77, 0xffffffffffffffff)
/usr/local/go/src/runtime/netpoll.go:227 +0x55
internal/poll.(*pollDesc).wait(0xc000246cd8, 0x77, 0x1, 0xf5, 0xffffffffffffffff)
/usr/local/go/src/internal/poll/fd_poll_runtime.go:87 +0x45
internal/poll.(*pollDesc).waitWrite(...)
/usr/local/go/src/internal/poll/fd_poll_runtime.go:96
internal/poll.(*FD).Write(0xc000246cc0, 0xc000878000, 0xa8, 0xf5, 0x0, 0x0, 0x0)
/usr/local/go/src/internal/poll/fd_unix.go:282 +0x325
os.(*File).write(...)
/usr/local/go/src/os/file_posix.go:48
os.(*File).Write(0xc000342038, 0xc000878000, 0xa8, 0xf5, 0xc000936a00, 0x4, 0xc0009beb90)
/usr/local/go/src/os/file.go:174 +0x8e
k8s.io/klog/v2.(*loggingT).output(0x266eaa0, 0xc000000001, 0x0, 0x0, 0xc000876000, 0x1, 0x1f1a7d4, 0xb, 0x46, 0x40e000)
/go/pkg/mod/k8s.io/klog/v2@v2.9.0/klog.go:924 +0x905
k8s.io/klog/v2.(*loggingT).printDepth(0x266eaa0, 0xc000000001, 0x0, 0x0, 0x0, 0x0, 0x1, 0xc0003ec050, 0x1, 0x1)
/go/pkg/mod/k8s.io/klog/v2@v2.9.0/klog.go:735 +0x185
k8s.io/klog/v2.(*loggingT).print(...)
/go/pkg/mod/k8s.io/klog/v2@v2.9.0/klog.go:717
k8s.io/klog/v2.Warning(...)
/go/pkg/mod/k8s.io/klog/v2@v2.9.0/klog.go:1426
k8s.io/client-go/rest.WarningLogger.HandleWarningHeader(0x12b, 0xc000392d84, 0x1, 0xc000990280, 0x79)
/go/pkg/mod/k8s.io/client-go@v0.22.2/rest/warnings.go:70 +0xfe
k8s.io/client-go/rest.handleWarnings(0xc00017e060, 0x1bd71c0, 0x269de80, 0xc00017e0c0, 0x1bd7120, 0xc0003ec030)
/go/pkg/mod/k8s.io/client-go@v0.22.2/rest/warnings.go:144 +0x16d
k8s.io/client-go/rest.(*Request).transformResponse(0xc000510500, 0xc00080afc0, 0xc000510700, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/go/pkg/mod/k8s.io/client-go@v0.22.2/rest/request.go:1148 +0x178
k8s.io/client-go/rest.(*Request).Do.func1(0xc000510700, 0xc00080afc0)
/go/pkg/mod/k8s.io/client-go@v0.22.2/rest/request.go:1039 +0x65
k8s.io/client-go/rest.(*Request).request.func2.1(...)
/go/pkg/mod/k8s.io/client-go@v0.22.2/rest/request.go:996
k8s.io/client-go/rest.(*Request).request.func2(0xc00080afc0, 0xc0009bf388, 0xc000510500, 0xc000510700, 0x0, 0x0, 0xc0009bf290, 0xc0009bf360, 0x0)
/go/pkg/mod/k8s.io/client-go@v0.22.2/rest/request.go:1021 +0x175
k8s.io/client-go/rest.(*Request).request(0xc000510500, 0x1c07a70, 0xc000040058, 0xc0009bf388, 0x0, 0x0)
/go/pkg/mod/k8s.io/client-go@v0.22.2/rest/request.go:1023 +0x55d
k8s.io/client-go/rest.(*Request).Do(0xc000510500, 0x1c07a70, 0xc000040058, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/go/pkg/mod/k8s.io/client-go@v0.22.2/rest/request.go:1038 +0xf1
github.com/kubernetes-csi/external-snapshotter/client/v4/clientset/versioned/typed/volumesnapshot/v1beta1.(*volumeSnapshotContents).Get(0xc000c63f20, 0x1c07a70, 0xc000040058, 0xc00128a720, 0x30, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/go/pkg/mod/github.com/kubernetes-csi/external-snapshotter/client/v4@v4.0.0/clientset/versioned/typed/volumesnapshot/v1beta1/volumesnapshotcontent.go:72 +0x193
github.com/vmware-tanzu/velero-plugin-for-csi/internal/util.GetVolumeSnapshotContentForVolumeSnapshot(0xc00022c3c0, 0x1c07c30, 0xc00032b6b0, 0x1c23f70, 0xc0001e9d50, 0x0, 0x199, 0x0, 0x0)
/go/src/velero-plugin-for-csi/internal/util/util.go:165 +0x47b
github.com/vmware-tanzu/velero-plugin-for-csi/internal/backup.(*VolumeSnapshotBackupItemAction).Execute(0xc0001580a0, 0x1c15448, 0xc0003420b8, 0xc0009ce280, 0xc0009ce280, 0x0, 0x0, 0xc000a3e2a0, 0xc000a3e2a8, 0x0, ...)
/go/src/velero-plugin-for-csi/internal/backup/volumesnapshot_action.go:92 +0x519
github.com/vmware-tanzu/velero/pkg/plugin/framework.(*BackupItemActionGRPCServer).Execute(0xc00000e068, 0x1c07ae0, 0xc00078a930, 0xc0007c8780, 0x0, 0x0, 0x0)
/go/pkg/mod/github.com/vmware-tanzu/velero@v1.7.0/pkg/plugin/framework/backup_item_action_server.go:101 +0x425
github.com/vmware-tanzu/velero/pkg/plugin/generated._BackupItemAction_Execute_Handler(0x17d6820, 0xc00000e068, 0x1c07ae0, 0xc00078a930, 0xc000edc4e0, 0x0, 0x1c07ae0, 0xc00078a930, 0xc0009d6000, 0x1065)
/go/pkg/mod/github.com/vmware-tanzu/velero@v1.7.0/pkg/plugin/generated/BackupItemAction.pb.go:258 +0x214
google.golang.org/grpc.(*Server).processUnaryRPC(0xc000398380, 0x1c18798, 0xc000582a80, 0xc000510400, 0xc00038ca20, 0x264f198, 0x0, 0x0, 0x0)
/go/pkg/mod/google.golang.org/grpc@v1.31.0/server.go:1180 +0x52b
google.golang.org/grpc.(*Server).handleStream(0xc000398380, 0x1c18798, 0xc000582a80, 0xc000510400, 0x0)
/go/pkg/mod/google.golang.org/grpc@v1.31.0/server.go:1503 +0xd0c
google.golang.org/grpc.(*Server).serveStreams.func1.2(0xc000790070, 0xc000398380, 0x1c18798, 0xc000582a80, 0xc000510400)
/go/pkg/mod/google.golang.org/grpc@v1.31.0/server.go:843 +0xab
created by google.golang.org/grpc.(*Server).serveStreams.func1
/go/pkg/mod/google.golang.org/grpc@v1.31.0/server.go:841 +0x1fd