Details
- Bug
- Resolution: Unresolved
- Undefined
- None
- 4.13, 4.14
- None
- Moderate
- No
- False
Description
Description of problem
CI is flaky because the test "[sig-arch] events should not repeat pathologically for namespace openshift-monitoring" fails with errors such as the following:
{  5 events happened too frequently

event happened 21 times, something is wrong: ns/openshift-monitoring pod/prometheus-k8s-1 hmsg/0a0f2e79b2 - pathological/true reason/FailedAttachVolume AttachVolume.Attach failed for volume "pvc-cf969581-a629-48c8-b8f5-6fe872e3b416" : rpc error: code = FailedPrecondition desc =  From: 15:40:19Z To: 15:40:20Z result=reject
event happened 22 times, something is wrong: ns/openshift-monitoring pod/prometheus-k8s-1 hmsg/0a0f2e79b2 - pathological/true reason/FailedAttachVolume AttachVolume.Attach failed for volume "pvc-cf969581-a629-48c8-b8f5-6fe872e3b416" : rpc error: code = FailedPrecondition desc =  From: 15:40:20Z To: 15:40:21Z result=reject
event happened 23 times, something is wrong: ns/openshift-monitoring pod/prometheus-k8s-1 hmsg/0a0f2e79b2 - pathological/true reason/FailedAttachVolume AttachVolume.Attach failed for volume "pvc-cf969581-a629-48c8-b8f5-6fe872e3b416" : rpc error: code = FailedPrecondition desc =  From: 15:40:21Z To: 15:40:22Z result=reject
event happened 24 times, something is wrong: ns/openshift-monitoring pod/prometheus-k8s-1 hmsg/0a0f2e79b2 - pathological/true reason/FailedAttachVolume AttachVolume.Attach failed for volume "pvc-cf969581-a629-48c8-b8f5-6fe872e3b416" : rpc error: code = FailedPrecondition desc =  From: 15:40:22Z To: 15:40:23Z result=reject
event happened 25 times, something is wrong: ns/openshift-monitoring pod/prometheus-k8s-1 hmsg/0a0f2e79b2 - pathological/true reason/FailedAttachVolume AttachVolume.Attach failed for volume "pvc-cf969581-a629-48c8-b8f5-6fe872e3b416" : rpc error: code = FailedPrecondition desc =  From: 15:40:24Z To: 15:40:25Z result=reject
}
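For context on how this failure mode arises: the monitor tallies repeats of each event (keyed by namespace, pod, and reason) and flags any that recur past a threshold; the "happened 21 times" wording above suggests a cutoff of 20. This is a rough sketch of that counting logic under those assumptions, not origin's actual implementation; the sample data simply mirrors the excerpt above:

```python
from collections import Counter

# Hypothetical sample: (namespace, pod, reason) tuples as the CI monitor
# might see them; the 25 repeats mirror the excerpt above.
events = (
    [("openshift-monitoring", "prometheus-k8s-1", "FailedAttachVolume")] * 25
    + [("openshift-monitoring", "prometheus-k8s-1", "SuccessfulAttachVolume")] * 2
)

THRESHOLD = 20  # assumed cutoff; repeats beyond this are flagged pathological

counts = Counter(events)
pathological = {key: n for key, n in counts.items() if n > THRESHOLD}
for (ns, pod, reason), n in pathological.items():
    print(f"event happened {n} times, something is wrong: "
          f"ns/{ns} pod/{pod} reason/{reason}")
```

The SuccessfulAttachVolume events stay under the threshold and are not reported; only the 25x FailedAttachVolume repeat trips the check.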
This particular failure comes from https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-ingress-operator/927/pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-ovn-upgrade/1661372847848689664. Search.ci has other similar failures.
Version-Release number of selected component (if applicable)
I have seen this in 4.13 and 4.14 CI jobs.
How reproducible
Presently, search.ci shows the following stats for the past two days:
Found in 0.00% of runs (0.03% of failures) across 48334 total runs and 5208 jobs (14.76% failed)
openshift-origin-27570-nightly-4.14-e2e-aws-sdn-upgrade (all) - 11 runs, 27% failed, 33% of failures match = 9% impact
periodic-ci-openshift-release-master-nightly-4.13-e2e-aws-sdn-upgrade (all) - 19 runs, 74% failed, 7% of failures match = 5% impact
Steps to Reproduce
1. Post a PR and have bad luck.
2. Check search.ci: https://search.ci.openshift.org/?search=pathological%2Ftrue+reason%2FFailedAttachVolume&maxAge=48h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
Actual results
CI fails.
Expected results
CI passes, or fails on some other test failure.
Additional info
The failures that I have seen are all on AWS. I only checked the past 2 days because search.ci failed to load with a longer time horizon.
Continuing with https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-ingress-operator/927/pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-ovn-upgrade/1661372847848689664, events.json shows the following for this PVC:
% jq -r < events.json '.items|.[]|select(.message|contains("pvc-cf969581-a629-48c8-b8f5-6fe872e3b416"))|.metadata.creationTimestamp+" ("+(.count//1|tostring)+"x) "+.reason+": "+.message'
2023-05-24T14:30:48Z (1x) ProvisioningSucceeded: Successfully provisioned volume pvc-cf969581-a629-48c8-b8f5-6fe872e3b416
2023-05-24T14:30:51Z (1x) SuccessfulAttachVolume: AttachVolume.Attach succeeded for volume "pvc-cf969581-a629-48c8-b8f5-6fe872e3b416"
2023-05-24T14:33:48Z (1x) FailedAttachVolume: Multi-Attach error for volume "pvc-cf969581-a629-48c8-b8f5-6fe872e3b416" Volume is already exclusively attached to one node and can't be attached to another
2023-05-24T14:34:18Z (1x) SuccessfulAttachVolume: AttachVolume.Attach succeeded for volume "pvc-cf969581-a629-48c8-b8f5-6fe872e3b416"
2023-05-24T15:37:35Z (1x) FailedAttachVolume: Multi-Attach error for volume "pvc-cf969581-a629-48c8-b8f5-6fe872e3b416" Volume is already exclusively attached to one node and can't be attached to another
2023-05-24T15:39:43Z (25x) FailedAttachVolume: AttachVolume.Attach failed for volume "pvc-cf969581-a629-48c8-b8f5-6fe872e3b416" : rpc error: code = FailedPrecondition desc =
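For anyone without jq handy, the same filter is easy to replicate in Python against the downloaded events.json; the inline sample below is a minimal stand-in with the same item shape (metadata.creationTimestamp, optional count, reason, message), not the real artifact:

```python
# Minimal stand-in for the CI run's events.json; same item shape the
# jq filter above reads. Only a subset of the real events is included.
PVC = "pvc-cf969581-a629-48c8-b8f5-6fe872e3b416"
doc = {
    "items": [
        {"metadata": {"creationTimestamp": "2023-05-24T14:30:48Z"},
         "reason": "ProvisioningSucceeded",
         "message": f"Successfully provisioned volume {PVC}"},
        {"metadata": {"creationTimestamp": "2023-05-24T14:31:00Z"},
         "reason": "Scheduled",
         "message": "unrelated event"},
        {"metadata": {"creationTimestamp": "2023-05-24T15:39:43Z"},
         "count": 25,
         "reason": "FailedAttachVolume",
         "message": f'AttachVolume.Attach failed for volume "{PVC}" : '
                    "rpc error: code = FailedPrecondition desc = "},
    ]
}

# Equivalent of: select(.message|contains(PVC)) | timestamp+" ("+count+"x) "+reason+": "+message
lines = [
    f'{it["metadata"]["creationTimestamp"]} ({it.get("count", 1)}x) '
    f'{it["reason"]}: {it["message"]}'
    for it in doc["items"]
    if PVC in it["message"]
]
print("\n".join(lines))
```

On the real file, swap the inline `doc` for `json.load(open("events.json"))`.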
In Loki, I see kube-controller-manager logging the following:
E0524 15:39:43.755125       1 csi_attacher.go:733] kubernetes.io/csi: attachment for vol-0c87fcbafff9c4c64 failed: rpc error: code = FailedPrecondition desc = 
I0524 15:39:43.755181       1 actual_state_of_world.go:360] "Volume is already added to attachedVolume list to node, update device path" volumeName=kubernetes.io/csi/ebs.csi.aws.com^vol-0c87fcbafff9c4c64 node="ip-10-0-165-4.us-west-2.compute.internal" devicePath=""
E0524 15:39:43.755259       1 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/ebs.csi.aws.com^vol-0c87fcbafff9c4c64 podName: nodeName:}" failed. No retries permitted until 2023-05-24 15:39:44.255235682 +0000 UTC m=+2325.267069021 (durationBeforeRetry 500ms). Error: AttachVolume.Attach failed for volume "pvc-cf969581-a629-48c8-b8f5-6fe872e3b416" (UniqueName: "kubernetes.io/csi/ebs.csi.aws.com^vol-0c87fcbafff9c4c64") from node "ip-10-0-165-4.us-west-2.compute.internal" : rpc error: code = FailedPrecondition desc = 
I0524 15:39:43.755353       1 event.go:307] "Event occurred" object="openshift-monitoring/prometheus-k8s-1" fieldPath="" kind="Pod" apiVersion="v1" type="Warning" reason="FailedAttachVolume" message="AttachVolume.Attach failed for volume \"pvc-cf969581-a629-48c8-b8f5-6fe872e3b416\" : rpc error: code = FailedPrecondition desc = "
I0524 15:39:43.805404       1 reconciler.go:273] "attacherDetacher.DetachVolume started" volume={AttachedVolume:{VolumeName:kubernetes.io/csi/ebs.csi.aws.com^vol-0c87fcbafff9c4c64 VolumeSpec:0xc004747f68 NodeName:ip-10-0-145-90.us-west-2.compute.internal PluginIsAttachable:true DevicePath: DeviceMountPath: PluginName: SELinuxMountContext:} MountedByNode:false DetachRequestedTime:2023-05-24 15:38:36.637403574 +0000 UTC m=+2257.649236914}
I0524 15:39:43.809079       1 operation_generator.go:1626] Verified volume is safe to detach for volume "pvc-cf969581-a629-48c8-b8f5-6fe872e3b416" (UniqueName: "kubernetes.io/csi/ebs.csi.aws.com^vol-0c87fcbafff9c4c64") on node "ip-10-0-145-90.us-west-2.compute.internal"
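Note that the `desc = ` at the end of these messages is empty: the EBS CSI driver returned FailedPrecondition with no description at all, which is itself worth chasing. A quick regex sketch for pulling the volume ID, gRPC code, and (empty) description out of such kube-controller-manager lines; the sample line is copied from the excerpt above and the regex is just an illustration:

```python
import re

# One of the kube-controller-manager lines from the excerpt above.
line = ('E0524 15:39:43.755125 1 csi_attacher.go:733] kubernetes.io/csi: '
        'attachment for vol-0c87fcbafff9c4c64 failed: rpc error: '
        'code = FailedPrecondition desc = ')

# Capture: volume ID, gRPC status code, and whatever follows "desc = ".
m = re.search(r"attachment for (\S+) failed: rpc error: code = (\w+) desc = (.*)$", line)
vol, code, desc = m.groups()
# desc is "": the driver gave no detail alongside the FailedPrecondition.
print(vol, code, repr(desc))
```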
Besides this failure, search.ci shows two matches for pathological/true reason/FailedAttachVolume in the past 2 days: