Details
- Bug
- Resolution: Unresolved
- Undefined
- None
- 4.13, 4.14
- None
- Moderate
- No
- False
Description
Description of problem
CI is flaky because the test "[sig-arch] events should not repeat pathologically for namespace openshift-monitoring" fails with errors such as the following:
{  5 events happened too frequently

event happened 21 times, something is wrong: ns/openshift-monitoring pod/prometheus-k8s-1 hmsg/0a0f2e79b2 - pathological/true reason/FailedAttachVolume AttachVolume.Attach failed for volume "pvc-cf969581-a629-48c8-b8f5-6fe872e3b416" : rpc error: code = FailedPrecondition desc =  From: 15:40:19Z To: 15:40:20Z result=reject
event happened 22 times, something is wrong: ns/openshift-monitoring pod/prometheus-k8s-1 hmsg/0a0f2e79b2 - pathological/true reason/FailedAttachVolume AttachVolume.Attach failed for volume "pvc-cf969581-a629-48c8-b8f5-6fe872e3b416" : rpc error: code = FailedPrecondition desc =  From: 15:40:20Z To: 15:40:21Z result=reject
event happened 23 times, something is wrong: ns/openshift-monitoring pod/prometheus-k8s-1 hmsg/0a0f2e79b2 - pathological/true reason/FailedAttachVolume AttachVolume.Attach failed for volume "pvc-cf969581-a629-48c8-b8f5-6fe872e3b416" : rpc error: code = FailedPrecondition desc =  From: 15:40:21Z To: 15:40:22Z result=reject
event happened 24 times, something is wrong: ns/openshift-monitoring pod/prometheus-k8s-1 hmsg/0a0f2e79b2 - pathological/true reason/FailedAttachVolume AttachVolume.Attach failed for volume "pvc-cf969581-a629-48c8-b8f5-6fe872e3b416" : rpc error: code = FailedPrecondition desc =  From: 15:40:22Z To: 15:40:23Z result=reject
event happened 25 times, something is wrong: ns/openshift-monitoring pod/prometheus-k8s-1 hmsg/0a0f2e79b2 - pathological/true reason/FailedAttachVolume AttachVolume.Attach failed for volume "pvc-cf969581-a629-48c8-b8f5-6fe872e3b416" : rpc error: code = FailedPrecondition desc =  From: 15:40:24Z To: 15:40:25Z result=reject
}
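For context on how this failure mode arises: the monitor tallies repeats of each event (keyed by namespace, pod, and reason) and flags any that recur past a threshold; the "happened 21 times" wording above suggests a cutoff of 20. This is a rough sketch of that counting logic under those assumptions, not origin's actual implementation; the sample data simply mirrors the excerpt above:

```python
from collections import Counter

# Hypothetical sample: (namespace, pod, reason) tuples as the CI monitor
# might see them; the 25 repeats mirror the excerpt above.
events = (
    [("openshift-monitoring", "prometheus-k8s-1", "FailedAttachVolume")] * 25
    + [("openshift-monitoring", "prometheus-k8s-1", "SuccessfulAttachVolume")] * 2
)

THRESHOLD = 20  # assumed cutoff; repeats beyond this are flagged pathological

counts = Counter(events)
pathological = {key: n for key, n in counts.items() if n > THRESHOLD}
for (ns, pod, reason), n in pathological.items():
    print(f"event happened {n} times, something is wrong: "
          f"ns/{ns} pod/{pod} reason/{reason}")
```

The SuccessfulAttachVolume events stay under the threshold and are not reported; only the 25x FailedAttachVolume repeat trips the check.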
This particular failure comes from https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-ingress-operator/927/pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-ovn-upgrade/1661372847848689664. Search.ci has other similar failures.
Version-Release number of selected component (if applicable)
I have seen this in 4.13 and 4.14 CI jobs.
How reproducible
Presently, search.ci shows the following stats for the past two days:
Found in 0.00% of runs (0.03% of failures) across 48334 total runs and 5208 jobs (14.76% failed)
openshift-origin-27570-nightly-4.14-e2e-aws-sdn-upgrade (all) - 11 runs, 27% failed, 33% of failures match = 9% impact
periodic-ci-openshift-release-master-nightly-4.13-e2e-aws-sdn-upgrade (all) - 19 runs, 74% failed, 7% of failures match = 5% impact
Steps to Reproduce
1. Post a PR and have bad luck.
2. Check search.ci: https://search.ci.openshift.org/?search=pathological%2Ftrue+reason%2FFailedAttachVolume&maxAge=48h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
Actual results
CI fails.
Expected results
CI passes, or fails on some other test failure.
Additional info
The failures that I have seen are all on AWS. I only checked the past 2 days because search.ci failed to load with a longer time horizon.
Continuing with https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-ingress-operator/927/pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-ovn-upgrade/1661372847848689664, events.json shows the following for this PVC:
% jq -r < events.json '.items|.[]|select(.message|contains("pvc-cf969581-a629-48c8-b8f5-6fe872e3b416"))|.metadata.creationTimestamp+" ("+(.count//1|tostring)+"x) "+.reason+": "+.message'
2023-05-24T14:30:48Z (1x) ProvisioningSucceeded: Successfully provisioned volume pvc-cf969581-a629-48c8-b8f5-6fe872e3b416
2023-05-24T14:30:51Z (1x) SuccessfulAttachVolume: AttachVolume.Attach succeeded for volume "pvc-cf969581-a629-48c8-b8f5-6fe872e3b416"
2023-05-24T14:33:48Z (1x) FailedAttachVolume: Multi-Attach error for volume "pvc-cf969581-a629-48c8-b8f5-6fe872e3b416" Volume is already exclusively attached to one node and can't be attached to another
2023-05-24T14:34:18Z (1x) SuccessfulAttachVolume: AttachVolume.Attach succeeded for volume "pvc-cf969581-a629-48c8-b8f5-6fe872e3b416"
2023-05-24T15:37:35Z (1x) FailedAttachVolume: Multi-Attach error for volume "pvc-cf969581-a629-48c8-b8f5-6fe872e3b416" Volume is already exclusively attached to one node and can't be attached to another
2023-05-24T15:39:43Z (25x) FailedAttachVolume: AttachVolume.Attach failed for volume "pvc-cf969581-a629-48c8-b8f5-6fe872e3b416" : rpc error: code = FailedPrecondition desc =
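For anyone without jq handy, the same filter is easy to replicate in Python against the downloaded events.json; the inline sample below is a minimal stand-in with the same item shape (metadata.creationTimestamp, optional count, reason, message), not the real artifact:

```python
# Minimal stand-in for the CI run's events.json; same item shape the
# jq filter above reads. Only a subset of the real events is included.
PVC = "pvc-cf969581-a629-48c8-b8f5-6fe872e3b416"
doc = {
    "items": [
        {"metadata": {"creationTimestamp": "2023-05-24T14:30:48Z"},
         "reason": "ProvisioningSucceeded",
         "message": f"Successfully provisioned volume {PVC}"},
        {"metadata": {"creationTimestamp": "2023-05-24T14:31:00Z"},
         "reason": "Scheduled",
         "message": "unrelated event"},
        {"metadata": {"creationTimestamp": "2023-05-24T15:39:43Z"},
         "count": 25,
         "reason": "FailedAttachVolume",
         "message": f'AttachVolume.Attach failed for volume "{PVC}" : '
                    "rpc error: code = FailedPrecondition desc = "},
    ]
}

# Equivalent of: select(.message|contains(PVC)) | timestamp+" ("+count+"x) "+reason+": "+message
lines = [
    f'{it["metadata"]["creationTimestamp"]} ({it.get("count", 1)}x) '
    f'{it["reason"]}: {it["message"]}'
    for it in doc["items"]
    if PVC in it["message"]
]
print("\n".join(lines))
```

On the real file, swap the inline `doc` for `json.load(open("events.json"))`.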
In Loki, I see kube-controller-manager logging the following:
E0524 15:39:43.755125       1 csi_attacher.go:733] kubernetes.io/csi: attachment for vol-0c87fcbafff9c4c64 failed: rpc error: code = FailedPrecondition desc = 
I0524 15:39:43.755181       1 actual_state_of_world.go:360] "Volume is already added to attachedVolume list to node, update device path" volumeName=kubernetes.io/csi/ebs.csi.aws.com^vol-0c87fcbafff9c4c64 node="ip-10-0-165-4.us-west-2.compute.internal" devicePath=""
E0524 15:39:43.755259       1 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/ebs.csi.aws.com^vol-0c87fcbafff9c4c64 podName: nodeName:}" failed. No retries permitted until 2023-05-24 15:39:44.255235682 +0000 UTC m=+2325.267069021 (durationBeforeRetry 500ms). Error: AttachVolume.Attach failed for volume "pvc-cf969581-a629-48c8-b8f5-6fe872e3b416" (UniqueName: "kubernetes.io/csi/ebs.csi.aws.com^vol-0c87fcbafff9c4c64") from node "ip-10-0-165-4.us-west-2.compute.internal" : rpc error: code = FailedPrecondition desc = 
I0524 15:39:43.755353       1 event.go:307] "Event occurred" object="openshift-monitoring/prometheus-k8s-1" fieldPath="" kind="Pod" apiVersion="v1" type="Warning" reason="FailedAttachVolume" message="AttachVolume.Attach failed for volume \"pvc-cf969581-a629-48c8-b8f5-6fe872e3b416\" : rpc error: code = FailedPrecondition desc = "
I0524 15:39:43.805404       1 reconciler.go:273] "attacherDetacher.DetachVolume started" volume={AttachedVolume:{VolumeName:kubernetes.io/csi/ebs.csi.aws.com^vol-0c87fcbafff9c4c64 VolumeSpec:0xc004747f68 NodeName:ip-10-0-145-90.us-west-2.compute.internal PluginIsAttachable:true DevicePath: DeviceMountPath: PluginName: SELinuxMountContext:} MountedByNode:false DetachRequestedTime:2023-05-24 15:38:36.637403574 +0000 UTC m=+2257.649236914}
I0524 15:39:43.809079       1 operation_generator.go:1626] Verified volume is safe to detach for volume "pvc-cf969581-a629-48c8-b8f5-6fe872e3b416" (UniqueName: "kubernetes.io/csi/ebs.csi.aws.com^vol-0c87fcbafff9c4c64") on node "ip-10-0-145-90.us-west-2.compute.internal"
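Note that the `desc = ` at the end of these messages is empty: the EBS CSI driver returned FailedPrecondition with no description at all, which is itself worth chasing. A quick regex sketch for pulling the volume ID, gRPC code, and (empty) description out of such kube-controller-manager lines; the sample line is copied from the excerpt above and the regex is just an illustration:

```python
import re

# One of the kube-controller-manager lines from the excerpt above.
line = ('E0524 15:39:43.755125 1 csi_attacher.go:733] kubernetes.io/csi: '
        'attachment for vol-0c87fcbafff9c4c64 failed: rpc error: '
        'code = FailedPrecondition desc = ')

# Capture: volume ID, gRPC status code, and whatever follows "desc = ".
m = re.search(r"attachment for (\S+) failed: rpc error: code = (\w+) desc = (.*)$", line)
vol, code, desc = m.groups()
# desc is "": the driver gave no detail alongside the FailedPrecondition.
print(vol, code, repr(desc))
```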
Besides this failure, search.ci shows two matches for pathological/true reason/FailedAttachVolume in the past 2 days: