OpenShift Bugs / OCPBUGS-14400

CI fails on "events should not repeat pathologically", reason "FailedAttachVolume"


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Undefined
    • Affects Version/s: 4.13, 4.14
    • Component/s: Storage / Kubernetes
    • Severity: Moderate

    Description

      Description of problem

      CI is flaky because the test "[sig-arch] events should not repeat pathologically for namespace openshift-monitoring" fails with errors such as the following:

      {  5 events happened too frequently
      
      event happened 21 times, something is wrong: ns/openshift-monitoring pod/prometheus-k8s-1 hmsg/0a0f2e79b2 - pathological/true reason/FailedAttachVolume AttachVolume.Attach failed for volume "pvc-cf969581-a629-48c8-b8f5-6fe872e3b416" : rpc error: code = FailedPrecondition desc =  From: 15:40:19Z To: 15:40:20Z result=reject 
      event happened 22 times, something is wrong: ns/openshift-monitoring pod/prometheus-k8s-1 hmsg/0a0f2e79b2 - pathological/true reason/FailedAttachVolume AttachVolume.Attach failed for volume "pvc-cf969581-a629-48c8-b8f5-6fe872e3b416" : rpc error: code = FailedPrecondition desc =  From: 15:40:20Z To: 15:40:21Z result=reject 
      event happened 23 times, something is wrong: ns/openshift-monitoring pod/prometheus-k8s-1 hmsg/0a0f2e79b2 - pathological/true reason/FailedAttachVolume AttachVolume.Attach failed for volume "pvc-cf969581-a629-48c8-b8f5-6fe872e3b416" : rpc error: code = FailedPrecondition desc =  From: 15:40:21Z To: 15:40:22Z result=reject 
      event happened 24 times, something is wrong: ns/openshift-monitoring pod/prometheus-k8s-1 hmsg/0a0f2e79b2 - pathological/true reason/FailedAttachVolume AttachVolume.Attach failed for volume "pvc-cf969581-a629-48c8-b8f5-6fe872e3b416" : rpc error: code = FailedPrecondition desc =  From: 15:40:22Z To: 15:40:23Z result=reject 
      event happened 25 times, something is wrong: ns/openshift-monitoring pod/prometheus-k8s-1 hmsg/0a0f2e79b2 - pathological/true reason/FailedAttachVolume AttachVolume.Attach failed for volume "pvc-cf969581-a629-48c8-b8f5-6fe872e3b416" : rpc error: code = FailedPrecondition desc =  From: 15:40:24Z To: 15:40:25Z result=reject }
      

      This particular failure comes from https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-ingress-operator/927/pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-ovn-upgrade/1661372847848689664. Search.ci has other similar failures.
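
      For reference: the prow URL above maps directly onto a GCS path (the /view/gs/ segment), so the gathered files can be pulled down locally with gsutil. This is only a sketch; it assumes anonymous read access to the public origin-ci-test bucket and that the job publishes events.json somewhere under artifacts/ (the exact subdirectory varies by job):

      % gsutil ls -r 'gs://origin-ci-test/pr-logs/pull/openshift_cluster-ingress-operator/927/pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-ovn-upgrade/1661372847848689664/artifacts' | grep '/events.json$'
      % gsutil cp '<one of the paths printed above>' events.json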

      Version-Release number of selected component (if applicable)

      I have seen this in 4.13 and 4.14 CI jobs.

      How reproducible

      Presently, search.ci shows the following stats for the past two days:

      Found in 0.00% of runs (0.03% of failures) across 48334 total runs and 5208 jobs (14.76% failed)
      
      openshift-origin-27570-nightly-4.14-e2e-aws-sdn-upgrade (all) - 11 runs, 27% failed, 33% of failures match = 9% impact
      
      periodic-ci-openshift-release-master-nightly-4.13-e2e-aws-sdn-upgrade (all) - 19 runs, 74% failed, 7% of failures match = 5% impact
      

      Steps to Reproduce

      1. Post a PR and have bad luck.
      2. Check search.ci: https://search.ci.openshift.org/?search=pathological%2Ftrue+reason%2FFailedAttachVolume&maxAge=48h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
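
      A rough command-line version of step 2, in case the web UI is slow: this just fetches the same search.ci page and counts lines that mention the reason, so treat the number as a sanity check rather than a per-job breakdown:

      % curl -s 'https://search.ci.openshift.org/?search=pathological%2Ftrue+reason%2FFailedAttachVolume&maxAge=48h&context=1&type=junit' | grep -c 'reason/FailedAttachVolume'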

      Actual results

      CI fails.

      Expected results

      CI passes, or fails on some other test failure.

      Additional info

      The failures that I have seen are all on AWS. I only checked the past 2 days because search.ci failed to load with a longer time horizon.

      Continuing with https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-ingress-operator/927/pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-ovn-upgrade/1661372847848689664, events.json shows the following for this PVC:

      % jq -r < events.json '.items|.[]|select(.message|contains("pvc-cf969581-a629-48c8-b8f5-6fe872e3b416"))|.metadata.creationTimestamp+" ("+(.count//1|tostring)+"x) "+.reason+": "+.message'  
      2023-05-24T14:30:48Z (1x) ProvisioningSucceeded: Successfully provisioned volume pvc-cf969581-a629-48c8-b8f5-6fe872e3b416
      2023-05-24T14:30:51Z (1x) SuccessfulAttachVolume: AttachVolume.Attach succeeded for volume "pvc-cf969581-a629-48c8-b8f5-6fe872e3b416" 
      2023-05-24T14:33:48Z (1x) FailedAttachVolume: Multi-Attach error for volume "pvc-cf969581-a629-48c8-b8f5-6fe872e3b416" Volume is already exclusively attached to one node and can't be attached to another
      2023-05-24T14:34:18Z (1x) SuccessfulAttachVolume: AttachVolume.Attach succeeded for volume "pvc-cf969581-a629-48c8-b8f5-6fe872e3b416" 
      2023-05-24T15:37:35Z (1x) FailedAttachVolume: Multi-Attach error for volume "pvc-cf969581-a629-48c8-b8f5-6fe872e3b416" Volume is already exclusively attached to one node and can't be attached to another
      2023-05-24T15:39:43Z (25x) FailedAttachVolume: AttachVolume.Attach failed for volume "pvc-cf969581-a629-48c8-b8f5-6fe872e3b416" : rpc error: code = FailedPrecondition desc = 
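
      The same events.json can also be scanned for every event that tripped the pathological threshold, not just this PVC. Judging by the "happened 21 times, something is wrong" messages above, the check flags anything repeated more than 20 times, so a variant of the jq query above (same file layout assumed) is:

      % jq -r < events.json '.items[] | select((.count // 1) > 20) | "\(.count)x \(.involvedObject.namespace)/\(.involvedObject.name) \(.reason): \(.message)"'

      For this job, that surfaces the 25x FailedAttachVolume event on pod/prometheus-k8s-1 shown above.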
      

      In Loki, I see kube-controller-manager logging the following:

      E0524 15:39:43.755125       1 csi_attacher.go:733] kubernetes.io/csi: attachment for vol-0c87fcbafff9c4c64 failed: rpc error: code = FailedPrecondition desc = 
      I0524 15:39:43.755181       1 actual_state_of_world.go:360] "Volume is already added to attachedVolume list to node, update device path" volumeName=kubernetes.io/csi/ebs.csi.aws.com^vol-0c87fcbafff9c4c64 node="ip-10-0-165-4.us-west-2.compute.internal" devicePath=""
      E0524 15:39:43.755259       1 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/ebs.csi.aws.com^vol-0c87fcbafff9c4c64 podName: nodeName:}" failed. No retries permitted until 2023-05-24 15:39:44.255235682 +0000 UTC m=+2325.267069021 (durationBeforeRetry 500ms). Error: AttachVolume.Attach failed for volume "pvc-cf969581-a629-48c8-b8f5-6fe872e3b416" (UniqueName: "kubernetes.io/csi/ebs.csi.aws.com^vol-0c87fcbafff9c4c64") from node "ip-10-0-165-4.us-west-2.compute.internal" : rpc error: code = FailedPrecondition desc = 
      I0524 15:39:43.755353       1 event.go:307] "Event occurred" object="openshift-monitoring/prometheus-k8s-1" fieldPath="" kind="Pod" apiVersion="v1" type="Warning" reason="FailedAttachVolume" message="AttachVolume.Attach failed for volume \"pvc-cf969581-a629-48c8-b8f5-6fe872e3b416\" : rpc error: code = FailedPrecondition desc = "
      I0524 15:39:43.805404       1 reconciler.go:273] "attacherDetacher.DetachVolume started" volume={AttachedVolume:{VolumeName:kubernetes.io/csi/ebs.csi.aws.com^vol-0c87fcbafff9c4c64 VolumeSpec:0xc004747f68 NodeName:ip-10-0-145-90.us-west-2.compute.internal PluginIsAttachable:true DevicePath: DeviceMountPath: PluginName: SELinuxMountContext:} MountedByNode:false DetachRequestedTime:2023-05-24 15:38:36.637403574 +0000 UTC m=+2257.649236914}
      I0524 15:39:43.809079       1 operation_generator.go:1626] Verified volume is safe to detach for volume "pvc-cf969581-a629-48c8-b8f5-6fe872e3b416" (UniqueName: "kubernetes.io/csi/ebs.csi.aws.com^vol-0c87fcbafff9c4c64") on node "ip-10-0-145-90.us-west-2.compute.internal" 
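
      For jobs where Loki is inconvenient, the same kube-controller-manager messages can usually be grepped out of the pod logs that the job gathers into its artifacts; the path below is illustrative and depends on the job's gather-extra layout:

      % grep -h 'vol-0c87fcbafff9c4c64' artifacts/*/gather-extra/artifacts/pods/openshift-kube-controller-manager_*.log | grep -E 'FailedPrecondition|DetachVolume started'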
      

      Besides this failure, search.ci shows two matches for pathological/true reason/FailedAttachVolume in the past 2 days (the two jobs listed under "How reproducible" above).

          People

            hekumar@redhat.com Hemant Kumar
            mmasters1@redhat.com Miciah Masters
            Penghao Wang