OpenShift Bugs / OCPBUGS-65895

[AWS EFS CSI Driver Operator] efs_tags_controller failed to refresh cached credentials on hypershift hosted cluster


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • 4.21
    • 4.21, 4.20.z
    • Storage / Operators
    • Proposed

      This is a clone of issue OCPBUGS-65858. The following is the description of the original issue:

      Description of problem:

      [AWS EFS CSI Driver Operator] efs_tags_controller failed to refresh cached credentials on hypershift hosted cluster    

      Version-Release number of selected component (if applicable):

       4.21.0-0-2025-11-19-092506-test-ci-ln-mm7xb0b-latest   

      How reproducible:

      Always    

      Steps to Reproduce:

          1. Install a HyperShift hosted cluster on AWS.
          2. Install the EFS CSI driver operator and driver on the hosted cluster.
          3. Annotate the aws-efs-csi-driver-operator service account so that the efs_tags_controller can use the "web_identity_token_file":
          $ och patch sa/aws-efs-csi-driver-operator -n openshift-cluster-csi-drivers --type='merge' -p "{\"metadata\": {\"annotations\": {\"eks.amazonaws.com/role-arn\": \"arn:aws:iam::301721915996:role/hypershift-ci-362507-efs-operator-role\", \"eks.amazonaws.com/audience\": \"sts.amazonaws.com\"}}}"
          4. Create a PVC with the EFS storage class and wait until it is Bound to a PV (that is, until the access point has been provisioned).
          5. Patch the HostedCluster with an extra resource tag:
          $ oc patch -n clusters hostedcluster/hypershift-ci-362507 --type=merge -p '{"spec":{"platform":{"aws":{"resourceTags":[{"key":"ocp_storage_owned","value":"true"}]}}}}'
          6. Wait for the efs_tags_controller to sync the tags, then check that the PV's access point carries the new tag.
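Step 6 can be scripted. The following is a minimal bash sketch, assuming that the PV's `spec.csi.volumeHandle` has the `fs-…::fsap-…` form seen in the operator log below, and that the AWS CLI is configured with credentials and the region of the hosted cluster's account. The helper names `access_point_id`, `wait_for` and `tag_is_set` are hypothetical, and the 600 s timeout and 30 s interval are arbitrary.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for step 6; not part of the operator or its tooling.

# Print the access point ID from an EFS CSI volume handle of the form
# "<filesystem-id>::<access-point-id>", e.g. "fs-...::fsap-...".
access_point_id() {
  local handle="$1"
  printf '%s\n' "${handle#*::}"
}

# Re-run a command until it succeeds or the timeout (in seconds) expires.
# Usage: wait_for <timeout> <interval> <command> [args...]
wait_for() {
  local timeout="$1" interval="$2"
  shift 2
  local deadline=$(( SECONDS + timeout ))
  until "$@"; do
    if (( SECONDS >= deadline )); then
      echo "timed out after ${timeout}s: $*" >&2
      return 1
    fi
    sleep "$interval"
  done
}

# Example (not run here; <pv-name> is a placeholder):
#   handle=$(och get pv <pv-name> -o jsonpath='{.spec.csi.volumeHandle}')
#   tag_is_set() {
#     aws efs list-tags-for-resource --resource-id "$1" \
#       --query "Tags[?Key=='ocp_storage_owned'].Value" --output text | grep -qx true
#   }
#   wait_for 600 30 tag_is_set "$(access_point_id "$handle")"
```

With the bug present, the wait is expected to time out, since the controller never manages to call TagResource.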
      
           

      Actual results:

         In step 6, the efs_tags_controller fails to sync the tags with the following error:
         E1120 19:21:11.678829       1 aws_efs_tags_controller.go:256] Error updating tags for PV fs-01e68e58f4765f3f8::fsap-0e4b293d057c3ae3b: operation error EFS: TagResource, exceeded maximum number of attempts, 3, get identity: get credentials: failed to refresh cached credentials, failed to retrieve credentials, operation error STS: AssumeRoleWithWebIdentity, exceeded maximum number of attempts, 3, https response error StatusCode: 0, RequestID: , request send failed, Post "https://sts.us-east-2.amazonaws.com/": dial tcp 52.95.18.19:443: i/o timeout

      Expected results:

        In step 6, the efs_tags_controller syncs the tags successfully.

      Additional info:

        $ och logs aws-efs-csi-driver-operator-7979d7bf9c-kpf69|grep -i 'error'
      ...
      E1120 19:21:11.678829       1 aws_efs_tags_controller.go:256] Error updating tags for PV fs-01e68e58f4765f3f8::fsap-0e4b293d057c3ae3b: operation error EFS: TagResource, exceeded maximum number of attempts, 3, get identity: get credentials: failed to refresh cached credentials, failed to retrieve credentials, operation error STS: AssumeRoleWithWebIdentity, exceeded maximum number of attempts, 3, https response error StatusCode: 0, RequestID: , request send failed, Post "https://sts.us-east-2.amazonaws.com/": dial tcp 52.95.18.19:443: i/o timeout
      E1120 19:21:11.678953       1 aws_efs_tags_queue_worker.go:99] Failed to update tags for volume pvc-b489a13a-9fa2-48ea-b3ba-d0994fcde46b: operation error EFS: TagResource, exceeded maximum number of attempts, 3, get identity: get credentials: failed to refresh cached credentials, failed to retrieve credentials, operation error STS: AssumeRoleWithWebIdentity, exceeded maximum number of attempts, 3, https response error StatusCode: 0, RequestID: , request send failed, Post "https://sts.us-east-2.amazonaws.com/": dial tcp 52.95.18.19:443: i/o timeout
      I1120 19:21:11.679290       1 event_expansion.go:147] "Request Body" body="{\"count\":2,\"lastTimestamp\":\"2025-11-20T19:21:11Z\",\"message\":\"Failed to update tags for volume pvc-b489a13a-9fa2-48ea-b3ba-d0994fcde46b: operation error EFS: TagResource, exceeded maximum number of attempts, 3, get identity: get credentials: failed to refresh cached credentials, failed to retrieve credentials, operation error STS: AssumeRoleWithWebIdentity, exceeded maximum number of attempts, 3, https response error StatusCode: 0, RequestID: , request send failed, Post \\\"https://sts.us-east-2.amazonaws.com/\\\": dial tcp 52.95.18.19:443: i/o timeout\"}"
      I1120 19:21:11.679399       1 event.go:377] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-cluster-csi-drivers", Name:"aws-efs-csi-driver-operator", UID:"8f92a95a-4f78-4e65-9865-fe1f8c49d902", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'EFSAccessPointTagsUpdateFailed' Failed to update tags for volume pvc-b489a13a-9fa2-48ea-b3ba-d0994fcde46b: operation error EFS: TagResource, exceeded maximum number of attempts, 3, get identity: get credentials: failed to refresh cached credentials, failed to retrieve credentials, operation error STS: AssumeRoleWithWebIdentity, exceeded maximum number of attempts, 3, https response error StatusCode: 0, RequestID: , request send failed, Post "https://sts.us-east-2.amazonaws.com/": dial tcp 52.95.18.19:443: i/o timeout
      I1120 19:21:11.690010       1 event_expansion.go:147] "Response Body" body="{\"kind\":\"Event\",\"apiVersion\":\"v1\",\"metadata\":{\"name\":\"aws-efs-csi-driver-operator.1879cd09d5f061de\",\"namespace\":\"openshift-cluster-csi-drivers\",\"uid\":\"03e0f294-6dc3-474b-a06b-35ab4dfb32a3\",\"resourceVersion\":\"180083\",\"creationTimestamp\":\"2025-11-20T19:16:11Z\",\"managedFields\":[{\"manager\":\"aws-efs-csi-driver-operator\",\"operation\":\"Update\",\"apiVersion\":\"v1\",\"time\":\"2025-11-20T19:21:11Z\",\"fieldsType\":\"FieldsV1\",\"fieldsV1\":{\"f:count\":{},\"f:firstTimestamp\":{},\"f:involvedObject\":{},\"f:lastTimestamp\":{},\"f:message\":{},\"f:reason\":{},\"f:reportingComponent\":{},\"f:source\":{\"f:component\":{}},\"f:type\":{}}}]},\"involvedObject\":{\"kind\":\"Deployment\",\"namespace\":\"openshift-cluster-csi-drivers\",\"name\":\"aws-efs-csi-driver-operator\",\"uid\":\"8f92a95a-4f78-4e65-9865-fe1f8c49d902\",\"apiVersion\":\"apps/v1\"},\"reason\":\"EFSAccessPointTagsUpdateFailed\",\"message\":\"Failed to update tags for volume pvc-b489a13a-9fa2-48ea-b3ba-d0994fcde46b: operation error EFS: TagResource, exceeded maximum number of attempts, 3, get identity: get credentials: fa [truncated 563 chars]"
      ...
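In the error chain above, the STS AssumeRoleWithWebIdentity call never receives an HTTP response (`StatusCode: 0`, `request send failed`, `dial tcp ...:443: i/o timeout`), so the failure happens at the network level before any IAM decision is made. That points at blocked egress rather than a wrong role ARN. A small bash sketch for triaging such logs follows, assuming the message format shown above; `failed_tag_handles` and `sts_dial_timeouts` are hypothetical helper names.

```bash
#!/usr/bin/env bash
# Hypothetical triage helpers; they parse aws-efs-csi-driver-operator log
# lines (in the format shown above) read from stdin.

# Print each distinct volume handle ("fs-...::fsap-...") whose tag update failed.
failed_tag_handles() {
  sed -n 's/.*Error updating tags for PV \([^:]*::[^:]*\): .*/\1/p' | sort -u
}

# Count log lines where the STS call got no HTTP response because the TCP
# connection to the endpoint timed out (a network-level failure).
sts_dial_timeouts() {
  grep -c 'AssumeRoleWithWebIdentity.*request send failed.*i/o timeout'
}

# Example (not run here):
#   och logs deploy/aws-efs-csi-driver-operator -n openshift-cluster-csi-drivers | failed_tag_handles
```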
       $ och get deploy/aws-efs-csi-driver-operator -oyaml|grep 'network'
                openshift.storage.network-policy.api-server: allow
                openshift.storage.network-policy.dns: allow
                openshift.storage.network-policy.operator-metrics-range: allow
       $ och get networkpolicy
      NAME                                      POD-SELECTOR                                                    AGE
      allow-all-egress                          openshift.storage.network-policy.all-egress=allow               9h
      allow-egress-to-api-server                openshift.storage.network-policy.api-server=allow               9h
      allow-ingress-to-metrics-range            openshift.storage.network-policy.metrics-range=allow            9h
      allow-ingress-to-operator-metrics-range   openshift.storage.network-policy.operator-metrics-range=allow   9h
      allow-to-dns                              openshift.storage.network-policy.dns=allow                      9h
      
      # Compare with ebs csi driver operator
      $ oc get deploy aws-ebs-csi-driver-operator -oyaml|grep 'network'
              openshift.storage.network-policy.all-egress: allow
              openshift.storage.network-policy.api-server: allow
              openshift.storage.network-policy.dns: allow
              openshift.storage.network-policy.operator-metrics-range: allow
                secretName: service-network-admin-kubeconfig
                secretName: service-network-admin-kubeconfig  
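The label comparison above can be automated: extract the `openshift.storage.network-policy.*: allow` entries from each operator Deployment's YAML and report those that the EBS operator has but the EFS operator lacks. This is a bash sketch; `np_labels` and `missing_np_labels` are hypothetical names, and it assumes the labels appear as `key: allow` lines, as in the grep output above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list the network-policy labels present on a reference
# Deployment manifest but missing from a target Deployment manifest.

# Extract distinct "openshift.storage.network-policy.<name>: allow" entries
# from a Deployment manifest file.
np_labels() {
  grep -o 'openshift\.storage\.network-policy\.[a-z-]*: allow' "$1" | sort -u
}

# Usage: missing_np_labels <reference.yaml> <target.yaml>
# comm -23 keeps lines found only in the first (sorted) input.
missing_np_labels() {
  comm -23 <(np_labels "$1") <(np_labels "$2")
}

# Example (not run here):
#   oc get deploy aws-ebs-csi-driver-operator -oyaml > ebs.yaml
#   och get deploy/aws-efs-csi-driver-operator -oyaml > efs.yaml
#   missing_np_labels ebs.yaml efs.yaml
```

For the outputs shown above, this would report only the `all-egress` label.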
      
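      I manually edited the CSV to add the "openshift.storage.network-policy.all-egress: allow" label to the operator Deployment, and the connection issue was resolved. The EFS CSI driver operator should carry the all-egress: allow label as well, so that the existing allow-all-egress NetworkPolicy applies to it as it does to the EBS CSI driver operator.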
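As a temporary workaround that mirrors the manual CSV edit described above, the label can be added to the pod template inside the operator's ClusterServiceVersion with an RFC 6902 JSON patch; OLM then rolls the change out to the Deployment. This is a sketch under stated assumptions: the operator Deployment is at index 0 under `spec.install.spec.deployments`, its pod template already has a `labels` map, the CSV lives in `openshift-cluster-csi-drivers`, and `efs_all_egress_patch` and `<csv-name>` are hypothetical. The lasting fix is for the operator to ship this label itself, as proposed above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print a JSON patch that adds the all-egress
# NetworkPolicy label to a Deployment pod template embedded in a CSV.
# Assumptions: the Deployment is at the given index (default 0) under
# .spec.install.spec.deployments, and its pod template labels map exists.
efs_all_egress_patch() {
  local index="${1:-0}"
  # The label key contains no "/" or "~", so no JSON Pointer escaping is needed.
  local key='openshift.storage.network-policy.all-egress'
  printf '[{"op":"add","path":"/spec/install/spec/deployments/%s/spec/template/metadata/labels/%s","value":"allow"}]\n' \
    "$index" "$key"
}

# Example (not run here; <csv-name> is a placeholder):
#   och patch csv <csv-name> -n openshift-cluster-csi-drivers --type=json \
#     -p "$(efs_all_egress_patch)"
```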

              rhn-support-pewang Penghao Wang
              rhn-support-pewang Penghao Wang
              Wei Duan
              Votes: 0
              Watchers: 2