  1. OpenShift Bugs
  2. OCPBUGS-45937

aws-sdk-go-v2 fails to authenticate AssumeRoleWithWebIdentity on AWS STS clusters


    • Critical
    • Yes
    • OAPE Sprint 263
    • 1
    • Proposed
    • False
    • Add a default region to the aws pod-identity-webhook.
    • Bug Fix
    • In Progress

      This is a clone of issue OCPBUGS-41727. The following is the description of the original issue:

      Original bug title:

      cert-manager [v1.15 Regression] Failed to issue certs with ACME Route53 dns01 solver in AWS STS env

      Description of problem:

          When Route53 is used as the dns01 solver to create certificates, issuance fails in both automated and manual tests. For the full log, see the "Actual results" section.

      Version-Release number of selected component (if applicable):

          cert-manager operator v1.15.0 staging build

      How reproducible:

          Always

      Steps to Reproduce: also documented in gist

          1. Install the cert-manager operator 1.15.0.
          2. Follow the documentation to authenticate the operator with AWS STS using ccoctl: https://docs.openshift.com/container-platform/4.16/security/cert_manager_operator/cert-manager-authenticate.html#cert-manager-configure-cloud-credentials-aws-sts_cert-manager-authenticate
          3. Create an ACME issuer with the Route53 dns01 solver.
          4. Create a certificate using the created issuer.
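Steps 3 and 4 could look roughly like the sketch below. It is not taken from the original report: the use of a ClusterIssuer (rather than a namespaced Issuer), the Let's Encrypt staging endpoint, and every name, email address, region, hosted zone ID, and DNS name are placeholder assumptions to replace with real values.

```shell
# All values below are hypothetical placeholders, not taken from the bug report.
ISSUER_NAME="letsencrypt-dns01"
ACME_EMAIL="admin@example.com"
AWS_ZONE_REGION="us-east-2"
HOSTED_ZONE_ID="Z0000000EXAMPLE"
DNS_NAME="test.example.com"

# Print a ClusterIssuer manifest that uses the Route53 dns01 solver
# with ambient (STS) credentials, so no access key is referenced.
render_issuer() {
  cat <<EOF
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: ${ISSUER_NAME}
spec:
  acme:
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: ${ACME_EMAIL}
    privateKeySecretRef:
      name: ${ISSUER_NAME}-account-key
    solvers:
    - dns01:
        route53:
          region: ${AWS_ZONE_REGION}
          hostedZoneID: ${HOSTED_ZONE_ID}
EOF
}

# Print a Certificate manifest that requests a cert from the issuer above.
render_certificate() {
  cat <<EOF
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: dns01-test
  namespace: default
spec:
  secretName: dns01-test-tls
  dnsNames:
  - ${DNS_NAME}
  issuerRef:
    name: ${ISSUER_NAME}
    kind: ClusterIssuer
EOF
}

# To apply against a cluster (requires a logged-in oc session):
#   render_issuer | oc apply -f -
#   render_certificate | oc apply -f -
```

Keeping the manifests in functions makes it possible to review the rendered YAML before applying anything to the cluster.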

      OR:

      Reproduce by running `/pj-rehearse pull-ci-openshift-cert-manager-operator-master-e2e-operator-aws-sts` on https://github.com/openshift/release/pull/59568

      Actual results:

      1. The certificate is not Ready.
      2. The challenge of the cert is stuck in the pending status:
      
      PresentError: Error presenting challenge: failed to change Route 53 record set: operation error Route 53: ChangeResourceRecordSets, get identity: get credentials: failed to refresh cached credentials, failed to retrieve credentials, operation error STS: AssumeRoleWithWebIdentity, failed to resolve service endpoint, endpoint rule error, Invalid Configuration: Missing Region  
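To confirm that a cluster is hitting this specific failure, the reason recorded on each pending challenge can be listed and matched against the error text above. The following is a minimal sketch; the helper names `list_challenge_reasons` and `is_missing_region_error` are hypothetical, and the match is a plain substring check on the message quoted in this report.

```shell
# Print "namespace/name<TAB>reason" for every ACME challenge.
# Requires a logged-in oc session; not executed when this file is sourced.
list_challenge_reasons() {
  oc get challenges.acme.cert-manager.io --all-namespaces \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{"\t"}{.status.reason}{"\n"}{end}'
}

# Return 0 when the message ($1) mentions AssumeRoleWithWebIdentity
# followed by "Missing Region", and 1 otherwise.
is_missing_region_error() {
  case "$1" in
    *AssumeRoleWithWebIdentity*"Missing Region"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example usage against a live cluster:
#   list_challenge_reasons | while IFS= read -r line; do
#     if is_missing_region_error "$line"; then echo "affected: $line"; fi
#   done
```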

      Expected results:

      The certificate should be Ready. The challenge should succeed.

      Additional info:

      The only way to get it working again seems to be injecting the "AWS_REGION" environment variable into the controller pod. See upstream discussion/change:

      I couldn't find a way to inject the env var into our operator-managed operands, so I only verified this workaround using the upstream build v1.15.3. After applying the patch with the following command, the challenge succeeded and the certificate became Ready.

      oc patch deployment cert-manager -n cert-manager \
      --patch '{"spec": {"template": {"spec": {"containers": [{"name": "cert-manager-controller", "env": [{"name": "AWS_REGION", "value": "aws-global"}]}]}}}}' 
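The same workaround can be wrapped in small helpers so that the region value is a parameter and the result can be checked afterwards. This is a sketch under assumptions: the function names are hypothetical, and the deployment, namespace, and container names are the ones used in the command above.

```shell
# Build the patch JSON that sets AWS_REGION on one container.
# $1 = container name, $2 = value for AWS_REGION
build_region_patch() {
  printf '{"spec": {"template": {"spec": {"containers": [{"name": "%s", "env": [{"name": "AWS_REGION", "value": "%s"}]}]}}}}' "$1" "$2"
}

# Apply the patch to the cert-manager controller deployment.
# $1 = value for AWS_REGION; requires a logged-in oc session.
apply_region_patch() {
  oc patch deployment cert-manager -n cert-manager \
    --patch "$(build_region_patch cert-manager-controller "$1")"
}

# List the environment variables currently set on the deployment,
# to confirm that AWS_REGION is present after patching.
show_controller_env() {
  oc set env deployment/cert-manager -n cert-manager --list
}

# Example usage against a live cluster:
#   apply_region_patch aws-global
#   show_controller_env
```

As noted above, this was only verified on the upstream build; on an operator-managed deployment the change may not persist.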

              jstuever@redhat.com Jeremiah Stuever
              openshift-crt-jira-prow OpenShift Prow Bot
              Jianping Shu Jianping Shu
              Swarup Ghosh