OpenShift Bugs: OCPBUGS-8665

cert-manager does not work with "Managed Identity Using AAD Pod Identities"


    • Bug
    • Resolution: Done-Errata
    • Minor
    • None
    • 4.13
    • cert-manager
    • Critical
    • No
    • CFE Sprint 234, CFE Sprint 236, CFE Sprint 244
    • 3
    • Rejected
    • False

      Description of problem:

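      cert-manager does not work with the "Managed Identity Using AAD Pod Identities" configuration described at https://cert-manager.io/docs/configuration/acme/dns01/azuredns/#managed-identity-using-aad-pod-identities . The DNS-01 challenge stays pending because the token request to http://169.254.169.254 fails with "connection refused".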
      

      Version-Release number of selected component (if applicable):

      cert-manager installed with cert-manager-operator-bundle-container-v1.10.2-18 on ipi-on-azure OCP env of payload 4.13.0-0.nightly-2023-03-04-092801
      

      How reproducible:

      Always
      

      Steps to Reproduce:

      1. Launch ipi-on-azure OCP env. Install cert-manager operator of bundle v1.10.2-18 on it.
      2. Follow https://cert-manager.io/docs/configuration/acme/dns01/azuredns/#managed-identity-using-aad-pod-identities . The step-by-step details below are based on the "Example creation using azure-cli and jq" section of that page.

      2.1
      Choose a unique identity name and a resource group in which to create the identity.
      $ IDENTITY_GROUP=xxia<snipped>-rg
      $ az group create -l westus -n $IDENTITY_GROUP
      
      $ IDENTITY_NAME=xxia<snipped>-test
      $ az identity create --name $IDENTITY_NAME --resource-group $IDENTITY_GROUP --output json > output/az-identity-create--name--resource-group.json
      
      $ IDENTITY="$(cat output/az-identity-create--name--resource-group.json)"
      
      Get the principalId to use for the role assignment:
      $ PRINCIPAL_ID=$(echo $IDENTITY | jq -r '.principalId')
      
      Get the clientId and resource ID used for the identity binding:
      $ CLIENT_ID=$(echo $IDENTITY | jq -r '.clientId')
      $ RESOURCE_ID=$(echo $IDENTITY | jq -r '.id')
      
      $ ZONE_NAME=qe1.azure.devcluster.openshift.com
      $ ZONE_GROUP=<snipped>
      
      Get existing DNS Zone Id
      $ ZONE_ID=$(az network dns zone show --name $ZONE_NAME --resource-group $ZONE_GROUP --query "id" -o tsv)
      
      Create role assignment
      $ az role assignment create --role "DNS Zone Contributor" --assignee $PRINCIPAL_ID --scope $ZONE_ID > output/az-role-assignment-create--assignee_for-pod-aad-test.json
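
Every later step depends on the variables captured above. `jq -r` prints `null` for a missing key and an unset variable expands to an empty string, so a failure in an earlier command may only surface later as a confusing az error. A minimal sketch of a guard that could run before the role assignment; `require_vars` is a hypothetical helper, not part of azure-cli or the cert-manager documentation:

```shell
# Hypothetical helper: fail fast when any named variable is unset, empty,
# or the literal string "null" (what `jq -r` prints for a missing key).
require_vars() {
  for name in "$@"; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ] || [ "$value" = "null" ]; then
      echo "variable $name is empty or null" >&2
      return 1
    fi
  done
}
```

It could be called as `require_vars PRINCIPAL_ID CLIENT_ID RESOURCE_ID ZONE_ID || exit 1` right before `az role assignment create`.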
      

      Next, install AAD Pod Identity, which provides the CRDs and deployment required to assign the identity. Per https://azure.github.io/aad-pod-identity/docs/configure/deploy_in_openshift/ , the /etc/kubernetes/azure.json file that AAD Pod Identity normally relies on does not exist in an OCP cluster, so AAD Pod Identity must be deployed with a managed identity to provide access to Azure. The steps below follow https://azure.github.io/aad-pod-identity/docs/configure/pod_identity_in_managed_mode :

      2.2
      $ wget https://raw.githubusercontent.com/Azure/aad-pod-identity/master/deploy/infra/managed-mode-deployment.yaml
      
      The document says "This installs NMI in managed mode in the kube-system namespace". For this to succeed on OCP, first label the namespace:
      $ oc label ns/kube-system pod-security.kubernetes.io/enforce=privileged pod-security.kubernetes.io/audit=privileged pod-security.kubernetes.io/warn=privileged --overwrite
      
      The document also gives a command "to assign the identity to the VM" (https://azure.github.io/aad-pod-identity/docs/getting-started/role-assignment/#user-assigned-managed-identities-for-self-managed-clusters says the same). Run it for every node VM:
      $ for i in xxia-05az-f55w4-worker-westus-zdkk6 xxia-05az-f55w4-worker-westus-pblrl xxia-05az-f55w4-worker-westus-bxbvk xxia-05az-f55w4-master-0 xxia-05az-f55w4-master-1 xxia-05az-f55w4-master-2
      do
        az vm identity assign -g xxia-05az-f55w4-rg -n $i --identities "/subscriptions/snipped-subscription-id/resourcegroups/xxia-snipped-rg/providers/Microsoft.ManagedIdentity/userAssignedIdentities/xxia-snipped-test"
      done
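
The `--identities` value in the loop above has the same shape as the identity resource ID captured as `$RESOURCE_ID` in step 2.1. As a sketch only, the ID could be assembled from its parts and the node list could be queried instead of typed by hand. Both function names are hypothetical, and the second function assumes the cluster resource group contains only the cluster's node VMs:

```shell
# Hypothetical helper: assemble a user-assigned identity resource ID in the
# same shape as the --identities value used in step 2.2.
identity_resource_id() {
  # $1 = subscription ID, $2 = identity resource group, $3 = identity name
  printf '/subscriptions/%s/resourcegroups/%s/providers/Microsoft.ManagedIdentity/userAssignedIdentities/%s\n' "$1" "$2" "$3"
}

# Hypothetical helper: assign the identity to every VM in the cluster
# resource group. Assumes that group holds only the cluster's node VMs.
assign_identity_to_all_vms() {
  # $1 = cluster resource group, $2 = identity resource ID
  az vm list -g "$1" --query "[].name" -o tsv | while read -r vm; do
    az vm identity assign -g "$1" -n "$vm" --identities "$2"
  done
}
```

For example, `assign_identity_to_all_vms xxia-05az-f55w4-rg "$RESOURCE_ID"` would cover the six VMs listed in the loop.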
      
      $ oc create -f managed-mode-deployment.yaml
      serviceaccount/aad-pod-id-nmi-service-account created
      customresourcedefinition.apiextensions.k8s.io/azureidentities.aadpodidentity.k8s.io created
      customresourcedefinition.apiextensions.k8s.io/azureidentitybindings.aadpodidentity.k8s.io created
      customresourcedefinition.apiextensions.k8s.io/azurepodidentityexceptions.aadpodidentity.k8s.io created
      clusterrole.rbac.authorization.k8s.io/aad-pod-id-nmi-role created
      clusterrolebinding.rbac.authorization.k8s.io/aad-pod-id-nmi-binding created
      daemonset.apps/nmi created
      
      $ oc adm policy add-scc-to-user privileged -z aad-pod-id-nmi-service-account -n kube-system
      
      $ oc get po -n kube-system
      NAME        READY   STATUS    RESTARTS   AGE
      nmi-76rnh   1/1     Running   0          15s
      nmi-dvw2j   1/1     Running   0          15s
      nmi-hkzll   1/1     Running   0          15s
      nmi-n7xc6   1/1     Running   0          15s
      nmi-p9v9x   1/1     Running   0          15s
      nmi-ps2hh   1/1     Running   0          15s
      

      Now we can create the identity resource and binding using the below manifest (copied from https://cert-manager.io/docs/configuration/acme/dns01/azuredns/#managed-identity-using-aad-pod-identities)

      2.3
      $ cat azureidentity-resources.yaml 
      apiVersion: "aadpodidentity.k8s.io/v1"
      kind: AzureIdentity
      metadata:
        annotations:
          aadpodidentity.k8s.io/Behavior: namespaced
        name: certman-identity
        namespace: cert-manager
      spec:
        type: 0
        resourceID: snipped # Resource Id From previous step
        clientID: snipped # Client Id from previous step
      ---
      apiVersion: "aadpodidentity.k8s.io/v1"
      kind: AzureIdentityBinding
      metadata:
        name: certman-id-binding
        namespace: cert-manager
      spec:
        azureIdentity: certman-identity
        selector: certman-label # This is the label that needs to be set on cert-manager pods
      
      $ oc create -f azureidentity-resources.yaml
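
To avoid pasting the resource ID and client ID into the manifest by hand, the same two resources could be rendered from the variables set in step 2.1. This is a sketch under that assumption; `render_azure_identity` is a hypothetical helper, and the manifest body is the one shown above with the two IDs substituted:

```shell
# Hypothetical helper: print the AzureIdentity and AzureIdentityBinding
# manifests with the IDs from step 2.1 filled in.
render_azure_identity() {
  # $1 = identity resource ID, $2 = identity client ID
  cat <<EOF
apiVersion: "aadpodidentity.k8s.io/v1"
kind: AzureIdentity
metadata:
  annotations:
    aadpodidentity.k8s.io/Behavior: namespaced
  name: certman-identity
  namespace: cert-manager
spec:
  type: 0
  resourceID: $1
  clientID: $2
---
apiVersion: "aadpodidentity.k8s.io/v1"
kind: AzureIdentityBinding
metadata:
  name: certman-id-binding
  namespace: cert-manager
spec:
  azureIdentity: certman-identity
  selector: certman-label
EOF
}
```

Usage would be `render_azure_identity "$RESOURCE_ID" "$CLIENT_ID" | oc create -f -`.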
      

      Next, the cert-manager pod needs the label that the pod identity binding selects. This can be done by editing the deployment and adding the below into the .spec.template.metadata.labels field (per https://cert-manager.io/docs/configuration/acme/dns01/azuredns/#managed-identity-using-aad-pod-identities)

      2.4
      $ oc edit deployment cert-manager -n cert-manager
      spec:
        template:
          metadata:
            labels:
              aadpodidbinding: certman-label # must match previous step's "selector"
      

      Note: because the cert-manager deployment is managed by the cert-manager operator, a direct edit is automatically reverted. So first run "oc edit certmanager cluster" and change managementState to "Unmanaged", then edit the deployment. This is reported separately in https://issues.redhat.com/browse/OCPBUGS-8466
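
The two interactive edits in this step could also be expressed as non-interactive patches, which makes the reproduction easier to script. The sketch below is an assumption-laden equivalent, not what was actually run: the function names are hypothetical, and it assumes managementState lives at .spec.managementState of the certmanager resource named cluster.

```shell
# Hypothetical helper: build the JSON merge patch that adds the
# aadpodidbinding label to the pod template.
label_patch() {
  # $1 = selector value from the AzureIdentityBinding
  printf '{"spec":{"template":{"metadata":{"labels":{"aadpodidbinding":"%s"}}}}}\n' "$1"
}

# Hypothetical wrapper around the manual steps: stop the operator from
# reverting the deployment, add the label, then wait for the rollout.
# Assumes managementState is at .spec.managementState.
label_cert_manager() {
  oc patch certmanager cluster --type=merge -p '{"spec":{"managementState":"Unmanaged"}}'
  oc patch deployment cert-manager -n cert-manager --type=merge -p "$(label_patch "$1")"
  oc rollout status deployment/cert-manager -n cert-manager
}
```

Calling `label_cert_manager certman-label` would then stand in for both "oc edit" sessions.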

      2.5 Wait for the cert-manager pod to be recreated
      $ oc get po -n cert-manager
      NAME                                       READY   STATUS    RESTARTS   AGE
      cert-manager-697967c4b7-5kslq              1/1     Running   1          2m
      ...
      

      Create the ClusterIssuer (based on the example in https://cert-manager.io/docs/configuration/acme/dns01/azuredns/#managed-identity-using-aad-pod-identities)

      2.6
      $ cat clusterissuer-acme-dns01-azuredns-aad-pod-identity.yaml
      apiVersion: cert-manager.io/v1
      kind: ClusterIssuer
      metadata:
        name: use-aad-pod-identity
      spec:
        acme:
          preferredChain: ""
          privateKeySecretRef:
            name: letsencrypt
          server: https://acme-staging-v02.api.letsencrypt.org/directory
          solvers:
          - dns01:
              azureDNS:
                subscriptionID: snipped
                resourceGroupName: snipped
                hostedZoneName: qe1.azure.devcluster.openshift.com
                environment: AzurePublicCloud
      
      $ oc create -f clusterissuer-acme-dns01-azuredns-aad-pod-identity.yaml
      $ oc get clusterissuer -o wide
      NAME                   READY   STATUS                                                 AGE
      use-aad-pod-identity   True    The ACME account was registered with the ACME server   3m
      

      Create the certificate

      3.
      $ oc login -u snipped -p snipped
      $ oc new-project xxia-proj-3
      $ cat cert-from-issuer-with-aad-pod-identity.yaml 
      apiVersion: cert-manager.io/v1
      kind: Certificate
      metadata:
        name: cert4-from-issuer-with-aad-pod-identity
      spec:
        secretName: cert4-from-issuer-with-aad-pod-identity
        issuerRef:
          kind: ClusterIssuer
          name: use-aad-pod-identity
        dnsNames:
        - xxia-test-4.qe1.azure.devcluster.openshift.com
        - '*.xxia-test-4.qe1.azure.devcluster.openshift.com'
      
      $ oc create -f cert-from-issuer-with-aad-pod-identity.yaml
      

      Check the certificate

      4.
      $ oc get cert
      NAME                                      READY   SECRET                                    AGE
      cert4-from-issuer-with-aad-pod-identity   False   cert4-from-issuer-with-aad-pod-identity   51m
      
      $ oc get challenge
      NAME                                                              STATE     DOMAIN                                           AGE
      cert4-from-issuer-with-aad-pod-identity-64487-725540-3403810812             xxia-test-4.qe1.azure.devcluster.openshift.com   52m
      cert4-from-issuer-with-aad-pod-identity-64487-725540-4176589525   pending   xxia-test-4.qe1.azure.devcluster.openshift.com   52m
      
      $ oc get order -o wide
      NAME                                                    STATE     ISSUER                 REASON   AGE
      cert4-from-issuer-with-aad-pod-identity-64487-7255405   pending   use-aad-pod-identity            52m
      
      $ oc get challenge cert4-from-issuer-with-aad-pod-identity-64487-725540-4176589525 -o yaml
      ...
      status:
        presented: false
        processing: true
        reason: 'azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for
          request to https://management.azure.com/subscriptions/snipped-subscription-id/resourceGroups/snipped-dns-zone-resource-group/providers/Microsoft.Network/dnsZones/qe1.azure.devcluster.openshift.com/TXT/_acme-challenge.xxia-test-4?api-version=2017-10-01:
          StatusCode=0 -- Original Error: adal: Failed to execute the refresh request. Error
          = ''Get "http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&mi_res_id=%2Fsubscriptions%2Fsnipped-subscription-id%2Fresourcegroups%2Fxxia-snipped-rg%2Fproviders%2FMicrosoft.ManagedIdentity%2FuserAssignedIdentities%2Fxxia-snipped-test&resource=https%3A%2F%2Fmanagement.core.windows.net%2F":
          dial tcp 169.254.169.254:80: connect: connection refused'''
        state: pending
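
The refused request in the status above is a plain HTTP GET to the instance metadata endpoint, so it could be replayed by hand to narrow down where the connection is refused. The sketch below only rebuilds the URL from the error message; `urlencode` and `imds_token_url` are hypothetical helpers, and whether a suitable pod image provides curl is an assumption:

```shell
# Hypothetical helper: percent-encode a string for use in a query parameter.
urlencode() {
  s=$1
  out=
  while [ -n "$s" ]; do
    rest=${s#?}          # everything after the first character
    c=${s%"$rest"}       # the first character itself
    case $c in
      [A-Za-z0-9._~-]) out=$out$c ;;
      *) out=$out$(printf '%%%02X' "'$c") ;;
    esac
    s=$rest
  done
  printf '%s\n' "$out"
}

# Hypothetical helper: rebuild the token URL seen in the challenge status.
imds_token_url() {
  # $1 = identity resource ID (mi_res_id), $2 = token audience (resource)
  printf 'http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&mi_res_id=%s&resource=%s\n' "$(urlencode "$1")" "$(urlencode "$2")"
}
```

From a pod that carries the aadpodidbinding label, something like `curl -s -H Metadata:true "$(imds_token_url "$RESOURCE_ID" https://management.core.windows.net/)"` would issue the same request that cert-manager makes. Comparing the result with the same request from a pod without the label might show whether the refusal is specific to labeled pods.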
      

      Check logs

      5.
      The nmi pods from the previous step show no errors in their logs:
      $ oc logs --timestamps nmi-pod-name-in-previous-step -n kube-system --context admin
      2023-03-08T03:04:56.849421039Z I0308 03:04:56.849330       1 main.go:90] starting nmi process. Version: v1.8.14. Build date: 2022-12-13-18:34.
      2023-03-08T03:04:56.849421039Z I0308 03:04:56.849385       1 main.go:103] features for scale clusters enabled
      2023-03-08T03:04:57.152738814Z I0308 03:04:57.152669       1 crd.go:448] CRD lite informers started 
      2023-03-08T03:04:57.253938539Z I0308 03:04:57.253810       1 main.go:117] running NMI in namespaced mode: true
      2023-03-08T03:04:57.253938539Z I0308 03:04:57.253856       1 nmi.go:53] initializing in managed mode
      2023-03-08T03:04:57.253938539Z I0308 03:04:57.253867       1 probes.go:41] initialized health probe on port 8085
      2023-03-08T03:04:57.253938539Z I0308 03:04:57.253878       1 probes.go:44] started health probe
      2023-03-08T03:04:57.254020639Z I0308 03:04:57.253986       1 metrics.go:341] registered views for metric
      2023-03-08T03:04:57.254170339Z I0308 03:04:57.254128       1 prometheus_exporter.go:21] starting Prometheus exporter
      2023-03-08T03:04:57.254170339Z I0308 03:04:57.254156       1 metrics.go:347] registered and exported metrics on port 9090
      2023-03-08T03:04:57.254425339Z I0308 03:04:57.254314       1 server.go:127] listening on 127.0.0.1:2579
      2023-03-08T03:04:57.358128765Z W0308 03:04:57.358081       1 iptables.go:123] flushing iptables to add aad-metadata custom chains
      
      The cert-manager pod logs show the following errors:
      $ oc logs --timestamps cert-manager-697967c4b7-5kslq --context admin -n cert-manager
      ...
      2023-03-08T04:37:55.066624559Z E0308 04:37:55.066544       1 azuredns.go:157] cert-manager/azure-dns "msg"="Error creating TXT:" "error"="azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to https://management.azure.com/subscriptions/snipped-subscription-id/resourceGroups/snipped-dns-zone-resource-group/providers/Microsoft.Network/dnsZones/qe1.azure.devcluster.openshift.com/TXT/_acme-challenge.xxia-test-4?api-version=2017-10-01: StatusCode=0 -- Original Error: adal: Failed to execute the refresh request. Error = 'Get \"http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&mi_res_id=%2Fsubscriptions%2Fsnipped-subscription-id%2Fresourcegroups%2Fxxia-snipped-rg%2Fproviders%2FMicrosoft.ManagedIdentity%2FuserAssignedIdentities%2Fxxia-snipped-test&resource=https%3A%2F%2Fmanagement.core.windows.net%2F\": dial tcp 169.254.169.254:80: connect: connection refused'" "qe1.azure.devcluster.openshift.com"="(MISSING)"
      2023-03-08T04:37:55.067033060Z E0308 04:37:55.066992       1 controller.go:167] cert-manager/challenges "msg"="re-queuing item due to error processing" "error"="azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to https://management.azure.com/subscriptions/snipped-subscription-id/resourceGroups/snipped-dns-zone-resource-group/providers/Microsoft.Network/dnsZones/qe1.azure.devcluster.openshift.com/TXT/_acme-challenge.xxia-test-4?api-version=2017-10-01: StatusCode=0 -- Original Error: adal: Failed to execute the refresh request. Error = 'Get \"http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&mi_res_id=%2Fsubscriptions%2Fsnipped-subscription-id%2Fresourcegroups%2Fxxia-snipped-rg%2Fproviders%2FMicrosoft.ManagedIdentity%2FuserAssignedIdentities%2Fxxia-snipped-test&resource=https%3A%2F%2Fmanagement.core.windows.net%2F\": dial tcp 169.254.169.254:80: connect: connection refused'" "key"="xxia-proj-3/cert4-from-issuer-with-aad-pod-identity-64487-725540-4176589525"
      2023-03-08T04:37:55.067163860Z I0308 04:37:55.067108       1 azuredns.go:87] cert-manager "msg"="No ClientID found:  authenticating azuredns with managed identity (MSI)"
      ...
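
The error lines above are long, and the relevant detail is the address that refused the connection. A small filter could pull that out of a full log; `refused_endpoints` is a hypothetical helper that relies only on the "dial tcp ...: connect: connection refused" wording visible in these logs:

```shell
# Hypothetical helper: read log lines on stdin and print each distinct
# address that refused a TCP connection.
refused_endpoints() {
  sed -n 's/.*dial tcp \([^ ]*\): connect: connection refused.*/\1/p' | sort -u
}
```

For example: `oc logs cert-manager-697967c4b7-5kslq -n cert-manager --context admin | refused_endpoints`.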
      

      Actual results:

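      cert-manager does not work with "Managed Identity Using AAD Pod Identities", as detailed in the steps above: the certificate never becomes Ready and the DNS-01 challenge stays pending with a "connection refused" error from 169.254.169.254:80.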
      

      Expected results:

      cert-manager should work with "Managed Identity Using AAD Pod Identities": the challenge should be presented and the certificate should become Ready.
      

      Additional info:
      While debugging, I found https://github.com/cert-manager/cert-manager/issues/3148 and then tried adding managedIdentity to the ClusterIssuer as below:

      ...
          solvers:
          - dns01:
              azureDNS:
                environment: AzurePublicCloud
                hostedZoneName: qe1.azure.devcluster.openshift.com
                managedIdentity:
                  resourceID: snipped-RESOURCE_ID-in-previous-step
                resourceGroupName: snipped
                subscriptionID: snipped
      

      This does not resolve the problem; it still reproduces.

              swghosh@redhat.com Swarup Ghosh
              xxia-1 Xingxing Xia
              Yuedong Wu Yuedong Wu
              Shubha Narayanan Shubha Narayanan
              Thejas N (Inactive)

                  Estimated: Not Specified
                  Remaining: 0m
                  Logged: 4h