OCPBUGS-13056

awsendpointservice stuck deleting due to invalid STS token


    • Important
    • No
    • Hypershift Sprint 236
    • 1
    • Rejected
    • False

      Description of problem:

We have seen hosted cluster uninstalls get stuck due to an awsendpointservice stuck in deletion.
      
      The symptom is that the hostedcontrolplane namespace remains Active while the hosted cluster namespace is stuck Terminating:
      ❯ k get ns | grep 23fr3262c63nbrjvqbdjh9pqvhaa19dh                                
      ocm-production-23fr3262c63nbrjvqbdjh9pqvhaa19dh                   Terminating   9h
      ocm-production-23fr3262c63nbrjvqbdjh9pqvhaa19dh-zx4ka-8sy24-wr3   Active        9h 
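      
      One way to see what is blocking the Terminating namespace is to read its deletion conditions and enumerate any leftover namespaced objects (a diagnostic sketch; kubectl is the same client as the k alias used above):
      
      # The NamespaceContentRemaining / NamespaceFinalizersRemaining conditions
      # name the resources and finalizers still holding the namespace.
      ❯ kubectl get namespace ocm-production-23fr3262c63nbrjvqbdjh9pqvhaa19dh \
          -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.message}{"\n"}{end}'
      
      # List every namespaced object that still exists in the namespace.
      ❯ kubectl api-resources --verbs=list --namespaced -o name \
          | xargs -n1 kubectl get --show-kind --ignore-not-found \
              -n ocm-production-23fr3262c63nbrjvqbdjh9pqvhaa19dh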
      
Most of the hostedcontrolplane pods have been deleted, including the HCCO pod, leaving:
      ❯ k get po -n ocm-production-23fr3262c63nbrjvqbdjh9pqvhaa19dh-zx4ka-8sy24-wr3                                                   
      NAME                                                     READY   STATUS              RESTARTS          AGE
      audit-webhook-585fbf86d6-5g99m                           2/2     Running             0                 9h
      audit-webhook-585fbf86d6-rq7kc                           2/2     Running             0                 9h
      aws-ebs-csi-driver-controller-5969bf4d58-2bv4g           7/7     Running             4 (8h ago)        9h
      aws-ebs-csi-driver-operator-64cd745fbd-686p7             1/1     Running             0                 9h
      capi-provider-6f95885d87-d6lb7                           2/2     Running             0                 9h
      cloud-network-config-controller-55c659db46-qk99w         3/3     Running             1 (8h ago)        9h
      cluster-api-59bb8bc5fb-ckn6h                             1/1     Running             0                 9h
      control-plane-operator-64b5646bd9-s5v9p                  2/2     Running             0                 9h
      csi-snapshot-controller-77f8db785-fhvkt                  0/1     CrashLoopBackOff    100 (2m55s ago)   9h
      csi-snapshot-webhook-5c8b88dc97-cmztv                    1/1     Running             0                 9h
      metrics-forwarder-deployment-65db577ff6-6lmv6            0/1     ContainerCreating   0                 7h50m
      metrics-forwarder-secret-ensurer-28052163-mrwnw          0/1     Completed           0                 3m3s
      metrics-forwarder-secret-ensurer-28052164-5v4ln          0/1     Completed           0                 2m3s
      metrics-forwarder-secret-ensurer-28052165-l9bpr          0/1     Completed           0                 63s
      metrics-forwarder-secret-ensurer-28052166-ggtkh          0/1     Completed           0                 3s
      multus-admission-controller-d4b68d656-jhzx6              2/2     Running             0                 9h
      multus-admission-controller-d4b68d656-jz7pz              2/2     Running             0                 9h
      ovnkube-master-0                                         7/7     Running             0                 9h
      ovnkube-master-1                                         7/7     Running             1 (8h ago)        9h
      ovnkube-master-2                                         7/7     Running             0                 9h
      package-operator-remote-phase-manager-66d67b964b-wkm9p   1/1     Running             0                 9h
      validating-webhook-patching-job-28052100-272mt           0/1     Completed           0                 66m
      validating-webhook-patching-job-28052130-tm5s8           0/1     Completed           0                 36m
      validating-webhook-patching-job-28052160-jbk92           0/1     Completed           0                 6m3s
      validation-webhook-6f59c74578-qf4dp                      1/1     Running             0                 9h
      validation-webhook-6f59c74578-vlzx8                      1/1     Running             0                 9h
      
but the control plane operator pod is unable to clean up the awsendpointservice. In one case the token had expired:
      
      {"level":"info","ts":"2023-05-03T15:59:50Z","msg":"reconciling","controller":"awsendpointservice","controllerGroup":"hypershift.openshift.io","controllerKind":"AWSEndpointService","aWSEndpointService":{"name":"private-router","namespace":"ocm-production-23fr3262c63nbrjvqbdjh9pqvhaa19dh-zx4ka-8sy24-wr3"},"namespace":"ocm-production-23fr3262c63nbrjvqbdjh9pqvhaa19dh-zx4ka-8sy24-wr3","name":"private-router","reconcileID":"4c52fc93-5e41-4580-b471-04569eb094f9"}
      {"level":"error","ts":"2023-05-03T15:59:50Z","msg":"Reconciler error","controller":"awsendpointservice","controllerGroup":"hypershift.openshift.io","controllerKind":"AWSEndpointService","aWSEndpointService":{"name":"private-router","namespace":"ocm-production-23fr3262c63nbrjvqbdjh9pqvhaa19dh-zx4ka-8sy24-wr3"},"namespace":"ocm-production-23fr3262c63nbrjvqbdjh9pqvhaa19dh-zx4ka-8sy24-wr3","name":"private-router","reconcileID":"4c52fc93-5e41-4580-b471-04569eb094f9","error":"failed to delete resource: WebIdentityErr: failed to retrieve credentials\ncaused by: ExpiredTokenException: Token expired: current date/time 1683129290 must be before the expiration date/time1683102164\n\tstatus code: 400, request id: 88412bad-f912-41c4-af40-a3b62012cb1e","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:234"} 
      
and in another case the token file was missing entirely:
      {"level":"error","ts":"2023-05-03T16:29:55Z","msg":"Reconciler error","controller":"awsendpointservice","controllerGroup":"hypershift.openshift.io","controllerKind":"AWSEndpointService","aWSEndpointService":{"name":"private-router","namespace":"ocm-staging-23fc4hgctr42l1dvrd3ffnn5mg9eb3ha-mohit-hcp1"},"namespace":"ocm-staging-23fc4hgctr42l1dvrd3ffnn5mg9eb3ha-mohit-hcp1","name":"private-router","reconcileID":"cf294992-fb6e-4c53-ac8e-e9a4bd3f3cb6","error":"failed to delete resource: WebIdentityErr: failed fetching WebIdentity token: \ncaused by: WebIdentityErr: unable to read file at /var/run/secrets/openshift/serviceaccount/token\ncaused by: open /var/run/secrets/openshift/serviceaccount/token: no such file or directory","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:234"}

      Version-Release number of selected component (if applicable):

      Observed on 4.12.14 hosted clusters

      How reproducible:

Unknown at this time, but it is not a one-off: there have been at least four cases (three linked as OHSS tickets, one linked as a Slack thread in the comments).

      Steps to Reproduce:

      Unknown at this time
      

      Actual results:

The awsendpointservice is stuck deleting and unable to progress because the control plane operator's STS token is invalid (expired or missing).
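      
      When a cluster is wedged like this, one possible last-resort workaround (an assumption based on how finalizers work generally, not something validated in this bug) is to clear the finalizers on the awsendpointservice so the delete can complete, accepting that the controller's AWS-side cleanup is skipped:
      
      # Inspect which finalizers are blocking deletion.
      ❯ kubectl get awsendpointservice private-router \
          -n ocm-production-23fr3262c63nbrjvqbdjh9pqvhaa19dh-zx4ka-8sy24-wr3 \
          -o jsonpath='{.metadata.finalizers}'
      
      # WARNING: clearing the finalizers orphans the VPC endpoint service in
      # AWS; it then has to be deleted by hand (see the sketch under
      # "Expected results" below).
      ❯ kubectl patch awsendpointservice private-router \
          -n ocm-production-23fr3262c63nbrjvqbdjh9pqvhaa19dh-zx4ka-8sy24-wr3 \
          --type=merge -p '{"metadata":{"finalizers":null}}'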

      Expected results:

The awsendpointservice cleanup should be able to progress, at minimum deleting the AWS resources on the management cluster side before it gives up on the hosted workers account side.
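      
      For reference, the management-account side of that cleanup amounts to deleting the VPC endpoint service configuration, which can be done manually with the AWS CLI if the controller never gets there (a sketch; the service ID below is a placeholder, and any existing endpoint connections must be rejected first or the delete will fail):
      
      # Find the endpoint service fronting the hosted cluster's private router.
      ❯ aws ec2 describe-vpc-endpoint-service-configurations \
          --query 'ServiceConfigurations[].{id:ServiceId,name:ServiceName,lbs:NetworkLoadBalancerArns}'
      
      # Delete the stale endpoint service configuration by ID.
      ❯ aws ec2 delete-vpc-endpoint-service-configurations \
          --service-ids vpce-svc-0123456789abcdef0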

      Additional info:

https://redhat-internal.slack.com/archives/C04EUL1DRHC/p1683129924550959 (it is not clear whether this is the same bug or two different bugs)

              Seth Jennings (sjenning)
              Michael Shen (mshen.openshift)
              Jie Zhao
