- Bug
- Resolution: Not a Bug
- Critical
- None
- 4.12.z
- Important
- No
- Hypershift Sprint 236
- 1
- Rejected
- False
Description of problem:
We have seen hosted cluster uninstalls get stuck because an awsendpointservice is stuck in deletion. The symptom is that the hostedcontrolplane namespace remains active:

    ❯ k get ns | grep 23fr3262c63nbrjvqbdjh9pqvhaa19dh
    ocm-production-23fr3262c63nbrjvqbdjh9pqvhaa19dh                   Terminating   9h
    ocm-production-23fr3262c63nbrjvqbdjh9pqvhaa19dh-zx4ka-8sy24-wr3   Active        9h

with most of the hostedcontrolplane pods deleted, as well as the HCCO pod:

    ❯ k get po -n ocm-production-23fr3262c63nbrjvqbdjh9pqvhaa19dh-zx4ka-8sy24-wr3
    NAME                                                     READY   STATUS              RESTARTS          AGE
    audit-webhook-585fbf86d6-5g99m                           2/2     Running             0                 9h
    audit-webhook-585fbf86d6-rq7kc                           2/2     Running             0                 9h
    aws-ebs-csi-driver-controller-5969bf4d58-2bv4g           7/7     Running             4 (8h ago)        9h
    aws-ebs-csi-driver-operator-64cd745fbd-686p7             1/1     Running             0                 9h
    capi-provider-6f95885d87-d6lb7                           2/2     Running             0                 9h
    cloud-network-config-controller-55c659db46-qk99w         3/3     Running             1 (8h ago)        9h
    cluster-api-59bb8bc5fb-ckn6h                             1/1     Running             0                 9h
    control-plane-operator-64b5646bd9-s5v9p                  2/2     Running             0                 9h
    csi-snapshot-controller-77f8db785-fhvkt                  0/1     CrashLoopBackOff    100 (2m55s ago)   9h
    csi-snapshot-webhook-5c8b88dc97-cmztv                    1/1     Running             0                 9h
    metrics-forwarder-deployment-65db577ff6-6lmv6            0/1     ContainerCreating   0                 7h50m
    metrics-forwarder-secret-ensurer-28052163-mrwnw          0/1     Completed           0                 3m3s
    metrics-forwarder-secret-ensurer-28052164-5v4ln          0/1     Completed           0                 2m3s
    metrics-forwarder-secret-ensurer-28052165-l9bpr          0/1     Completed           0                 63s
    metrics-forwarder-secret-ensurer-28052166-ggtkh          0/1     Completed           0                 3s
    multus-admission-controller-d4b68d656-jhzx6              2/2     Running             0                 9h
    multus-admission-controller-d4b68d656-jz7pz              2/2     Running             0                 9h
    ovnkube-master-0                                         7/7     Running             0                 9h
    ovnkube-master-1                                         7/7     Running             1 (8h ago)        9h
    ovnkube-master-2                                         7/7     Running             0                 9h
    package-operator-remote-phase-manager-66d67b964b-wkm9p   1/1     Running             0                 9h
    validating-webhook-patching-job-28052100-272mt           0/1     Completed           0                 66m
    validating-webhook-patching-job-28052130-tm5s8           0/1     Completed           0                 36m
    validating-webhook-patching-job-28052160-jbk92           0/1     Completed           0                 6m3s
    validation-webhook-6f59c74578-qf4dp                      1/1     Running             0                 9h
    validation-webhook-6f59c74578-vlzx8                      1/1     Running             0                 9h

but the control plane operator pod is unable to clean up the awsendpointservice due to (in one case):

    {"level":"info","ts":"2023-05-03T15:59:50Z","msg":"reconciling","controller":"awsendpointservice","controllerGroup":"hypershift.openshift.io","controllerKind":"AWSEndpointService","aWSEndpointService":{"name":"private-router","namespace":"ocm-production-23fr3262c63nbrjvqbdjh9pqvhaa19dh-zx4ka-8sy24-wr3"},"namespace":"ocm-production-23fr3262c63nbrjvqbdjh9pqvhaa19dh-zx4ka-8sy24-wr3","name":"private-router","reconcileID":"4c52fc93-5e41-4580-b471-04569eb094f9"}
    {"level":"error","ts":"2023-05-03T15:59:50Z","msg":"Reconciler error","controller":"awsendpointservice","controllerGroup":"hypershift.openshift.io","controllerKind":"AWSEndpointService","aWSEndpointService":{"name":"private-router","namespace":"ocm-production-23fr3262c63nbrjvqbdjh9pqvhaa19dh-zx4ka-8sy24-wr3"},"namespace":"ocm-production-23fr3262c63nbrjvqbdjh9pqvhaa19dh-zx4ka-8sy24-wr3","name":"private-router","reconcileID":"4c52fc93-5e41-4580-b471-04569eb094f9","error":"failed to delete resource: WebIdentityErr: failed to retrieve credentials\ncaused by: ExpiredTokenException: Token expired: current date/time 1683129290 must be before the expiration date/time1683102164\n\tstatus code: 400, request id: 88412bad-f912-41c4-af40-a3b62012cb1e","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:234"}

and (in another case):

    {"level":"error","ts":"2023-05-03T16:29:55Z","msg":"Reconciler error","controller":"awsendpointservice","controllerGroup":"hypershift.openshift.io","controllerKind":"AWSEndpointService","aWSEndpointService":{"name":"private-router","namespace":"ocm-staging-23fc4hgctr42l1dvrd3ffnn5mg9eb3ha-mohit-hcp1"},"namespace":"ocm-staging-23fc4hgctr42l1dvrd3ffnn5mg9eb3ha-mohit-hcp1","name":"private-router","reconcileID":"cf294992-fb6e-4c53-ac8e-e9a4bd3f3cb6","error":"failed to delete resource: WebIdentityErr: failed fetching WebIdentity token: \ncaused by: WebIdentityErr: unable to read file at /var/run/secrets/openshift/serviceaccount/token\ncaused by: open /var/run/secrets/openshift/serviceaccount/token: no such file or directory","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:234"}
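To gauge how stale the token was in the first case, the two epoch values in the ExpiredTokenException message can be extracted and compared. This is a minimal sketch in Python; the regular expression and the helper name `token_expiry_gap` are illustrative assumptions based only on the one message quoted above, and are not part of HyperShift or the AWS SDK.

```python
import re
from datetime import datetime, timedelta, timezone

# Error text copied from the first reconciler log entry in this ticket.
ERROR = (
    "ExpiredTokenException: Token expired: current date/time 1683129290 "
    "must be before the expiration date/time1683102164"
)


def token_expiry_gap(message: str):
    """Return (current, expiration, gap) parsed from the error message.

    Hypothetical helper: the pattern is inferred from the single message
    quoted in this ticket and may not match other SDK versions.
    """
    match = re.search(
        r"current date/time\s*(\d+).*?expiration date/time\s*(\d+)", message
    )
    if match is None:
        raise ValueError("no epoch timestamps found in message")
    current = datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)
    expiration = datetime.fromtimestamp(int(match.group(2)), tz=timezone.utc)
    return current, expiration, current - expiration


current, expiration, gap = token_expiry_gap(ERROR)
print(f"current:     {current.isoformat()}")
print(f"expiration:  {expiration.isoformat()}")
print(f"expired for: {gap}")
```

For the message above, the gap is 27126 seconds, so the token had already been expired for roughly seven and a half hours when the delete was attempted.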
Version-Release number of selected component (if applicable):
Observed on 4.12.14 hosted clusters
How reproducible:
Unknown at this time, but it is not a one-off: there have been at least four cases (three linked as OHSS tickets, one linked as a Slack thread in the comments).
Steps to Reproduce:
Unknown at this time
Actual results:
The awsendpointservice was stuck in deletion and could not progress because the STS token was invalid (expired in one case, token file missing in the other).
Expected results:
The awsendpointservice cleanup is able to successfully progress, at least to delete the AWS resources on the management cluster before it gives up on the hosted workers account side.
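The expected behaviour can be pictured as a best-effort cleanup in which a credential failure on one side does not block the other side. The sketch below is only an illustration of that ordering under stated assumptions: `best_effort_cleanup` and both step functions are hypothetical stand-ins, not HyperShift code, and the simulated failure reuses the error text from the logs.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]


def best_effort_cleanup(steps: List[Step]) -> List[Tuple[str, Exception]]:
    """Run every cleanup step in order, recording failures instead of stopping."""
    failures: List[Tuple[str, Exception]] = []
    for name, step in steps:
        try:
            step()
        except Exception as exc:  # keep going so later steps still run
            failures.append((name, exc))
    return failures


deleted: List[str] = []


def delete_hosted_account_side() -> None:
    # Hypothetical step: simulates the credential failure seen in the logs.
    raise RuntimeError("WebIdentityErr: failed to retrieve credentials")


def delete_management_side() -> None:
    # Hypothetical step: stands in for deleting management cluster resources.
    deleted.append("management cluster AWS resources")


failures = best_effort_cleanup([
    ("hosted workers account", delete_hosted_account_side),
    ("management cluster", delete_management_side),
])

print("deleted:", deleted)
for name, exc in failures:
    print(f"failed: {name}: {exc}")
```

In this sketch the management cluster step still runs after the hosted workers account step fails, and the failure is reported to the caller, which could then decide whether to retry or give up.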
Additional info:
https://redhat-internal.slack.com/archives/C04EUL1DRHC/p1683129924550959
It's not clear if this is the same bug or two different bugs.
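One way to check whether the reported cases share a failure mode is to bucket the reconciler errors by signature. This is a sketch under assumptions: the two substrings are taken from the log entries quoted in the description, while the labels and the `classify` helper are hypothetical names, and the sample lines are trimmed reconstructions of those log entries.

```python
import json

# Substring -> label. The substrings come from the errors quoted in this
# ticket; the labels are made up for illustration.
SIGNATURES = {
    "ExpiredTokenException": "expired-token",
    "unable to read file at /var/run/secrets/openshift/serviceaccount/token":
        "missing-token-file",
}


def classify(log_line: str) -> str:
    """Return a label for one JSON log line from the control plane operator."""
    entry = json.loads(log_line)
    if entry.get("level") != "error":
        return "not-an-error"
    error = entry.get("error", "")
    for needle, label in SIGNATURES.items():
        if needle in error:
            return label
    return "other"


# Trimmed reconstructions of the two error entries quoted in the description.
samples = [
    json.dumps({
        "level": "error",
        "msg": "Reconciler error",
        "error": "failed to delete resource: WebIdentityErr: failed to "
                 "retrieve credentials\ncaused by: ExpiredTokenException: "
                 "Token expired",
    }),
    json.dumps({
        "level": "error",
        "msg": "Reconciler error",
        "error": "failed to delete resource: WebIdentityErr: failed fetching "
                 "WebIdentity token: \ncaused by: WebIdentityErr: unable to "
                 "read file at /var/run/secrets/openshift/serviceaccount/token",
    }),
]

for line in samples:
    print(classify(line))
```

Running the same classification over the operator logs from each linked case would show whether they split into these two buckets or fall into a third, unlisted one.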
- relates to
- OCPBUGS-13184 Uninstalled hostedcontrolplane leaves some PDBs behind
- Closed