  OpenShift Bugs / OCPBUGS-2381

[2078376] [IPI on Alibabacloud] destroying a working cluster would miss 1 or 2 compute nodes so that some other resources cannot be deleted too

    • Bug
    • Resolution: Unresolved
    • Undefined
    • 4.11
    • Quality / Stability / Reliability
    • False
    • Important
    • Rejected

      Version:
      ./openshift-install 4.11.0-0.nightly-2022-04-24-135651
      built from commit 9cf0c5a963bf983ccf997fed46e7bcde81a02569
      release image registry.ci.openshift.org/ocp/release@sha256:3cfd57e4c7cff0807b7811a3a885b336955e1f7b4c646b17975307c350830879
      release architecture amd64

      Platform: alibabacloud

      Please specify: IPI

      What happened?
      Destroying a working cluster does not delete all of the cluster's resources. 1 or 2 compute nodes are left behind, along with security groups, load balancers, the NAT gateway and EIP, and the VPC.

      What did you expect to happen?
      Destroying a working cluster in any region should delete all resources of the cluster.

      How to reproduce it (as minimally and precisely as possible)?
      It appears to happen every time. So far we have tried regions "us-east-1", "eu-west-1", and "eu-central-1", and all of them show the issue. We also suspect that the leaked resources used up the VPC/SLB quota for the prow CI jobs.

      Anything else we need to know?
      We first hit the issue while debugging prow CI jobs (see PR https://github.com/openshift/release/pull/28083). We then tried QE flexy jobs and saw the same issue, although not in every run (e.g. no such issue in region "ap-northeast-1").

      The QE flexy jobs and log snippets:

      https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/97010/ (region: us-east-1)
      https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-destroy/87853/
      >log snippet of the destroy:
      04-25 13:18:34.763 level=debug msg=OpenShift Installer 4.11.0-0.nightly-2022-04-24-135651
      04-25 13:18:34.763 level=debug msg=Built from commit 9cf0c5a963bf983ccf997fed46e7bcde81a02569
      04-25 13:18:34.763 level=debug msg=Retrieving cloud resources tags={"kubernetes.io/cluster/jiwei-0425-02-5g5cb":"owned"}
      04-25 13:18:36.645 level=debug msg=Retrieving cloud resources tags={"ack.aliyun.com":"jiwei-0425-02-5g5cb"}

      04-25 13:18:37.592 level=debug msg=Searching RAM policy policyName=jiwei-0425-02-5g5cb-policy-bootstrap stage=RAM roles
      04-25 13:18:37.592 level=debug msg=Searching OSS bucket bucketName=jiwei-0425-02-5g5cb-image-registry-us-east-1-fsgpawgrijcevgxpq stage=OSS buckets
      04-25 13:18:37.592 level=debug msg=Searching DNS records stage=DNS records
      04-25 13:18:38.515 level=debug msg=Unbinding tags for OSS bucket bucketName=jiwei-0425-02-5g5cb-image-registry-us-east-1-fsgpawgrijcevgxpq stage=OSS buckets tags=[kubernetes.io/cluster/jiwei-0425-02-5g5cb]
      >04-25 13:18:39.104 level=debug msg=Deleting ECS instances ecsIDs=[i-0xi0csfz5bpct1o0py3g i-0xi0csfz5bpct1o0py3h i-0xifdsmun8b9v22aojmi i-0xi0csfz5bpct9k566fw i-0xi0csfz5bpctbj6a8gl] stage=ECS instances
      04-25 13:18:39.359 level=debug msg=Deleting policyName=jiwei-0425-02-5g5cb-policy-bootstrap stage=RAM roles
      04-25 13:18:40.316 level=debug msg=Searching OSS bucket objects bucketName=jiwei-0425-02-5g5cb-image-registry-us-east-1-fsgpawgrijcevgxpq stage=OSS buckets
      04-25 13:18:40.316 level=debug msg=Deleting OSS bucket bucketName=jiwei-0425-02-5g5cb-image-registry-us-east-1-fsgpawgrijcevgxpq stage=OSS buckets
      04-25 13:18:40.874 level=debug msg=Deleting domain=alicloud-qe.devcluster.openshift.com recordID=759221396576844800 rr=*.apps.jiwei-0425-02 stage=DNS records
      04-25 13:18:41.128 level=debug msg=Deleting domain=alicloud-qe.devcluster.openshift.com recordID=759219941511925760 rr=api.jiwei-0425-02 stage=DNS records
      04-25 13:18:41.413 level=debug msg=Deleting roleName=jiwei-0425-02-5g5cb-role-bootstrap stage=RAM roles
      04-25 13:18:41.971 level=debug msg=Searching RAM policy policyName=jiwei-0425-02-5g5cb-policy-master stage=RAM roles
      04-25 13:18:42.227 level=debug msg=Detaching policy for RAM role policyName=jiwei-0425-02-5g5cb-policy-master principalName=jiwei-0425-02-5g5cb-role-master@role.5724326381648897.onaliyunservice.com stage=RAM roles
      04-25 13:18:43.184 level=debug msg=Public DNS records deleted stage=DNS records
      04-25 13:18:44.125 level=debug msg=Policy detached policyName=jiwei-0425-02-5g5cb-policy-master stage=RAM roles
      04-25 13:18:44.126 level=debug msg=Deleting policyName=jiwei-0425-02-5g5cb-policy-master stage=RAM roles
      04-25 13:18:46.076 level=info msg=OSS bucket deleted bucketName=jiwei-0425-02-5g5cb-image-registry-us-east-1-fsgpawgrijcevgxpq stage=OSS buckets
      04-25 13:18:46.076 level=info msg=OSS buckets deleted stage=OSS buckets
      04-25 13:18:46.076 level=info msg=ECS instances deleted stage=ECS instances
      04-25 13:18:46.076 level=debug msg=Deleting roleName=jiwei-0425-02-5g5cb-role-master stage=RAM roles
      04-25 13:18:46.332 level=debug msg=Searching RAM policy policyName=jiwei-0425-02-5g5cb-policy-worker stage=RAM roles
      04-25 13:18:46.587 level=debug msg=Detaching policy for RAM role policyName=jiwei-0425-02-5g5cb-policy-worker principalName=jiwei-0425-02-5g5cb-role-worker@role.5724326381648897.onaliyunservice.com stage=RAM roles
      04-25 13:18:48.504 level=debug msg=Policy detached policyName=jiwei-0425-02-5g5cb-policy-worker stage=RAM roles
      04-25 13:18:48.504 level=debug msg=Deleting policyName=jiwei-0425-02-5g5cb-policy-worker stage=RAM roles
      04-25 13:18:50.457 level=debug msg=Deleting roleName=jiwei-0425-02-5g5cb-role-worker stage=RAM roles
      04-25 13:18:50.711 level=info msg=RAM roles deleted stage=RAM roles
      04-25 13:18:50.711 level=debug msg=Searching private zone clusterDomain=jiwei-0425-02.alicloud-qe.devcluster.openshift.com stage=private zones
      04-25 13:18:50.965 level=debug msg=Unbinding private zone with vpc stage=private zones zoneID=123f858b8ae4176c562c03846cccae3b
      04-25 13:18:52.891 level=debug msg=Deleting private zone stage=private zones zoneID=123f858b8ae4176c562c03846cccae3b
      04-25 13:18:55.427 level=info msg=Private zones deleted stage=private zones
      04-25 13:18:55.427 level=debug msg=Searching resource groups name=jiwei-0425-02-5g5cb-rg stage=resource groups
      04-25 13:18:55.681 level=debug msg=Purging asset "Metadata" from disk
      04-25 13:18:55.681 level=debug msg=Purging asset "Master Ignition Customization Check" from disk
      04-25 13:18:55.681 level=debug msg=Purging asset "Worker Ignition Customization Check" from disk
      04-25 13:18:55.681 level=debug msg=Purging asset "Terraform Variables" from disk
      04-25 13:18:55.936 level=debug msg=Purging asset "Kubeconfig Admin Client" from disk
      04-25 13:18:55.936 level=debug msg=Purging asset "Kubeadmin Password" from disk
      04-25 13:18:55.936 level=debug msg=Purging asset "Certificate (journal-gatewayd)" from disk
      04-25 13:18:55.936 level=debug msg=Purging asset "Cluster" from disk
      04-25 13:18:55.936 level=info msg=Time elapsed: 21s
      >remaining resources after destroying the cluster:
      $ aliyun resourcemanager ListResources --endpoint "resourcemanager.aliyuncs.com" --Region "us-east-1" --ResourceGroupId "rg-aek2aognijpinoy" --output cols=CreateDate,RegionId,ResourceType,Service,ResourceId rows=Resources.Resource[]
      CreateDate | RegionId | ResourceType | Service | ResourceId
      ---------- | -------- | ------------ | ------- | ----------
      2022-04-25T12:15:49+08:00 | us-east-1 | disk | ecs | d-0xide7x8bi2pz2vblvt9
      2022-04-25T12:15:49+08:00 | us-east-1 | eni | ecs | eni-0xifdsmun8b9v9y9gfcn
      2022-04-25T12:15:49+08:00 | us-east-1 | instance | ecs | i-0xi13jzw8m86er6hirui
      2022-04-25T12:00:39+08:00 | us-east-1 | securitygroup | ecs | sg-0xi0csfz5bpct1nzpsh9
      2022-04-25T12:00:39+08:00 | us-east-1 | securitygroup | ecs | sg-0xi76zpfhbtwaewix3qz
      2022-04-25T12:00:33+08:00 | us-east-1 | eip | eip | eip-0xinty4pdfss7cb6qf2tf
      2022-04-25T12:00:39+08:00 | us-east-1 | loadbalancer | slb | lb-7go55ddlrdycbuz9ha3gv
      2022-04-25T12:01:01+08:00 | us-east-1 | loadbalancer | slb | lb-7gockfv41r0erza1ugg2j
      2022-04-25T12:00:57+08:00 | us-east-1 | natgateway | vpc | ngw-0xi3tiswq7y3r9vc9rdg0
      2022-04-25T12:00:32+08:00 | us-east-1 | vpc | vpc | vpc-0xi83ys8ywf6igy4gucwo

      $
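
One way to check whether the leftover instance was ever targeted by the destroy is to compare the `ecsIDs` list from the "Deleting ECS instances" log line with the instance rows that `ListResources` still returns. The sketch below does that for the first cluster, using the log line and table rows pasted above. The helper names (`targeted_ids`, `remaining_instances`) are illustrative and not part of the installer or the aliyun CLI.

```python
import re

# "Deleting ECS instances" log line copied from the destroy output above.
LOG_LINE = (
    "level=debug msg=Deleting ECS instances "
    "ecsIDs=[i-0xi0csfz5bpct1o0py3g i-0xi0csfz5bpct1o0py3h "
    "i-0xifdsmun8b9v22aojmi i-0xi0csfz5bpct9k566fw i-0xi0csfz5bpctbj6a8gl] "
    "stage=ECS instances"
)

# First rows of the ListResources output above (cols format, " | " separated).
REMAINING = """\
2022-04-25T12:15:49+08:00 | us-east-1 | disk | ecs | d-0xide7x8bi2pz2vblvt9
2022-04-25T12:15:49+08:00 | us-east-1 | eni | ecs | eni-0xifdsmun8b9v9y9gfcn
2022-04-25T12:15:49+08:00 | us-east-1 | instance | ecs | i-0xi13jzw8m86er6hirui
"""


def targeted_ids(log_line):
    """Return the set of instance IDs listed in ecsIDs=[...]."""
    match = re.search(r"ecsIDs=\[([^\]]*)\]", log_line)
    return set(match.group(1).split()) if match else set()


def remaining_instances(table):
    """Return the ResourceId of every row whose ResourceType is 'instance'."""
    ids = set()
    for row in table.splitlines():
        cols = [c.strip() for c in row.split("|")]
        if len(cols) == 5 and cols[2] == "instance":
            ids.add(cols[4])
    return ids


# Instances that still exist and were never in the deletion list.
missed = remaining_instances(REMAINING) - targeted_ids(LOG_LINE)
print(sorted(missed))
```

For this cluster the leftover instance `i-0xi13jzw8m86er6hirui` is not among the five IDs in the deletion list, which suggests the destroy never found it in the first place rather than failing to delete it.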

      https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/97011/ (region: us-east-1)
      https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-destroy/87854/
      >log snippet of the destroy:
      04-25 13:18:48.314 level=debug msg=Deleting ECS instances ecsIDs=[i-0xi2s0tw2fofwinizdgo i-0xi2s0tw2fofwinizdgn i-0xide7x8bi2pywyej7he i-0xide7x8bi2pz2vhvdrg i-0xide7x8bi2pz2vhvdth] stage=ECS instances
      >remaining resources after destroying the cluster:
      $ aliyun resourcemanager ListResources --endpoint "resourcemanager.aliyuncs.com" --Region "us-east-1" --ResourceGroupId "rg-aek2wky7lxk4f5y" --output cols=CreateDate,RegionId,ResourceType,Service,ResourceId rows=Resources.Resource[]
      CreateDate | RegionId | ResourceType | Service | ResourceId
      ---------- | -------- | ------------ | ------- | ----------
      2022-04-25T12:16:46+08:00 | us-east-1 | disk | ecs | d-0xide7x8bi2pz2vblvv7
      2022-04-25T12:16:46+08:00 | us-east-1 | eni | ecs | eni-0xifdsmun8b9v9y9gfda
      2022-04-25T12:16:46+08:00 | us-east-1 | instance | ecs | i-0xifdsmun8b9v9yf4ry9
      2022-04-25T12:01:50+08:00 | us-east-1 | securitygroup | ecs | sg-0xifdsmun8b9v414vl63
      2022-04-25T12:01:50+08:00 | us-east-1 | securitygroup | ecs | sg-0xide7x8bi2pywye08t4
      2022-04-25T12:01:44+08:00 | us-east-1 | eip | eip | eip-0xi4fcumv1uy8dxrqnfe5
      2022-04-25T12:01:47+08:00 | us-east-1 | loadbalancer | slb | lb-7go6weruo4xbtp4bs0s80
      2022-04-25T12:02:13+08:00 | us-east-1 | loadbalancer | slb | lb-7godzg98qanoq811ananq
      2022-04-25T12:02:09+08:00 | us-east-1 | natgateway | vpc | ngw-0xir87ma3xnp9p0cqfd77
      2022-04-25T12:01:44+08:00 | us-east-1 | vpc | vpc | vpc-0xi29vd1j95kwnuh9eo3x

      $

      https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/97043/ (region: eu-west-1)
      https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-destroy/87861/
      >log snippet of the destroy:
      04-25 14:13:55.912 level=debug msg=Deleting ECS instances ecsIDs=[i-d7ofus77g1ud0rk5geio i-d7oh1c1ktewql2jvjao1 i-d7ob4zfxlfuhw351arfh i-d7oh1c1ktewql8gyvgls i-d7ob4zfxlfuhw924mxdb] stage=ECS instances
      >remaining resources after destroying the cluster:
      $ aliyun resourcemanager ListResources --endpoint "resourcemanager.aliyuncs.com" --Region "eu-west-1" --ResourceGroupId "rg-aek2wky7lxk4f5y" --PageSize 30 --output cols=CreateDate,RegionId,ResourceType,Service,ResourceId rows=Resources.Resource[]
      CreateDate | RegionId | ResourceType | Service | ResourceId
      ---------- | -------- | ------------ | ------- | ----------
      2022-04-25T13:50:11+08:00 | eu-west-1 | disk | ecs | d-d7oh1c1ktewql8gyq9ee
      2022-04-25T13:50:11+08:00 | eu-west-1 | eni | ecs | eni-d7ob4zfxlfuhw9200l3m
      2022-04-25T13:50:11+08:00 | eu-west-1 | instance | ecs | i-d7ofus77g1ud0xh8skgb
      2022-04-25T13:37:40+08:00 | eu-west-1 | securitygroup | ecs | sg-d7oh1c1ktewql2jqmq48
      2022-04-25T13:37:40+08:00 | eu-west-1 | securitygroup | ecs | sg-d7oh1c1ktewql2jqmq49
      2022-04-25T13:37:35+08:00 | eu-west-1 | vpc | vpc | vpc-d7ow0w4vozgqccdo110jz

      $

      https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/97042/ (region: eu-central-1)
      https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-destroy/87863/
      >log snippet of the destroy:
      04-25 14:22:26.165 level=debug msg=Deleting ECS instances ecsIDs=[i-gw8848gi5x3s3eiz2c58 i-gw8fvvej3dibrwfcx8r1 i-gw8fvvej3dibrwfcx8r2 i-gw8848gi5x3s3mf3ik22] stage=ECS instances
      >remaining resources after destroying the cluster:
      $ aliyun resourcemanager ListResources --endpoint "resourcemanager.aliyuncs.com" --Region "eu-central-1" --ResourceGroupId "rg-aek2aognijpinoy" --PageSize 30 --output cols=CreateDate,RegionId,ResourceType,Service,ResourceId rows=Resources.Resource[]
      CreateDate | RegionId | ResourceType | Service | ResourceId
      ---------- | -------- | ------------ | ------- | ----------
      2022-04-25T13:54:43+08:00 | eu-central-1 | disk | ecs | d-gw8glsylvylkx7kg1n87
      2022-04-25T13:55:53+08:00 | eu-central-1 | disk | ecs | d-gw8ed92ta5rd4igmv3o8
      2022-04-25T13:54:43+08:00 | eu-central-1 | eni | ecs | eni-gw8ed92ta5rd4igjrpsf
      2022-04-25T13:55:53+08:00 | eu-central-1 | eni | ecs | eni-gw8848gi5x3s3mf3iffv
      2022-04-25T13:54:43+08:00 | eu-central-1 | instance | ecs | i-gw8ed92ta5rd4igmbmrt
      2022-04-25T13:55:53+08:00 | eu-central-1 | instance | ecs | i-gw8glsylvylkx7khc6cy
      2022-04-25T13:37:30+08:00 | eu-central-1 | securitygroup | ecs | sg-gw8glsylvylkwzoa55ma
      2022-04-25T13:37:27+08:00 | eu-central-1 | eip | eip | eip-gw8su7cs7bleui26vwmga
      2022-04-25T13:37:26+08:00 | eu-central-1 | vpc | vpc | vpc-gw8nyge09ok2apd5avsmj

      $
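
The `CreateDate` column also hints at which instances are being missed. As a rough sketch, the snippet below takes the eu-central-1 rows pasted above and computes how long after the VPC each leftover instance was created. The `parse_rows` helper is illustrative only, and the interpretation is an assumption drawn from this data, not something confirmed in the installer.

```python
from datetime import datetime

# Rows copied from the eu-central-1 ListResources output above.
ROWS = """\
2022-04-25T13:54:43+08:00 | eu-central-1 | instance | ecs | i-gw8ed92ta5rd4igmbmrt
2022-04-25T13:55:53+08:00 | eu-central-1 | instance | ecs | i-gw8glsylvylkx7khc6cy
2022-04-25T13:37:30+08:00 | eu-central-1 | securitygroup | ecs | sg-gw8glsylvylkwzoa55ma
2022-04-25T13:37:27+08:00 | eu-central-1 | eip | eip | eip-gw8su7cs7bleui26vwmga
2022-04-25T13:37:26+08:00 | eu-central-1 | vpc | vpc | vpc-gw8nyge09ok2apd5avsmj
"""


def parse_rows(table):
    """Parse ' | '-separated rows into (created, resource_type, resource_id)."""
    parsed = []
    for row in table.splitlines():
        cols = [c.strip() for c in row.split("|")]
        if len(cols) == 5:
            parsed.append((datetime.fromisoformat(cols[0]), cols[2], cols[4]))
    return parsed


rows = parse_rows(ROWS)

# Use the VPC creation time as the start of cluster provisioning.
vpc_created = next(created for created, kind, _ in rows if kind == "vpc")

# Seconds between VPC creation and each leftover instance's creation.
lags = {
    rid: int((created - vpc_created).total_seconds())
    for created, kind, rid in rows
    if kind == "instance"
}

for rid, seconds in sorted(lags.items()):
    print(f"{rid}: created {seconds // 60}m{seconds % 60:02d}s after the VPC")
```

Both leftover instances were created more than 17 minutes after the VPC, which is consistent with them being compute nodes brought up after the initial infrastructure.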

              Unassigned
              Beth White (beth.white)
              Gaoyun Pei