OCPBUGS-27156 (OpenShift Bugs)

[gcp] destroying the problem cluster unexpectedly deletes the dns record-sets not created by the installer

Details

    • Moderate
    • No
    • Rejected
    • False
    • A cluster fails, and the user attempts a second install with the same cluster name. When the user attempts to destroy the first cluster, the DNS record sets are destroyed in the second cluster. The destruction of the DNS record sets causes the install to fail. To fix this issue, the stored metadata will contain the private zone name rather than the cluster domain. An exact match for the private zone will be made, so there is no confusion over which DNS records will be deleted.
    • Bug Fix
    • In Progress
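
The fix described in the details above (store the private zone name and match it exactly) can be illustrated with a minimal Python sketch. It is not the installer's code: the `ManagedZone` type and both function names are hypothetical, and treating the pre-fix behavior as "match any private zone that serves the cluster domain" is an assumption drawn from the description, not from the installer source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ManagedZone:
    """Hypothetical stand-in for a Cloud DNS managed zone."""
    name: str        # zone resource name, e.g. "jiwei-0115y-lgns8-private-zone"
    dns_name: str    # DNS suffix served by the zone, with trailing dot
    visibility: str  # "private" or "public"


def zones_by_cluster_domain(zones, cluster_domain):
    """Assumed pre-fix lookup: any private zone serving the cluster domain matches.

    Two clusters that share a name and base domain share a cluster domain,
    so this lookup cannot tell their zones apart.
    """
    suffix = cluster_domain.rstrip(".") + "."
    return [z for z in zones
            if z.visibility == "private" and z.dns_name == suffix]


def zone_by_exact_name(zones, private_zone_name):
    """Sketch of the fixed lookup: only the zone with the stored name matches."""
    for z in zones:
        if z.name == private_zone_name:
            return z
    return None
```

With the zone from this report as the only private zone, a lookup by the cluster domain `jiwei-0115y.qe.gcp.devcluster.openshift.com` returns the first cluster's zone even when run on behalf of the second cluster. A lookup by a second cluster's own zone name (for example a made-up `jiwei-0115y-abcde-private-zone`) returns nothing, so no record sets would be selected for deletion.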

    Description

      Description of problem:

         Creating a second cluster with the same cluster name and base domain as the first cluster fails, as expected, because of DNS record-set conflicts. However, destroying the failed second cluster makes the first cluster inaccessible, which is unexpected.

      Version-Release number of selected component (if applicable):

          4.15.0-0.nightly-2024-01-14-100410

      How reproducible:

          Always

      Steps to Reproduce:

      1. Create the first cluster and make sure it succeeds.
      2. Try to create the second cluster with the same cluster name, base domain, and region, and make sure it fails.
      3. Destroy the second cluster, which failed at "Platform Provisioning Check".
      4. Check whether the first cluster is still healthy.

      Actual results:

          The first cluster becomes unhealthy because its DNS record-sets are deleted in step 3.

      Expected results:

          The DNS record-sets of the first cluster stay untouched during step 3, and the first cluster stays healthy after step 3.

      Additional info:

      (1) The first cluster was created by Flexy-install job https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/257549/, and it is healthy initially.
      
      $ oc get clusterversion
      NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.15.0-0.nightly-2024-01-14-100410   True        False         54m     Cluster version is 4.15.0-0.nightly-2024-01-14-100410
      $ oc get nodes
      NAME                                                       STATUS   ROLES                  AGE   VERSION
      jiwei-0115y-lgns8-master-0.c.openshift-qe.internal         Ready    control-plane,master   73m   v1.28.5+c84a6b8
      jiwei-0115y-lgns8-master-1.c.openshift-qe.internal         Ready    control-plane,master   73m   v1.28.5+c84a6b8
      jiwei-0115y-lgns8-master-2.c.openshift-qe.internal         Ready    control-plane,master   74m   v1.28.5+c84a6b8
      jiwei-0115y-lgns8-worker-a-gqq96.c.openshift-qe.internal   Ready    worker                 62m   v1.28.5+c84a6b8
      jiwei-0115y-lgns8-worker-b-2h9xd.c.openshift-qe.internal   Ready    worker                 63m   v1.28.5+c84a6b8
      $ 
      
      (2) Try to create the second cluster; it is expected to fail because the DNS record already exists.
      
      $ openshift-install version
      openshift-install 4.15.0-0.nightly-2024-01-14-100410
      built from commit b6f320ab7eeb491b2ef333a16643c140239de0e5
      release image registry.ci.openshift.org/ocp/release@sha256:385d84c803c776b44ce77b80f132c1b6ed10bd590f868c97e3e63993b811cc2d
      release architecture amd64
      $ mkdir test1
      $ cp install-config.yaml test1
      $ yq-3.3.0 r test1/install-config.yaml baseDomain
      qe.gcp.devcluster.openshift.com
      $ yq-3.3.0 r test1/install-config.yaml metadata
      creationTimestamp: null
      name: jiwei-0115y
      $ yq-3.3.0 r test1/install-config.yaml platform
      gcp:
        projectID: openshift-qe
        region: us-central1
      $ openshift-install create cluster --dir test1
      INFO Credentials loaded from file "/home/fedora/.gcp/osServiceAccount.json" 
      INFO Consuming Install Config from target directory 
      FATAL failed to fetch Terraform Variables: failed to fetch dependency of "Terraform Variables": failed to generate asset "Platform Provisioning Check": metadata.name: Invalid value: "jiwei-0115y": record(s) ["api.jiwei-0115y.qe.gcp.devcluster.openshift.com."] already exists in DNS Zone (openshift-qe/qe) and might be in use by another cluster, please remove it to continue 
      $ 
      
      (3) Destroy the second cluster.
      
      $ openshift-install destroy cluster --dir test1
      INFO Credentials loaded from file "/home/fedora/.gcp/osServiceAccount.json" 
      INFO Deleted 2 recordset(s) in zone qe            
      INFO Deleted 3 recordset(s) in zone jiwei-0115y-lgns8-private-zone 
      WARNING Skipping deletion of DNS Zone jiwei-0115y-lgns8-private-zone, not created by installer 
      INFO Time elapsed: 37s                            
      INFO Uninstallation complete!                     
      $ 
      
      (4) Check the first cluster's status and the DNS record-sets.
      
      $ oc get clusterversion
      Unable to connect to the server: dial tcp: lookup api.jiwei-0115y.qe.gcp.devcluster.openshift.com on 10.11.5.160:53: no such host
      $
      $ gcloud dns managed-zones describe jiwei-0115y-lgns8-private-zone
      cloudLoggingConfig:
        kind: dns#managedZoneCloudLoggingConfig
      creationTime: '2024-01-15T07:22:55.199Z'
      description: Created By OpenShift Installer
      dnsName: jiwei-0115y.qe.gcp.devcluster.openshift.com.
      id: '9193862213315831261'
      kind: dns#managedZone
      labels:
        kubernetes-io-cluster-jiwei-0115y-lgns8: owned
      name: jiwei-0115y-lgns8-private-zone
      nameServers:
      - ns-gcp-private.googledomains.com.
      privateVisibilityConfig:
        kind: dns#managedZonePrivateVisibilityConfig
        networks:
        - kind: dns#managedZonePrivateVisibilityConfigNetwork
          networkUrl: https://www.googleapis.com/compute/v1/projects/openshift-qe/global/networks/jiwei-0115y-lgns8-network
      visibility: private
      $ gcloud dns record-sets list --zone jiwei-0115y-lgns8-private-zone
      NAME                                          TYPE  TTL    DATA
      jiwei-0115y.qe.gcp.devcluster.openshift.com.  NS    21600  ns-gcp-private.googledomains.com.
      jiwei-0115y.qe.gcp.devcluster.openshift.com.  SOA   21600  ns-gcp-private.googledomains.com. cloud-dns-hostmaster.google.com. 1 21600 3600 259200 300
      $ gcloud dns record-sets list --zone qe --filter='name~jiwei-0115y'
      Listed 0 items.
      $ 
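
      The zone description above carries the label `kubernetes-io-cluster-jiwei-0115y-lgns8: owned`, which ties the zone to the first cluster's infra ID. As a rough illustration only, a destroy run could check that label against its own infra ID before touching any record sets in a zone. The function names below are hypothetical, the second infra ID `jiwei-0115y-abcde` is made up, and this is not a claim about how the installer decides ownership.

```python
def owned_label_key(infra_id):
    """Build the ownership label key in the form seen on the zone in this report."""
    return f"kubernetes-io-cluster-{infra_id}"


def is_owned_by(labels, infra_id):
    """Return True only if the zone labels mark it as owned by this infra ID."""
    return labels.get(owned_label_key(infra_id)) == "owned"


# Labels copied from the `gcloud dns managed-zones describe` output above.
zone_labels = {"kubernetes-io-cluster-jiwei-0115y-lgns8": "owned"}

# The first cluster's infra ID matches; a made-up second infra ID does not.
first_owns = is_owned_by(zone_labels, "jiwei-0115y-lgns8")
second_owns = is_owned_by(zone_labels, "jiwei-0115y-abcde")
```

      Under this check, the destroy of the second cluster would skip the zone `jiwei-0115y-lgns8-private-zone` entirely rather than deleting the three record sets in it.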
      

            People

              rh-ee-bbarbach Brent Barbachem
              rhn-support-jiwei Jianli Wei
              Jianli Wei Jianli Wei