
OCPBUGS-27156: [gcp] destroying the problem cluster unexpectedly deletes the dns record-sets not created by the installer

      Release Note Type: Bug Fix

      Release Note Text:
      * Previously, when a cluster installation was attempted using the same cluster name and base domain as an existing cluster, and the installation failed due to DNS record set conflicts, destroying the second cluster would also destroy the DNS record sets in the original cluster. With this update, the stored metadata contains the private zone name rather than the cluster domain, so only the correct DNS records are deleted upon cluster destruction. (link:https://issues.redhat.com/browse/OCPBUGS-27156[*OCPBUGS-27156*])
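
      To illustrate the behavioral difference, here is a minimal sketch of the record selection before and after the fix, using the zone and record names from the transcripts in this report. The shell commands only approximate what the installer does internally (the installer implements this in Go) and are not its actual code:

        # Before the fix: the destroyer matched record-sets by cluster domain in
        # the shared base zone, which also matches records created by any other
        # cluster using the same cluster name and base domain.
        gcloud dns record-sets list --zone qe \
            --filter='name~jiwei-0115y.qe.gcp.devcluster.openshift.com.'

        # After the fix: the destroyer works from the private zone name stored
        # in the cluster's own metadata, so a failed cluster that never created
        # a private zone has nothing matching to delete.
        gcloud dns record-sets list --zone jiwei-0115y-lgns8-private-zone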

      Description of problem:

         Creating a second cluster with the same cluster name and base domain as the first cluster fails, as expected, because of DNS record-set conflicts. However, destroying the failed second cluster makes the first cluster inaccessible, which is unexpected.

      Version-Release number of selected component (if applicable):

          4.15.0-0.nightly-2024-01-14-100410

      How reproducible:

          Always

      Steps to Reproduce:

      1. Create the first cluster and make sure it succeeds.
      2. Try to create a second cluster with the same cluster name, base domain, and region, and confirm that it fails.
      3. Destroy the second cluster, which failed the "Platform Provisioning Check".
      4. Check whether the first cluster is still healthy. (A condensed reproduction is sketched below.)
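
      A condensed reproduction, assuming two asset directories whose install-config.yaml files share the metadata.name, baseDomain, and region shown in the transcripts below (the directory names here are illustrative):

        # step 1: the first cluster, expected to succeed
        openshift-install create cluster --dir test0
        # step 2: a second cluster with the same metadata.name and baseDomain,
        # expected to fail the "Platform Provisioning Check" on the
        # conflicting api record
        openshift-install create cluster --dir test1
        # step 3: destroying the failed cluster triggers the bug, removing the
        # first cluster's record-sets from the shared base zone as well
        openshift-install destroy cluster --dir test1
        # step 4: the first cluster's API is now unreachable
        oc get clusterversion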

      Actual results:

          The first cluster becomes unhealthy, because its DNS record-sets are deleted in step 3.

      Expected results:

          The DNS record-sets of the first cluster stay untouched during step 3, and the first cluster stays healthy afterwards.
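
      A quick way to check both conditions, reusing commands that appear in the transcripts below:

        # the first cluster's public records should still be present in the
        # base zone
        gcloud dns record-sets list --zone qe --filter='name~jiwei-0115y'
        # and the first cluster's API should still respond
        oc get clusterversion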

      Additional info:

      (1) the first cluster was created by Flexy-install job https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/257549/, and it's healthy initially
      
      $ oc get clusterversion
      NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.15.0-0.nightly-2024-01-14-100410   True        False         54m     Cluster version is 4.15.0-0.nightly-2024-01-14-100410
      $ oc get nodes
      NAME                                                       STATUS   ROLES                  AGE   VERSION
      jiwei-0115y-lgns8-master-0.c.openshift-qe.internal         Ready    control-plane,master   73m   v1.28.5+c84a6b8
      jiwei-0115y-lgns8-master-1.c.openshift-qe.internal         Ready    control-plane,master   73m   v1.28.5+c84a6b8
      jiwei-0115y-lgns8-master-2.c.openshift-qe.internal         Ready    control-plane,master   74m   v1.28.5+c84a6b8
      jiwei-0115y-lgns8-worker-a-gqq96.c.openshift-qe.internal   Ready    worker                 62m   v1.28.5+c84a6b8
      jiwei-0115y-lgns8-worker-b-2h9xd.c.openshift-qe.internal   Ready    worker                 63m   v1.28.5+c84a6b8
      $ 
      
      (2) try to create the second cluster, expecting it to fail because the DNS record already exists
      
      $ openshift-install version
      openshift-install 4.15.0-0.nightly-2024-01-14-100410
      built from commit b6f320ab7eeb491b2ef333a16643c140239de0e5
      release image registry.ci.openshift.org/ocp/release@sha256:385d84c803c776b44ce77b80f132c1b6ed10bd590f868c97e3e63993b811cc2d
      release architecture amd64
      $ mkdir test1
      $ cp install-config.yaml test1
      $ yq-3.3.0 r test1/install-config.yaml baseDomain
      qe.gcp.devcluster.openshift.com
      $ yq-3.3.0 r test1/install-config.yaml metadata
      creationTimestamp: null
      name: jiwei-0115y
      $ yq-3.3.0 r test1/install-config.yaml platform
      gcp:
        projectID: openshift-qe
        region: us-central1
      $ openshift-install create cluster --dir test1
      INFO Credentials loaded from file "/home/fedora/.gcp/osServiceAccount.json" 
      INFO Consuming Install Config from target directory 
      FATAL failed to fetch Terraform Variables: failed to fetch dependency of "Terraform Variables": failed to generate asset "Platform Provisioning Check": metadata.name: Invalid value: "jiwei-0115y": record(s) ["api.jiwei-0115y.qe.gcp.devcluster.openshift.com."] already exists in DNS Zone (openshift-qe/qe) and might be in use by another cluster, please remove it to continue 
      $ 
      
      (3) delete the second cluster
      
      $ openshift-install destroy cluster --dir test1
      INFO Credentials loaded from file "/home/fedora/.gcp/osServiceAccount.json" 
      INFO Deleted 2 recordset(s) in zone qe            
      INFO Deleted 3 recordset(s) in zone jiwei-0115y-lgns8-private-zone 
      WARNING Skipping deletion of DNS Zone jiwei-0115y-lgns8-private-zone, not created by installer 
      INFO Time elapsed: 37s                            
      INFO Uninstallation complete!                     
      $ 
      
      (4) check the first cluster status and the dns record-sets
      
      $ oc get clusterversion
      Unable to connect to the server: dial tcp: lookup api.jiwei-0115y.qe.gcp.devcluster.openshift.com on 10.11.5.160:53: no such host
      $
      $ gcloud dns managed-zones describe jiwei-0115y-lgns8-private-zone
      cloudLoggingConfig:
        kind: dns#managedZoneCloudLoggingConfig
      creationTime: '2024-01-15T07:22:55.199Z'
      description: Created By OpenShift Installer
      dnsName: jiwei-0115y.qe.gcp.devcluster.openshift.com.
      id: '9193862213315831261'
      kind: dns#managedZone
      labels:
        kubernetes-io-cluster-jiwei-0115y-lgns8: owned
      name: jiwei-0115y-lgns8-private-zone
      nameServers:
      - ns-gcp-private.googledomains.com.
      privateVisibilityConfig:
        kind: dns#managedZonePrivateVisibilityConfig
        networks:
        - kind: dns#managedZonePrivateVisibilityConfigNetwork
          networkUrl: https://www.googleapis.com/compute/v1/projects/openshift-qe/global/networks/jiwei-0115y-lgns8-network
      visibility: private
      $ gcloud dns record-sets list --zone jiwei-0115y-lgns8-private-zone
      NAME                                          TYPE  TTL    DATA
      jiwei-0115y.qe.gcp.devcluster.openshift.com.  NS    21600  ns-gcp-private.googledomains.com.
      jiwei-0115y.qe.gcp.devcluster.openshift.com.  SOA   21600  ns-gcp-private.googledomains.com. cloud-dns-hostmaster.google.com. 1 21600 3600 259200 300
      $ gcloud dns record-sets list --zone qe --filter='name~jiwei-0115y'
      Listed 0 items.
      $ 
      


            Errata Tool added a comment -

            Since the problem described in this issue should be resolved in a recent advisory, it has been closed.

            For information on the advisory (Critical: OpenShift Container Platform 4.16.0 bug fix and security update), and where to find the updated files, follow the link below.

            If the solution does not work for you, open a new bug report.
            https://access.redhat.com/errata/RHSA-2024:0041

            Jianli Wei added a comment -

            Verified with 4.16.0-0.nightly-2024-02-29-062601.

            The verification steps:

            (1) the existing cluster with cluster name "jliu415"

            $ oc get clusterversion
            NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
            version   4.15.0    True        False         3h3m    Cluster version is 4.15.0
            $ oc get nodes
            NAME                                                           STATUS   ROLES                  AGE     VERSION
            jliu415-94lzh-master-0.us-central1-a.c.openshift-qe.internal   Ready    control-plane,master   3h29m   v1.28.6+6216ea1
            jliu415-94lzh-master-1.us-central1-b.c.openshift-qe.internal   Ready    control-plane,master   3h29m   v1.28.6+6216ea1
            jliu415-94lzh-master-2.us-central1-c.c.openshift-qe.internal   Ready    control-plane,master   3h29m   v1.28.6+6216ea1
            jliu415-94lzh-worker-a-t9qt9                                   Ready    worker                 3h11m   v1.28.6+6216ea1
            jliu415-94lzh-worker-b-grmg8                                   Ready    worker                 3h14m   v1.28.6+6216ea1
            jliu415-94lzh-worker-c-q6lzv                                   Ready    worker                 3h14m   v1.28.6+6216ea1
            $ oc get co
            NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
            authentication                             4.15.0    True        False         False      3h3m
            baremetal                                  4.15.0    True        False         False      3h23m
            cloud-controller-manager                   4.15.0    True        False         False      3h28m
            cloud-credential                           4.15.0    True        False         False      3h35m
            cluster-autoscaler                         4.15.0    True        False         False      3h23m
            config-operator                            4.15.0    True        False         False      3h24m
            console                                    4.15.0    True        False         False      3h7m
            control-plane-machine-set                  4.15.0    True        False         False      3h13m
            csi-snapshot-controller                    4.15.0    True        False         False      3h24m
            dns                                        4.15.0    True        False         False      3h22m
            etcd                                       4.15.0    True        False         False      3h19m
            image-registry                             4.15.0    True        False         False      3h10m
            ingress                                    4.15.0    True        False         False      3h11m
            insights                                   4.15.0    True        False         False      3h17m
            kube-apiserver                             4.15.0    True        False         False      3h13m
            kube-controller-manager                    4.15.0    True        False         False      3h19m
            kube-scheduler                             4.15.0    True        False         False      3h17m
            kube-storage-version-migrator              4.15.0    True        False         False      3h24m
            machine-api                                4.15.0    True        False         False      3h12m
            machine-approver                           4.15.0    True        False         False      3h23m
            machine-config                             4.15.0    True        False         False      3h19m
            marketplace                                4.15.0    True        False         False      3h23m
            monitoring                                 4.15.0    True        False         False      3h3m
            network                                    4.15.0    True        False         False      3h25m
            node-tuning                                4.15.0    True        False         False      3h11m
            openshift-apiserver                        4.15.0    True        False         False      3h11m
            openshift-controller-manager               4.15.0    True        False         False      3h23m
            openshift-samples                          4.15.0    True        False         False      3h10m
            operator-lifecycle-manager                 4.15.0    True        False         False      3h23m
            operator-lifecycle-manager-catalog         4.15.0    True        False         False      3h23m
            operator-lifecycle-manager-packageserver   4.15.0    True        False         False      3h11m
            service-ca                                 4.15.0    True        False         False      3h24m
            storage                                    4.15.0    True        False         False      3h23m
            $  

            (2) try to create another cluster with the same cluster name, which fails as expected

            $ openshift-install version
            openshift-install 4.16.0-0.nightly-2024-02-29-062601
            built from commit b8d567115ff6a93ad048808fa4c8a52eb78fe954
            release image registry.ci.openshift.org/ocp/release@sha256:17e26f77d2b5cb553cbc07c6322062ef9a4162a89e89c785522c1021be83725c
            release architecture amd64
            $ yq-3.3.0 r test2/install-config.yaml platform
            gcp:
              projectID: openshift-qe
              region: us-central1
            $ yq-3.3.0 r test2/install-config.yaml metadata
            creationTimestamp: null
            name: jliu415
            $ yq-3.3.0 r test2/install-config.yaml baseDomain
            qe.gcp.devcluster.openshift.com
            $ openshift-install create cluster --dir test2
            INFO Credentials loaded from file "/home/fedora/.gcp/osServiceAccount.json" 
            WARNING Making control-plane schedulable by setting MastersSchedulable to true for Scheduler cluster settings 
            INFO Consuming Install Config from target directory 
            FATAL failed to fetch Terraform Variables: failed to fetch dependency of "Terraform Variables": failed to generate asset "Platform Provisioning Check": metadata.name: Invalid value: "jliu415": record(s) ["api.jliu415.qe.gcp.devcluster.openshift.com."] already exists in DNS Zone (openshift-qe/qe) and might be in use by another cluster, please remove it to continue 
            $  

            (3) destroy the above cluster

            $ openshift-install destroy cluster --dir test2
            INFO Credentials loaded from file "/home/fedora/.gcp/osServiceAccount.json" 
            INFO Time elapsed: 41s                            
            INFO Uninstallation complete!                     
            $  

            (4) make sure the cluster from step 1 is still healthy

            $ oc get clusterversion
            NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
            version   4.15.0    True        False         3h8m    Cluster version is 4.15.0
            $ oc get nodes
            NAME                                                           STATUS   ROLES                  AGE     VERSION
            jliu415-94lzh-master-0.us-central1-a.c.openshift-qe.internal   Ready    control-plane,master   3h35m   v1.28.6+6216ea1
            jliu415-94lzh-master-1.us-central1-b.c.openshift-qe.internal   Ready    control-plane,master   3h34m   v1.28.6+6216ea1
            jliu415-94lzh-master-2.us-central1-c.c.openshift-qe.internal   Ready    control-plane,master   3h35m   v1.28.6+6216ea1
            jliu415-94lzh-worker-a-nj65z                                   Ready    worker                 2m8s    v1.28.6+6216ea1
            jliu415-94lzh-worker-b-grmg8                                   Ready    worker                 3h19m   v1.28.6+6216ea1
            jliu415-94lzh-worker-c-q6lzv                                   Ready    worker                 3h19m   v1.28.6+6216ea1
            $ 
            $ gcloud dns record-sets list --zone jliu415-94lzh-private-zone
            NAME                                              TYPE  TTL    DATA
            jliu415.qe.gcp.devcluster.openshift.com.          NS    21600  ns-gcp-private.googledomains.com.
            jliu415.qe.gcp.devcluster.openshift.com.          SOA   21600  ns-gcp-private.googledomains.com. cloud-dns-hostmaster.google.com. 1 21600 3600 259200 300
            api.jliu415.qe.gcp.devcluster.openshift.com.      A     60     10.0.0.4
            api-int.jliu415.qe.gcp.devcluster.openshift.com.  A     60     10.0.0.4
            *.apps.jliu415.qe.gcp.devcluster.openshift.com.   A     30     35.239.27.32
            $  


            OpenShift Jira Bot added a comment -

            Hi rh-ee-bbarbach,

            Bugs should not be moved to Verified without first providing a Release Note Type ("Bug Fix" or "No Doc Update"), and for type "Bug Fix" the Release Note Text must also be provided. Please populate the necessary fields before moving the bug to Verified.

            Brent Barbachem added a comment -

            Rejected release blocker since there is at least one workaround: users can fully delete the cluster before attempting a new install, or they can change the name of the cluster.
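
            For reference, a sketch of both workarounds using the naming from this report, with placeholders left as placeholders:

              # Workaround A: fully destroy the original cluster from its own
              # asset directory before reusing the cluster name
              openshift-install destroy cluster --dir <original-cluster-dir>

              # Workaround B: pick a different metadata.name in
              # install-config.yaml, then confirm the base zone holds no
              # record-sets for the new name
              gcloud dns record-sets list --zone qe --filter='name~<new-cluster-name>'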

            Rafael Fonseca dos Santos added a comment - edited

            rhn-support-jiwei my best bet about the issue in [1] is https://github.com/openshift/installer/pull/7924. A new flag is used during the bootstrap bootkube run, and it might not be available with older images. I'd expect a line like this in the bootkube log:

            Jan 23 09:46:39 ip-10-0-185-31 bootkube.sh[3412]: flag provided but not defined: -cluster-profile

            OpenShift Jira Bot added a comment -

            Looks like this bug is far enough along in the workflow that a code fix is ready. Customers and support need to know the backport plan. Please complete the "Target Backport Versions" field to indicate which version(s) will receive the fix.

            Patrick Dillon added a comment -

            AWS checks the private hosted zone (which can be tagged) to ensure that the tag has the infra ID for the cluster intended to be destroyed. We should look to see if we can use this same logic to prevent this scenario.
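
            A minimal sketch of that ownership check adapted to GCP, assuming the kubernetes-io-cluster-<infraID> label visible on the private zone earlier in this report (illustrative shell; the installer would implement this in Go):

              INFRA_ID="jiwei-0115y-lgns8"
              PRIVATE_ZONE="${INFRA_ID}-private-zone"
              # Proceed with record deletion only if the zone carries the
              # ownership label for this cluster's infra ID, mirroring the
              # AWS tag check.
              owned=$(gcloud dns managed-zones describe "${PRIVATE_ZONE}" --format=json \
                  | jq -r --arg k "kubernetes-io-cluster-${INFRA_ID}" '.labels[$k] // empty')
              if [ "${owned}" != "owned" ]; then
                  echo "zone ${PRIVATE_ZONE} is not labeled as owned by ${INFRA_ID}; skipping deletion" >&2
              fi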
