OpenShift Bugs / OCPBUGS-35898

[CAPI] Installation in us-gov-west-1 region failed because api lb DNS resolver timeout


    • Important
    • No
    • Installer Sprint 256, Installer Sprint 257, Installer (PB) Sprint 258, Installer (PB) Sprint 259, Installer Sprint 260, Installer Sprint 261
    • 6
    • Rejected
    • False
    • What: In some cases the Load Balancer DNS name resolution check would happen too early, causing the install to time out.
      Fix: Wait for the DNS name to propagate before attempting to resolve it.
    • Bug Fix
    • In Progress

      Description of problem:

      Installation in the us-gov-west-1 region fails because the DNS TTL seen for the API load balancer name in us-gov-west-1 is 15 minutes, and the installer does not wait longer than 15 minutes for that name to resolve.

      Version-Release number of selected component (if applicable):

      4.16.0-0.nightly-2024-06-20-005834

      How reproducible:

      Always on local testing (not in prow)

      Steps to Reproduce:

      1. Install a cluster in us-gov-west-1 region

      Actual results:

      06-21 09:01:59.479  level=debug msg=E0621 01:01:59.438140    1914 awscluster_controller.go:293] "failed to get IP address for dns name" err="lookup yunjiang-21gov-j8hs8-int-cdb3e502b39f43af.elb.us-gov-west-1.amazonaws.com on 172.27.0.10:53: no such host" controller="awscluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSCluster" AWSCluster="openshift-cluster-api-guests/yunjiang-21gov-j8hs8" namespace="openshift-cluster-api-guests" name="yunjiang-21gov-j8hs8" reconcileID="7754d345-a5de-4617-9a48-221adc938618" cluster="openshift-cluster-api-guests/yunjiang-21gov-j8hs8" dns="yunjiang-21gov-j8hs8-int-cdb3e502b39f43af.elb.us-gov-west-1.amazonaws.com"
      06-21 09:01:59.479  level=debug msg=I0621 01:01:59.438178    1914 awscluster_controller.go:295] "Waiting on API server ELB DNS name to resolve" controller="awscluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSCluster" AWSCluster="openshift-cluster-api-guests/yunjiang-21gov-j8hs8" namespace="openshift-cluster-api-guests" name="yunjiang-21gov-j8hs8" reconcileID="7754d345-a5de-4617-9a48-221adc938618" cluster="openshift-cluster-api-guests/yunjiang-21gov-j8hs8"
      ...
      ...
      06-21 09:16:21.049  level=debug msg=E0621 01:16:20.908679    1914 awscluster_controller.go:293] "failed to get IP address for dns name" err="lookup yunjiang-21gov-j8hs8-int-cdb3e502b39f43af.elb.us-gov-west-1.amazonaws.com on 172.27.0.10:53: no such host" controller="awscluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSCluster" AWSCluster="openshift-cluster-api-guests/yunjiang-21gov-j8hs8" namespace="openshift-cluster-api-guests" name="yunjiang-21gov-j8hs8" reconcileID="003be973-1e73-4fe8-88e8-894995d5517c" cluster="openshift-cluster-api-guests/yunjiang-21gov-j8hs8" dns="yunjiang-21gov-j8hs8-int-cdb3e502b39f43af.elb.us-gov-west-1.amazonaws.com"
      06-21 09:16:21.050  level=debug msg=I0621 01:16:20.908716    1914 awscluster_controller.go:295] "Waiting on API server ELB DNS name to resolve" controller="awscluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSCluster" AWSCluster="openshift-cluster-api-guests/yunjiang-21gov-j8hs8" namespace="openshift-cluster-api-guests" name="yunjiang-21gov-j8hs8" reconcileID="003be973-1e73-4fe8-88e8-894995d5517c" cluster="openshift-cluster-api-guests/yunjiang-21gov-j8hs8"
      06-21 09:16:21.990  level=debug msg=Collecting applied cluster api manifests...
      06-21 09:16:21.990  level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: infrastructure was not ready within 15m0s: client rate limiter Wait returned an error: context deadline exceeded

      Judging by the timestamps, the installer waited about 15 minutes for the LB DNS name to resolve.

      https://github.com/openshift/installer/blob/master/cluster-api/providers/aws/vendor/sigs.k8s.io/cluster-api-provider-aws/v2/controllers/awscluster_controller.go#L270

      Unfortunately, the local resolver's cache TTL is also 15 minutes, so after any slight delay the cached negative answer outlives the installer's wait and the installation fails.

      Expected results:

      Cluster installation in the us-gov-west-1 region succeeds.

      Additional info:

      Here is my local test: create an LB in the us-gov-west-1 region, record its DNS name, and immediately check whether it can be resolved locally.
      
      $ dig jialiu-735f8ebd8bc9a14e.elb.us-gov-west-1.amazonaws.com
      ; <<>> DiG 9.11.36-RedHat-9.11.36-5.el8_7.2 <<>> jialiu-735f8ebd8bc9a14e.elb.us-gov-west-1.amazonaws.com
      ;; global options: +cmd
      ;; Got answer:
      ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 27112
      ;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
      ;; OPT PSEUDOSECTION:
      ; EDNS: version: 0, flags:; udp: 1220
      ; COOKIE: 2caaa84d7bd004acf1e8a4aa667508daacf8365c3d922552 (good)
      ;; QUESTION SECTION:
      ;jialiu-735f8ebd8bc9a14e.elb.us-gov-west-1.amazonaws.com. IN A
      ;; AUTHORITY SECTION:
      elb.us-gov-west-1.amazonaws.com. 900 IN    SOA    ns-1151.awsdns-15.org. awsdns-hostmaster.amazon.com. 1 7200 900 1209600 60
      ;; Query time: 12 msec
      ;; SERVER: 10.11.5.160#53(10.11.5.160)
      ;; WHEN: Fri Jun 21 01:00:10 EDT 2024
      ;; MSG SIZE  rcvd: 194
      
      The output above shows a TTL of 900 seconds (15 minutes) on the NXDOMAIN answer, so the local resolver caches the negative result and answers every subsequent query for the name with “no such host” until that TTL expires.
      
      I also ran the same test in the us-gov-east-1 region, where the DNS TTL is 60 seconds.
      
      $ dig jialiu2test-d3b48638d23e6c40.elb.us-gov-east-1.amazonaws.com
      ; <<>> DiG 9.11.36-RedHat-9.11.36-5.el8_7.2 <<>> jialiu2test-d3b48638d23e6c40.elb.us-gov-east-1.amazonaws.com
      ;; global options: +cmd
      ;; Got answer:
      ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 15111
      ;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
      ;; OPT PSEUDOSECTION:
      ; EDNS: version: 0, flags:; udp: 1220
      ; COOKIE: b9533f9fdbdece78d4c4502d66750b09619c5c34294a8093 (good)
      ;; QUESTION SECTION:
      ;jialiu2test-d3b48638d23e6c40.elb.us-gov-east-1.amazonaws.com. IN A
      ;; AUTHORITY SECTION:
      elb.us-gov-east-1.amazonaws.com. 60 IN    SOA    ns-604.awsdns-11.net. awsdns-hostmaster.amazon.com. 1 7200 900 1209600 60
      ;; Query time: 11 msec
      ;; SERVER: 10.11.5.160#53(10.11.5.160)
      ;; WHEN: Fri Jun 21 01:09:29 EDT 2024
      ;; MSG SIZE  rcvd: 198
      
      That explains why installs in us-gov-east-1 always succeed.
      
      In fact, in both us-gov-east-1 and us-gov-west-1, a newly created DNS name takes 1~2 minutes to propagate to the internet and then to the local resolver.
      
      So another improvement that would speed up the install is to sleep 120 seconds before the first query for the new DNS name.
      
      So there are maybe two places where we can fix the issue:
      1. Change retryAfterDuration from 15 to 16 (or 20) at https://github.com/openshift/installer/blob/master/cluster-api/providers/aws/vendor/sigs.k8s.io/cluster-api-provider-aws/v2/controllers/awscluster_controller.go#L270
      2. Sleep 120s before the first query at https://github.com/openshift/installer/blob/master/cluster-api/providers/aws/vendor/sigs.k8s.io/cluster-api-provider-aws/v2/controllers/awscluster_controller.go#L292

              rdossant Rafael Fonseca dos Santos
              jialiu@redhat.com Johnny Liu
              Johnny Liu Johnny Liu