-
Bug
-
Resolution: Duplicate
-
Critical
-
None
-
4.16, 4.17
Description of problem:
Installation in the us-gov-west-1 region failed. The TTL on the API LB DNS record in us-gov-west-1 is 15 minutes, and the installer waits no longer than that for the name to resolve, so the install times out.
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-06-20-005834
How reproducible:
Always in local testing (not in Prow)
Steps to Reproduce:
1. Install a cluster in the us-gov-west-1 region.
Actual results:
06-21 09:01:59.479 level=debug msg=E0621 01:01:59.438140 1914 awscluster_controller.go:293] "failed to get IP address for dns name" err="lookup yunjiang-21gov-j8hs8-int-cdb3e502b39f43af.elb.us-gov-west-1.amazonaws.com on 172.27.0.10:53: no such host" controller="awscluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSCluster" AWSCluster="openshift-cluster-api-guests/yunjiang-21gov-j8hs8" namespace="openshift-cluster-api-guests" name="yunjiang-21gov-j8hs8" reconcileID="7754d345-a5de-4617-9a48-221adc938618" cluster="openshift-cluster-api-guests/yunjiang-21gov-j8hs8" dns="yunjiang-21gov-j8hs8-int-cdb3e502b39f43af.elb.us-gov-west-1.amazonaws.com"
06-21 09:01:59.479 level=debug msg=I0621 01:01:59.438178 1914 awscluster_controller.go:295] "Waiting on API server ELB DNS name to resolve" controller="awscluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSCluster" AWSCluster="openshift-cluster-api-guests/yunjiang-21gov-j8hs8" namespace="openshift-cluster-api-guests" name="yunjiang-21gov-j8hs8" reconcileID="7754d345-a5de-4617-9a48-221adc938618" cluster="openshift-cluster-api-guests/yunjiang-21gov-j8hs8"
... ...
06-21 09:16:21.049 level=debug msg=E0621 01:16:20.908679 1914 awscluster_controller.go:293] "failed to get IP address for dns name" err="lookup yunjiang-21gov-j8hs8-int-cdb3e502b39f43af.elb.us-gov-west-1.amazonaws.com on 172.27.0.10:53: no such host" controller="awscluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSCluster" AWSCluster="openshift-cluster-api-guests/yunjiang-21gov-j8hs8" namespace="openshift-cluster-api-guests" name="yunjiang-21gov-j8hs8" reconcileID="003be973-1e73-4fe8-88e8-894995d5517c" cluster="openshift-cluster-api-guests/yunjiang-21gov-j8hs8" dns="yunjiang-21gov-j8hs8-int-cdb3e502b39f43af.elb.us-gov-west-1.amazonaws.com"
06-21 09:16:21.050 level=debug msg=I0621 01:16:20.908716 1914 awscluster_controller.go:295] "Waiting on API server ELB DNS name to resolve" controller="awscluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSCluster" AWSCluster="openshift-cluster-api-guests/yunjiang-21gov-j8hs8" namespace="openshift-cluster-api-guests" name="yunjiang-21gov-j8hs8" reconcileID="003be973-1e73-4fe8-88e8-894995d5517c" cluster="openshift-cluster-api-guests/yunjiang-21gov-j8hs8"
06-21 09:16:21.990 level=debug msg=Collecting applied cluster api manifests...
06-21 09:16:21.990 level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: infrastructure was not ready within 15m0s: client rate limiter Wait returned an error: context deadline exceeded
The timestamps show that the installer waited about 15 minutes for the LB DNS name to resolve.
Unfortunately, the local resolver also caches the failed lookup for 15 minutes, so any slight delay causes the installation to fail.
Expected results:
Cluster installation in the us-gov-west-1 region succeeds.
Additional info:
In my local test, I created an LB in the us-gov-west-1 region, recorded its DNS name, and immediately checked whether it resolved locally.

$ dig jialiu-735f8ebd8bc9a14e.elb.us-gov-west-1.amazonaws.com
; <<>> DiG 9.11.36-RedHat-9.11.36-5.el8_7.2 <<>> jialiu-735f8ebd8bc9a14e.elb.us-gov-west-1.amazonaws.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 27112
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1220
; COOKIE: 2caaa84d7bd004acf1e8a4aa667508daacf8365c3d922552 (good)
;; QUESTION SECTION:
;jialiu-735f8ebd8bc9a14e.elb.us-gov-west-1.amazonaws.com. IN A
;; AUTHORITY SECTION:
elb.us-gov-west-1.amazonaws.com. 900 IN SOA ns-1151.awsdns-15.org. awsdns-hostmaster.amazon.com. 1 7200 900 1209600 60
;; Query time: 12 msec
;; SERVER: 10.11.5.160#53(10.11.5.160)
;; WHEN: Fri Jun 21 01:00:10 EDT 2024
;; MSG SIZE rcvd: 194

In the output above, the TTL that the local resolver caches is 900 seconds (15 minutes). Every subsequent lookup of the name is therefore answered with “no such host” from the resolver cache until the TTL expires.

I ran the same test in the us-gov-east-1 region, where the TTL is 60 seconds.

$ dig jialiu2test-d3b48638d23e6c40.elb.us-gov-east-1.amazonaws.com
; <<>> DiG 9.11.36-RedHat-9.11.36-5.el8_7.2 <<>> jialiu2test-d3b48638d23e6c40.elb.us-gov-east-1.amazonaws.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 15111
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1220
; COOKIE: b9533f9fdbdece78d4c4502d66750b09619c5c34294a8093 (good)
;; QUESTION SECTION:
;jialiu2test-d3b48638d23e6c40.elb.us-gov-east-1.amazonaws.com. IN A
;; AUTHORITY SECTION:
elb.us-gov-east-1.amazonaws.com. 60 IN SOA ns-604.awsdns-11.net. awsdns-hostmaster.amazon.com. 1 7200 900 1209600 60
;; Query time: 11 msec
;; SERVER: 10.11.5.160#53(10.11.5.160)
;; WHEN: Fri Jun 21 01:09:29 EDT 2024
;; MSG SIZE rcvd: 198

That explains why installs in us-gov-east-1 always succeed.

In both us-gov-east-1 and us-gov-west-1, a newly created DNS name takes 1~2 minutes to propagate to the internet and then to the local resolver. Another way to speed up the install is therefore to sleep 120 seconds before the first query to the new DNS name, so that the first lookup is not made while the name still cannot resolve.

So there are two places where we could fix the issue:
1. Change retryAfterDuration from 15 to 16 (or 20) at https://github.com/openshift/installer/blob/master/cluster-api/providers/aws/vendor/sigs.k8s.io/cluster-api-provider-aws/v2/controllers/awscluster_controller.go#L270
2. Sleep 120s before the first query at https://github.com/openshift/installer/blob/master/cluster-api/providers/aws/vendor/sigs.k8s.io/cluster-api-provider-aws/v2/controllers/awscluster_controller.go#L292
- duplicates
-
OCPBUGS-36222 AWS Installs Fail when Installer Host cannot resolve LB DNS Name
- Verified
- relates to
-
OCPBUGS-36222 AWS Installs Fail when Installer Host cannot resolve LB DNS Name
- Verified