-
Bug
-
Resolution: Not a Bug
-
Normal
-
None
-
4.16.0
-
+
-
Important
-
No
-
False
-
Description of problem:
OCP cannot deploy workers because machine-controller resolves the OSP API domain to the wrong IP address:
[stack@undercloud-0 ~]$ oc -c machine-controller logs machine-api-controllers-869dfb489f-cj46t
[...]
E0701 11:38:38.094182 1 controller.go:329] "msg"="Reconciler error" "error"="failed to get InstanceService: Failed to authenticate provider client: Get \"https://overcloud.redhat.local:13000/\": dial tcp 13.248.169.48:13000: connect: connection timed out" "MachineSet"={"name":"ostest-jl84g-worker-0","namespace":"openshift-machine-api"} "controller"="machineset" "controllerGroup"="machine.openshift.io" "controllerKind"="MachineSet" "name"="ostest-jl84g-worker-0" "namespace"="openshift-machine-api" "reconcileID"="02fe9c4b-5206-4b42-971a-b2c02efa22d6"
As a consequence, no workers are deployed.
The IP address that overcloud.redhat.local resolves to is wrong: it should be 10.46.43.94, not 13.248.169.48.
After connecting to the machine-controller pod, we observed that the wrong IP address is obtained when the suffix "shiftstack.com" is appended to the hostname:
(shiftstack) [stack@undercloud-0 ~]$ oc -c machine-controller rsh machine-api-controllers-869dfb489f-cj46t
sh-5.1$ dig overcloud.redhat.local

; <<>> DiG 9.16.23-RH <<>> overcloud.redhat.local
;; global options: +cmd
;; Got answer:
;; WARNING: .local is reserved for Multicast DNS
;; You are currently testing what happens when an mDNS query is leaked to DNS
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 57703
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: 6011641699a9d37e (echoed)
;; QUESTION SECTION:
;overcloud.redhat.local. IN A

;; ANSWER SECTION:
overcloud.redhat.local. 30 IN A 10.46.43.94 // <-- Correct IP

;; Query time: 59 msec
;; SERVER: 172.30.0.10#53(172.30.0.10)
;; WHEN: Mon Jul 01 11:42:25 UTC 2024
;; MSG SIZE rcvd: 101

sh-5.1$ dig overcloud.redhat.local.shiftstack.com

; <<>> DiG 9.16.23-RH <<>> overcloud.redhat.local.shiftstack.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 58703
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: fd9fd565b45a0930 (echoed)
;; QUESTION SECTION:
;overcloud.redhat.local.shiftstack.com. IN A

;; ANSWER SECTION:
overcloud.redhat.local.shiftstack.com. 30 IN A 76.223.54.146
overcloud.redhat.local.shiftstack.com. 30 IN A 13.248.169.48 // <-- Wrong IP

;; Query time: 4 msec
;; SERVER: 172.30.0.10#53(172.30.0.10)
;; WHEN: Mon Jul 01 11:42:33 UTC 2024
;; MSG SIZE rcvd: 184
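To illustrate how a suffix such as "shiftstack.com" can end up being tried before the bare name, here is a minimal Python sketch of the search-list expansion that glibc-style stub resolvers perform. It is not taken from machine-controller. The function name `candidate_names` is hypothetical, and `ndots=5` is an assumption based on the usual Kubernetes pod default; the pod's actual /etc/resolv.conf was not captured in this report.

```python
def candidate_names(name, search, ndots=5):
    """Return the names a glibc-style stub resolver would try, in order.

    `search` is the resolv.conf search list. `ndots=5` is assumed here
    (the usual Kubernetes pod default), not read from the affected pod.
    """
    if name.endswith("."):
        # A trailing dot marks the name as fully qualified: no search list.
        return [name]
    with_search = [f"{name}.{domain}" for domain in search]
    absolute = [name + "."]
    if name.count(".") >= ndots:
        # Enough dots: try the name as-is first, then the search list.
        return absolute + with_search
    # Too few dots: the search list is tried before the bare name.
    return with_search + absolute


# Hypothetical search list; only "shiftstack.com" is confirmed by the report.
for candidate in candidate_names("overcloud.redhat.local", ["shiftstack.com"]):
    print(candidate)
```

Under these assumptions, overcloud.redhat.local has only two dots, so the suffixed name is queried first. Because overcloud.redhat.local.shiftstack.com returns an answer, the resolver would stop there and never query the bare name.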
Version-Release number of selected component (if applicable):
4.16.0 RHOS-17.1-RHEL-9-20240516.n.1
The last time we saw it pass was with (RHOS-17.1-RHEL-9-20240516.n.1, 4.16.0-rc.5). Therefore, it looks like the bug was introduced at some point between 4.16.0-rc.5 and 4.16.0.
How reproducible:
Always, on OSP deployed with the TLS-Everywhere (TLS-E) feature enabled (passed_phase2 in D/S CI).
Note: We observe this only on TLS-E environments because there the OSP API endpoint is a domain name that must be resolved. Regular SSL jobs use IP addresses directly, so nothing needs to be resolved, and those jobs work fine.
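The difference between the two job types can be sketched as a check on the endpoint URL: only a hostname goes through DNS (and therefore the search list), while an IP literal is dialed directly. This is a minimal Python sketch; `needs_dns` is a hypothetical helper, and the IP-based URL is an illustrative assumption, since the report does not show the endpoint used by the regular SSL jobs.

```python
import ipaddress
from urllib.parse import urlsplit


def needs_dns(url):
    """Return True if the URL's host is a name that must be resolved."""
    host = urlsplit(url).hostname
    if host is None:
        raise ValueError(f"no host in URL: {url!r}")
    try:
        # Succeeds for IPv4 and IPv6 literals, which are dialed directly.
        ipaddress.ip_address(host)
    except ValueError:
        return True
    return False


# Endpoint from the machine-controller log (TLS-E environment).
print(needs_dns("https://overcloud.redhat.local:13000/"))
# Hypothetical IP-based endpoint, as a regular SSL job might use.
print(needs_dns("https://10.46.43.94:13000/"))
```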
Steps to Reproduce:
1. Run passed_phase2 in OSP D/S CI
Actual results:
machine-controller cannot contact the OSP API and therefore cannot deploy workers. The cluster is not operational.
Expected results:
machine-controller can contact the OSP API and therefore can deploy workers.
Additional info:
must-gather provided in a private comment.