[OCPBUGS-36384] machine-controller wrongly resolving OSP domain - Red Hat Issue Tracker

Type: Bug
Resolution: Not a Bug
Priority: Normal
Fix Version/s: None
Affects Version/s: 4.16.0
Component/s: Machine Config Operator / platform-openstack
Labels:
- Triaged

Severity: Important
Regression: No
Blocked: False
Blocked Reason: None


Description of problem:

OCP cannot deploy workers because machine-controller resolves the OSP API hostname to the wrong IP address:

[stack@undercloud-0 ~]$  oc -c machine-controller logs machine-api-controllers-869dfb489f-cj46t
[...]
E0701 11:38:38.094182       1 controller.go:329]  "msg"="Reconciler error" "error"="failed to get InstanceService: Failed to authenticate provider client: Get \"https://overcloud.redhat.local:13000/\": dial tcp 13.248.169.48:13000: connect: connection timed out" "MachineSet"={"name":"ostest-jl84g-worker-0","namespace":"openshift-machine-api"} "controller"="machineset" "controllerGroup"="machine.openshift.io" "controllerKind"="MachineSet" "name"="ostest-jl84g-worker-0" "namespace"="openshift-machine-api" "reconcileID"="02fe9c4b-5206-4b42-971a-b2c02efa22d6"

As a consequence, no workers are deployed.

The IP that overcloud.redhat.local resolves to is wrong. It should be 10.46.43.94, not 13.248.169.48.
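To make the mismatch easy to check against other log lines, here is a minimal sketch that pulls the requested URL and the dialed address out of a "dial tcp" error and compares the address with the expected one. The helper name `dialed_address` and the regular expression are illustrative assumptions, not part of machine-controller; the pattern only handles IPv4 addresses and the error wording shown in the log above.

```python
import re
from urllib.parse import urlsplit

# Matches: Get \"<url>\": dial tcp <ipv4>:<port>
# The backslashes before the quotes are optional so unescaped logs also match.
DIAL_ERROR = re.compile(
    r'Get \\?"(?P<url>[^"\\]+)\\?": dial tcp (?P<ip>[\d.]+):(?P<port>\d+)'
)

# Excerpt of the error field from the log line in this report.
LINE = r'"error"="failed to get InstanceService: Failed to authenticate provider client: Get \"https://overcloud.redhat.local:13000/\": dial tcp 13.248.169.48:13000: connect: connection timed out"'

# Expected address of the OSP API, as stated in this report.
EXPECTED_IP = "10.46.43.94"


def dialed_address(line):
    """Return (hostname, dialed_ip) from a 'dial tcp' error, or None."""
    match = DIAL_ERROR.search(line)
    if match is None:
        return None
    return urlsplit(match.group("url")).hostname, match.group("ip")


host, ip = dialed_address(LINE)
if ip != EXPECTED_IP:
    print(f"{host} was dialed at {ip}, expected {EXPECTED_IP}")
```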

After connecting to the machine-controller pod, we observed that the wrong IP is the one returned when the suffix "shiftstack.com" is appended to the hostname:

(shiftstack) [stack@undercloud-0 ~]$ oc -c machine-controller rsh machine-api-controllers-869dfb489f-cj46t  
sh-5.1$ dig overcloud.redhat.local
; <<>> DiG 9.16.23-RH <<>> overcloud.redhat.local
;; global options: +cmd
;; Got answer:
;; WARNING: .local is reserved for Multicast DNS
;; You are currently testing what happens when an mDNS query is leaked to DNS
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 57703
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: 6011641699a9d37e (echoed)
;; QUESTION SECTION:
;overcloud.redhat.local.                IN      A
;; ANSWER SECTION:
overcloud.redhat.local. 30      IN      A       10.46.43.94 // <-- Correct IP
;; Query time: 59 msec
;; SERVER: 172.30.0.10#53(172.30.0.10)
;; WHEN: Mon Jul 01 11:42:25 UTC 2024
;; MSG SIZE  rcvd: 101


sh-5.1$ dig overcloud.redhat.local.shiftstack.com
; <<>> DiG 9.16.23-RH <<>> overcloud.redhat.local.shiftstack.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 58703
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: fd9fd565b45a0930 (echoed)
;; QUESTION SECTION:
;overcloud.redhat.local.shiftstack.com. IN A
;; ANSWER SECTION:
overcloud.redhat.local.shiftstack.com. 30 IN A  76.223.54.146
overcloud.redhat.local.shiftstack.com. 30 IN A  13.248.169.48 // <-- Wrong IP
;; Query time: 4 msec
;; SERVER: 172.30.0.10#53(172.30.0.10)
;; WHEN: Mon Jul 01 11:42:33 UTC 2024
;; MSG SIZE  rcvd: 184
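One possible explanation for the two results above is resolv.conf search-list expansion: dig does not apply the search list unless asked to (+search), whereas a stub resolver used by an application normally does. The sketch below models the usual ordering rule (names with fewer dots than `ndots` go through the search list before being tried as-is). The search list and `ndots` value are assumptions for illustration only; the pod's actual /etc/resolv.conf is not included in this report.

```python
def candidate_names(name, search, ndots):
    """Return the fully qualified names a stub resolver would try, in order."""
    # A trailing dot marks the name as absolute: the search list is skipped.
    if name.endswith("."):
        return [name]
    expanded = [f"{name}.{domain}." for domain in search]
    absolute = [f"{name}."]
    # Names with at least `ndots` dots are tried as-is first; all others
    # go through the search list first and are tried as-is last.
    if name.count(".") >= ndots:
        return absolute + expanded
    return expanded + absolute


# Hypothetical pod resolver settings (not taken from the report).
SEARCH = [
    "openshift-machine-api.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "shiftstack.com",
]
NDOTS = 5

for fqdn in candidate_names("overcloud.redhat.local", SEARCH, NDOTS):
    print(fqdn)
```

Under these assumptions, overcloud.redhat.local.shiftstack.com is queried before the bare name, so any record that answers for it (for example a wildcard) would be used first. Writing the hostname with a trailing dot would bypass the search list in this model.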

Version-Release number of selected component (if applicable):

4.16.0
RHOS-17.1-RHEL-9-20240516.n.1

The last time we saw it pass was with (RHOS-17.1-RHEL-9-20240516.n.1, 4.16.0-rc.5). Therefore, it looks like the bug was introduced at some point between 4.16.0-rc.5 and 4.16.0.

How reproducible:

Always, on OSP deployed with the TLS-Everywhere feature enabled (passed_phase2 in D/S CI).

Note: We observe this on TLS-E environments because there the OSP API endpoint is a domain name that needs to be resolved. On regular SSL jobs we use IP addresses directly, so nothing needs to be resolved, and those jobs work fine.
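The distinction in the note can be sketched as a small check of whether an endpoint URL's host is an IP literal (no DNS lookup needed) or a name (resolution required). The function name `needs_dns` is hypothetical, and the IP-based URL is only an illustration built from the address in this report, not the URL used by the regular SSL jobs.

```python
import ipaddress
from urllib.parse import urlsplit


def needs_dns(url):
    """Return True when the URL's host is a name rather than an IP literal."""
    host = urlsplit(url).hostname
    if host is None:
        raise ValueError(f"no host in URL: {url!r}")
    try:
        ipaddress.ip_address(host)
    except ValueError:
        # Not parseable as IPv4 or IPv6, so it must be resolved.
        return True
    return False


# TLS-E style endpoint from this report: a hostname, so DNS is involved.
print(needs_dns("https://overcloud.redhat.local:13000/"))
# Illustrative IP-based endpoint: no resolution step at all.
print(needs_dns("https://10.46.43.94:13000/"))
```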

Steps to Reproduce:

    1. Run passed_phase2 in OSP D/S CI

Actual results:

 machine-controller cannot contact the OSP API and therefore cannot deploy workers. The cluster is not operational.

Expected results:

 machine-controller can contact the OSP API and therefore deploy workers.

Additional info:

must-gather provided in a private comment.

Assignee: Martin André
Reporter: Ramón Lobillo
QA Contact: Itshak Brown
Votes: 0
Watchers: 5
Created: 2024/07/01 11:48 AM
Updated: 2024/07/04 7:19 AM
Resolved: 2024/07/03 1:47 PM
