  1. OpenShift Bugs
  2. OCPBUGS-36384

machine-controller wrongly resolving OSP domain


      Description of problem:

      OCP cannot deploy workers because machine-controller resolves the OSP domain to the wrong IP:

      [stack@undercloud-0 ~]$  oc -c machine-controller logs machine-api-controllers-869dfb489f-cj46t
      [...]
      E0701 11:38:38.094182       1 controller.go:329]  "msg"="Reconciler error" "error"="failed to get InstanceService: Failed to authenticate provider client: Get \"https://overcloud.redhat.local:13000/\": dial tcp 13.248.169.48:13000: connect: connection timed out" "MachineSet"={"name":"ostest-jl84g-worker-0","namespace":"openshift-machine-api"} "controller"="machineset" "controllerGroup"="machine.openshift.io" "controllerKind"="MachineSet" "name"="ostest-jl84g-worker-0" "namespace"="openshift-machine-api" "reconcileID"="02fe9c4b-5206-4b42-971a-b2c02efa22d6"

      As a consequence, no workers are deployed.

      overcloud.redhat.local is resolving to the wrong IP. It should be 10.46.43.94, not 13.248.169.48.
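
To spot this kind of mismatch in longer logs, the following minimal Python sketch extracts the requested host and the dialed address from an error line shaped like the one above. The regular expression and the function name `dialed_endpoint` are illustrative assumptions; they are not part of machine-controller, and the pattern only handles IPv4 addresses.

```python
import re
from urllib.parse import urlsplit

# Matches the error shape seen in the log above:
#   Get \"<url>\": dial tcp <ip>:<port>: ...
# The backslashes before the quotes are optional (they come from log escaping).
# ASSUMPTION: IPv4 only; this pattern is illustrative, not an official log format.
_DIAL_RE = re.compile(
    r'Get \\?"(?P<url>[^"\\]+)\\?": dial tcp (?P<ip>[0-9.]+):(?P<port>\d+)'
)


def dialed_endpoint(line):
    """Return (hostname, dialed_ip) from a dial error line, or None if absent."""
    m = _DIAL_RE.search(line)
    if m is None:
        return None
    return urlsplit(m.group("url")).hostname, m.group("ip")


# Excerpt of the error reported in this bug.
line = (
    r'"error"="failed to get InstanceService: Failed to authenticate provider '
    r'client: Get \"https://overcloud.redhat.local:13000/\": '
    r'dial tcp 13.248.169.48:13000: connect: connection timed out"'
)

host, ip = dialed_endpoint(line)
expected_ip = "10.46.43.94"  # the address the reporter expects for this host
print(host, ip, "OK" if ip == expected_ip else "MISMATCH")
```

For the line in this report, the sketch reports a mismatch between the dialed address and the expected one.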

      After connecting to the machine-controller pod, we observed that the wrong IP is obtained when the suffix "shiftstack.com" is appended to the hostname:

      (shiftstack) [stack@undercloud-0 ~]$ oc -c machine-controller rsh machine-api-controllers-869dfb489f-cj46t  
      sh-5.1$ dig overcloud.redhat.local
      ; <<>> DiG 9.16.23-RH <<>> overcloud.redhat.local
      ;; global options: +cmd
      ;; Got answer:
      ;; WARNING: .local is reserved for Multicast DNS
      ;; You are currently testing what happens when an mDNS query is leaked to DNS
      ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 57703
      ;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
      ;; OPT PSEUDOSECTION:
      ; EDNS: version: 0, flags:; udp: 1232
      ; COOKIE: 6011641699a9d37e (echoed)
      ;; QUESTION SECTION:
      ;overcloud.redhat.local.                IN      A
      ;; ANSWER SECTION:
      overcloud.redhat.local. 30      IN      A       10.46.43.94 // <-- Correct IP
      ;; Query time: 59 msec
      ;; SERVER: 172.30.0.10#53(172.30.0.10)
      ;; WHEN: Mon Jul 01 11:42:25 UTC 2024
      ;; MSG SIZE  rcvd: 101
      
      
      sh-5.1$ dig overcloud.redhat.local.shiftstack.com
      ; <<>> DiG 9.16.23-RH <<>> overcloud.redhat.local.shiftstack.com
      ;; global options: +cmd
      ;; Got answer:
      ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 58703
      ;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1
      ;; OPT PSEUDOSECTION:
      ; EDNS: version: 0, flags:; udp: 1232
      ; COOKIE: fd9fd565b45a0930 (echoed)
      ;; QUESTION SECTION:
      ;overcloud.redhat.local.shiftstack.com. IN A
      ;; ANSWER SECTION:
      overcloud.redhat.local.shiftstack.com. 30 IN A  76.223.54.146
      overcloud.redhat.local.shiftstack.com. 30 IN A  13.248.169.48 // <-- Wrong IP
      ;; Query time: 4 msec
      ;; SERVER: 172.30.0.10#53(172.30.0.10)
      ;; WHEN: Mon Jul 01 11:42:33 UTC 2024
      ;; MSG SIZE  rcvd: 184
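
One plausible way for a resolver to end up querying the suffixed name is the search-list behavior described in resolv.conf(5): a name with fewer dots than the `ndots` threshold is tried with each search domain appended before it is tried as-is. The sketch below models only that ordering. The search list and `ndots` value are assumptions chosen for illustration; the pod's actual /etc/resolv.conf is not shown in this report, and `candidate_names` is a hypothetical helper, not a real resolver API. Note that dig does not apply the search list unless `+search` is given, which would be consistent with the first dig query above returning the correct address.

```python
def candidate_names(name, search, ndots):
    """Order in which a stub resolver would try names, per resolv.conf(5) semantics."""
    if name.endswith("."):
        # A trailing dot marks the name as fully qualified: no search list is applied.
        return [name]
    with_suffix = [f"{name}.{domain}." for domain in search]
    as_is = name + "."
    if name.count(".") >= ndots:
        # Enough dots: try the name as-is first, then fall back to the search list.
        return [as_is] + with_suffix
    # Too few dots: the search list is tried first, the bare name last.
    return with_suffix + [as_is]


# ASSUMPTION: illustrative values, not read from the affected pod.
search = ["shiftstack.com"]
ndots = 5

print(candidate_names("overcloud.redhat.local", search, ndots))
print(candidate_names("overcloud.redhat.local.", search, ndots))
```

Under these assumed settings, the suffixed name is queried first. Because the second dig query above shows that overcloud.redhat.local.shiftstack.com returns an answer, a resolver following this order would stop there and never query the bare name.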
      

       

       

      Version-Release number of selected component (if applicable):

      4.16.0
      RHOS-17.1-RHEL-9-20240516.n.1

      The last time we saw it pass was with (RHOS-17.1-RHEL-9-20240516.n.1, 4.16.0-rc.5). Therefore, it looks like the bug was introduced at some point between 4.16.0-rc.5 and 4.16.0.

      How reproducible:

      Always, on OSP deployed with the TLS-Everywhere (TLS-E) feature enabled (passed_phase2 in D/S CI).
      
      Note: We observe this on TLS-E environments because there the OSP API endpoint is a domain name that needs to be resolved. Regular SSL jobs use IPs directly, so nothing needs to be resolved, and they work fine.
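
The distinction in the note above can be sketched as a small check on the endpoint URL: an IP literal is dialed directly, while a hostname has to go through DNS. This is a minimal illustration; `needs_dns` is a hypothetical helper, and the IP-based URL is an assumed example of what a regular SSL job might use, not a value taken from CI.

```python
import ipaddress
from urllib.parse import urlsplit


def needs_dns(url):
    """Return True if the URL's host is a name that must be resolved via DNS."""
    host = urlsplit(url).hostname
    if host is None:
        raise ValueError(f"no host in URL: {url!r}")
    try:
        # Succeeds for IPv4 and IPv6 literals; raises ValueError for hostnames.
        ipaddress.ip_address(host)
    except ValueError:
        return True
    return False


# Endpoint from the error message in this report (TLS-E environment).
print(needs_dns("https://overcloud.redhat.local:13000/"))
# ASSUMPTION: hypothetical IP-based endpoint, as a regular SSL job might use.
print(needs_dns("https://10.46.43.94:13000/"))
```

Only the first endpoint depends on name resolution, which matches the observation that the failure is limited to TLS-E environments.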

      Steps to Reproduce:

          1. Run passed_phase2 in OSP D/S CI

      Actual results:

       machine-controller cannot contact the OSP API and therefore cannot deploy workers. The cluster is not operational.

      Expected results:

       machine-controller can contact the OSP API and therefore deploy workers.

      Additional info:

      must-gather provided in a private comment.

              maandre@redhat.com Martin André
              rlobillo Ramón Lobillo
              Itshak Brown Itshak Brown