  1. OpenShift Service Mesh
  2. OSSM-6699

Fix failures of IPv6 cluster installation


    • Type: Task
    • Resolution: Done
    • Priority: Blocker
    • OSSM 2.5.3, OSSM 2.6.0
    • OSSM 2.5.3, OSSM 2.6.0
    • Maistra, QE
    • None

      Installation of the IPv6 cluster fails on the bootstrap VM with an `i/o timeout` during the init phase, while connecting to the DNS server to resolve the address of the service VM that hosts the bootstrap.ign file.

      [   22.836355] ignition[786]: GET error: Get "http://api.ossm8.maistra.upshift.redhat.com:8000/bootstrap.ign": dial tcp: lookup api.ossm8.maistra.upshift.redhat.com on [2620:52:0:88:f816:3eff:fecb:16e2]:53: read udp [2620:52:0:9c:f816:3eff:fe77:c38e]:51369->[2620:52:0:88:f816:3eff:fecb:16e2]:53: i/o timeout 

      The bootstrap VM is not able to download the ignition file from the service VM, so it never gets initialized.
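
      Ignition errors in this issue fail at one of two stages: the DNS lookup, or the TCP connection itself. The sketch below tells them apart from a console log line. The function name `classify_ignition_error` and the returned keys are hypothetical, and the patterns are assumptions based only on the error wording quoted in this issue.

```python
import re

# A failed DNS query is logged as "dial tcp: lookup <host> on <server>: ...".
LOOKUP_RE = re.compile(r"dial tcp: lookup (?P<host>\S+) on (?P<server>\S+?): ")
# A failed TCP connect is logged as "dial tcp <addr>: ..." with no "lookup" part.
DIAL_RE = re.compile(r"dial tcp (?P<addr>\S+?): ")


def classify_ignition_error(line: str) -> dict:
    """Report whether an ignition GET error failed at DNS or at TCP connect."""
    match = LOOKUP_RE.search(line)
    if match:
        return {"stage": "dns", "host": match["host"], "target": match["server"]}
    match = DIAL_RE.search(line)
    if match:
        return {"stage": "tcp", "host": None, "target": match["addr"]}
    return {"stage": "unknown", "host": None, "target": None}
```

      A `dns` result points at the path to the DNS/Proxy VM, while a `tcp` result points at the path to the service VM.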

      Context:
      Since the IPv6 OpenStack network doesn't have DHCP, the correct kernel arguments (IP address, gateway IP and nameserver) need to be passed via GRUB at boot to each bootstrap/master/worker VM, so that the VM can download the ignition file from the service VM and initialize the system.
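
      As an illustration of such kernel arguments, the sketch below builds a static IPv6 configuration string in the dracut `ip=` layout (`ip=<client>::<gateway>:<prefix>:<hostname>:<interface>:none`, with IPv6 addresses in brackets). The exact arguments used by the pipeline are not shown in this issue, so the helper `static_ipv6_kargs`, the interface name, the prefix length and the example addresses are all assumptions. The bracketed form of `nameserver=` is also an assumption to verify against the initrd in use.

```python
import ipaddress


def static_ipv6_kargs(address, gateway, prefix_len, hostname, interface, nameserver):
    """Build dracut-style kernel arguments for a static IPv6 setup (no DHCP)."""
    # Parsing through ipaddress rejects malformed input before it reaches GRUB.
    addr = ipaddress.IPv6Address(address)
    gw = ipaddress.IPv6Address(gateway)
    dns = ipaddress.IPv6Address(nameserver)
    if not 0 <= prefix_len <= 128:
        raise ValueError(f"invalid IPv6 prefix length: {prefix_len}")
    # The empty second field is the unused <peer> slot; "none" disables autoconfiguration.
    ip_arg = f"ip=[{addr}]::[{gw}]:{prefix_len}:{hostname}:{interface}:none"
    return f"{ip_arg} nameserver=[{dns}]"


# Documentation-range addresses (2001:db8::/32), not the real cluster values.
print(static_ipv6_kargs("2001:db8::10", "2001:db8::1", 64, "bootstrap", "ens3", "2001:db8::53"))
```

      Validating the addresses up front matters here because a mistyped argument only shows up as a timeout inside a VM that cannot yet be reached over ssh.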

      So, the connection goes:

      bootstrap VM (provider_net_ipv6_only network) -> DNS/Proxy VM (provider_net_cci_1 network) -> service VM (provider_net_cci_2 network)

      Because of that, it is impossible to connect via ssh to the bootstrap/master/worker VMs before the first initialization (for example, to ping the DNS or service VM).

      Investigation:

      • checked DNS/Proxy VM 
        • logs didn't contain any errors
        • the DNS port (53) was open and accessible from the service VM ( provider_net_cci_2 network )
        • created a RHEL VM on the IPv6 network, set up DNS in /etc/resolv.conf, and downloaded the bootstrap.ign file via curl without any issue, so DNS is accessible from the provider_net_ipv6_only network. The question is then why the bootstrap VM, which is on the same network, cannot download the file and fails with an i/o timeout.
        • reinstalled the DNS/Proxy VM to make sure it behaves correctly, since it was 1 year old and had survived lots of OpenStack maintenance and upgrades, but no luck
      • tried to use the IP of the service VM to access it directly from the bootstrap VM and bypass the DNS VM
        • changes in the pipeline: https://gitlab.cee.redhat.com/istio/kiali-qe/kiali-qe-utils/-/commit/bbe29edef3e40283aea9e9649fbe21d1b26776d8
        • the bootstrap VM was able to download the ignition file; however, the same problem then hit the master/worker VMs, which also need to download an ignition file. What's more, their ignition files are on a different URL ( because it is managed by the OCP installer ) with a self-signed cert, which can cause problems. However, during the second run the problem was back on the bootstrap VM as well, so bypassing the DNS/Proxy VM didn't help; maybe there is a problem with the connection to the service VM itself
          [   32.849865] ignition[788]: GET error: Get "http://[2620:52:0:88:f816:3eff:fed4:9cb5]:8000/bootstrap.ign": dial tcp [2620:52:0:88:f816:3eff:fed4:9cb5]:8000: i/o timeout 
      • tried to install the DNS/Proxy VM on the same network where the service VMs are created during the cluster installation ( provider_net_cci_2 network ), to minimize problems with connections between shared networks, but no luck
      • tried to install the IPv6 cluster on RHOS-01 (needed to set up all security groups and update the pipelines); the issue did not occur there, so it is related to RHOS-D only. However, I noticed that RHOS-01 has only one dual-stack network and only one IPv6 network, compared to RHOS-D, where `provider_net_cci_1`, `provider_net_cci_2` and `provider_net_cci_3` are also dual-stack
      • so tried to install the DNS/Proxy VM as well as the service VM on the dual-stack network `provider_net_dualstack_1` in RHOS-D during the installation. That appears to have resolved the problem, so the issue happens only when the cluster service VM is on `provider_net_cci_2`. It also occurs only during the ignition phase on `rhcos` VMs, not once the VM is ready or on a manually created RHEL VM (which is able to connect to the service VM).
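
      The manual check from a RHEL VM (resolve the name, then fetch from port 8000) can be approximated with a small two-stage probe. This is a sketch under assumptions rather than the script used in the investigation: `probe` is a hypothetical helper, it relies on the system resolver (so /etc/resolv.conf must already point at the DNS/Proxy VM), and it only tests the TCP connect, not the HTTP download.

```python
import socket


def probe(host, port, timeout=5.0):
    """Check name resolution, then a TCP connect, and report each stage."""
    result = {"dns": None, "tcp": None, "addresses": []}
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        result["dns"] = f"failed: {exc}"
        return result
    result["dns"] = "ok"
    result["addresses"] = sorted({info[4][0] for info in infos})
    # Try every resolved address; stop at the first one that accepts a connection.
    for family, socktype, proto, _, sockaddr in infos:
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
            result["tcp"] = "ok"
            return result
        except OSError as exc:  # includes timeouts and refused connections
            result["tcp"] = f"failed: {exc}"
    return result
```

      For example, `probe("api.ossm8.maistra.upshift.redhat.com", 8000)` run from a VM on provider_net_ipv6_only would show whether a failure is in resolution or in reaching the service VM. It cannot reproduce the ignition-phase failure itself, since it needs an already initialized VM.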

              mkralik@redhat.com Matej Kralik
              mkralik@redhat.com Matej Kralik