Multiple Architecture Enablement / MULTIARCH-1415

[4.8] Installation with multipath parameters in parmfile fails (DNS resolution missing)

    • Bug
    • Resolution: Done
    • Normal
    • 4.8.z
    • 4.8
    • Multi-Arch

      +++ This bug was initially created as a clone of Bug #1974411 +++

      Description of problem:

      An installation with multipath parameters in the parmfile:

      rd.multipath=default
      coreos.inst.install_dev=/dev/mapper/mpatha

      and hostnames in the parmfile fails.

      The installation ends with an emergency shell. Network (IP) is configured, but name resolution is not working. Ping with IP to other system works.

      The same installation (with MP parameters) works if IP addresses are specified instead of hostnames in the parmfile.

      It also works with hostnames in the parmfile if rd.multipath=default is removed and sda is used instead of /dev/mapper/mpatha.

      It looks like the MP parameter(s) break the correct setup of name resolution during installation. Not sure if it should be there, but there is no /etc/resolv.conf in the booted Linux (emergency shell).
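To tell which parmfile entries depend on name resolution, the hostname-versus-IP distinction described above can be checked mechanically. The following is a minimal sketch, not part of the original report: the function name `hosts_needing_dns`, the sample parmfile, and the hosts `bastion.example.com` and `192.0.2.10` are illustrative assumptions. Only the kernel argument names are taken from this report.

```python
import ipaddress
from urllib.parse import urlsplit

# Kernel arguments (as they appear in this report) whose values are URLs.
URL_KARGS = ("coreos.inst.ignition_url", "coreos.live.rootfs_url")


def hosts_needing_dns(parmfile_text: str) -> dict[str, str]:
    """Return {karg: host} for URL kargs whose host is a name, not an IP literal."""
    needs_dns = {}
    for token in parmfile_text.split():
        key, sep, value = token.partition("=")
        if not sep or key not in URL_KARGS:
            continue
        host = urlsplit(value).hostname
        if host is None:
            continue
        try:
            ipaddress.ip_address(host)  # succeeds only for IP literals
        except ValueError:
            needs_dns[key] = host  # a hostname: fetching this URL requires DNS
    return needs_dns


# Hypothetical parmfile content; hosts are placeholders, not from the report.
parmfile = (
    "rd.neednet=1 rd.multipath=default "
    "coreos.inst.install_dev=/dev/mapper/mpatha "
    "coreos.inst.ignition_url=http://bastion.example.com:8080/ignition.ign "
    "coreos.live.rootfs_url=http://192.0.2.10/rootfs.img"
)
print(hosts_needing_dns(parmfile))
```

An empty result means every URL in the parmfile uses an IP literal, which is the configuration reported to work here.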

      Version-Release number of selected component (if applicable):

      oc version
      Client Version: 4.8.0-0.nightly-s390x-2021-06-18-055818
      Server Version: 4.8.0-0.nightly-s390x-2021-06-18-055818
      Kubernetes Version: v1.21.0-rc.0+120883f

      How reproducible:

      Install a node with MP parameters and hostnames in the parmfile.

      Steps to Reproduce:
      1.
      2.
      3.

      Actual results:

      Installation ends in an emergency shell.

      Expected results:

      Installation process works.

      Additional info:

      — Additional comment from Prashanth Sundararaman on 2021-06-21 19:09:41 EDT —

      — Additional comment from Prashanth Sundararaman on 2021-06-21 19:14:08 EDT —

      Hi Jonathan,

      Is this a possible regression caused by https://github.com/coreos/fedora-coreos-config/pull/1011 ?

      Like the original description says, if the Ignition URL is configured with a hostname, coreos-installer errors out. If configured with an IP address, it works.

      Thanks
      Prashanth

      — Additional comment from Dan Li on 2021-06-22 10:25:50 EDT —

      Setting "Blocker-" after discussing with the team, for these reasons:
      1. configuring multipath as a day 2 operation still works
      2. specifying an IP address instead of a hostname works

      — Additional comment from Jonathan Lebon on 2021-06-22 11:02:14 EDT —

      Hmm, I'm not sure how this could be multipath related.
      It looks a lot like https://bugzilla.redhat.com/show_bug.cgi?id=1967483, except in the initrd.

      Full logs from the initrd would be helpful, esp. NetworkManager.

      — Additional comment from Prashanth Sundararaman on 2021-06-22 15:57:13 EDT —

      Funnily enough, coreos-livepxe-rootfs.service succeeds, so the hostname can be resolved there, but not when running coreos-installer.

      — Additional comment from Jonathan Lebon on 2021-06-22 16:05:19 EDT —

      (In reply to Jonathan Lebon from comment #4)
      > Hmm, I'm not sure how this could be multipath related.
      > It looks a lot like https://bugzilla.redhat.com/show_bug.cgi?id=1967483,
      > except in the initrd.

      Sorry, this is incorrect. This BZ matches rhbz#1967483 in that respect as well, since coreos-installer.service runs in the real root.
      (I'm so used to "emergency shell" referring to the initrd emergency shell that my brain jumped to that.)

      — Additional comment from Nikita Dubrovskii (IBM) on 2021-06-23 08:50:43 EDT —

      Today I did some testing with a custom rhcos-4.8 build (with https://github.com/coreos/coreos-installer/pull/564). The Ignition config is downloaded from github.com, and the system works without any DNS issues.

      Here is the cmdline:
      ```
      Kernel command line: rd.neednet=1 dfltcc=off random.trust_cpu=on rd.znet=qeth,0.0.bdf0,0.0.bdf1,0.0.bdf2,layer2=1,portno=0 console=ttysclp0 ip=172.18.142.3::172.18.0.1:255.254.0.0:coreos:encbdf0:off nameserver=172.18.0.1 coreos.inst=yes coreos.inst.insecure=yes coreos.inst.ignition_url=https://raw.githubusercontent.com/nikita-dubrovskii/s390x-ignition-configs/master/ignition.ign coreos.live.rootfs_url=http://172.18.10.243/rhcos-48.84.202106231130-0-live-rootfs.s390x.img zfcp.allow_lun_scan=0 cio_ignore=all,!condev rd.zfcp=0.0.1903,0x500507630910d435,0x408240d100000000 rd.zfcp=0.0.1943,0x500507630914d435,0x408240d100000000 coreos.inst.install_dev=sda coreos.inst.mpath=yes
      ```

      Using another zVM/Linux as the HTTP server for the Ignition config also works (http://m1314001.lnxne.boe:8080/ignition/ignition.ign).
      But using http://bastion.ocp-m1314001.lnxne.boe:8080/ignition/ignition.ign doesn't work,
      so I guess there is something wrong with the bastion node's config (as you can see, the same m1314001 is used as the HTTP server).

      — Additional comment from Jonathan Lebon on 2021-06-24 15:06:15 EDT —

      > Using another zVM/Linux as the HTTP server for the Ignition config also works (http://m1314001.lnxne.boe:8080/ignition/ignition.ign).
      > But using http://bastion.ocp-m1314001.lnxne.boe:8080/ignition/ignition.ign doesn't work,
      > so I guess there is something wrong with the bastion node's config (as you can see, the same m1314001 is used as the HTTP server).

      That's interesting, thanks for the tests. I did some interactive debugging via screenshare with @madeel@redhat.com on this and indeed we saw the install pass without multipath enabled, and fail with it enabled.

      I'm still not sure how multipath can affect DNS resolution, unless it simply makes an existing race easier to trigger. If that's the case, then it might be helped by https://github.com/coreos/coreos-installer/pull/565. I've made a scratch build with that patch:

      http://brew-task-repos.usersys.redhat.com/repos/scratch/jlebon/coreos-installer/0.9.0/7.pr565.rhaos4.8.el8/s390x/

      Re-hosted RPMs in a public space if you don't have VPN access:

      https://jlebon.fedorapeople.org/coreos-installer-0.9.0-7.pr565.rhaos4.8.el8.s390x.rpm
      https://jlebon.fedorapeople.org/coreos-installer-bootinfra-0.9.0-7.pr565.rhaos4.8.el8.s390x.rpm

      Developers with access to an s390x machine who can reproduce this bug should be able to build an RHCOS image with those RPMs and test that.

      — Additional comment from Nikita Dubrovskii (IBM) on 2021-06-25 04:00:07 EDT —

      OK, looks like I've found what's wrong here:

      1) with `coreos.inst.install_dev=/dev/mapper/mpatha rd.multipath=default` and `hostname.com/ignition.conf` in the parm file:
      coreos-installer cannot fetch the Ignition config (DNS), but first, CoreOS tries to propagate `multipath.conf` to `/sysroot`, so we end up with a failure:

      ```
      coreos-propagate-multipath-conf[926]: cp: cannot create regular file '/sysroot/etc/multipath.conf': Read-only file system
      systemd[1]: coreos-propagate-multipath-conf.service: Main process exited, code=exited, status=1/FAILURE
      ...

      systemd[1]: Reached target Emergency Mode.
      ```

      2) with `coreos.inst.install_dev=/dev/mapper/mpatha rd.multipath=default` and `1.2.3.4/ignition.conf` in the parm file:
      coreos-installer can fetch the Ignition config (no DNS needed), but fails with `kpartx` (propagation of `multipath.conf` to `/sysroot` also failed):

      ```
      coreos-propagate-multipath-conf[926]: cp: cannot create regular file '/sysroot/etc/multipath.conf': Read-only file system
      ...
      systemd[1]: Reached target Emergency Mode.
      ...

      [ 23.522376] coreos-installer-service[1859]: device-mapper: resume ioctl on mpatha4 failed: Invalid argument
      [ 23.522453] coreos-installer-service[1859]: resume failed on mpatha4
      [ 23.811211] coreos-installer-service[1859]: Error: getting partition table for /dev/mapper/mpatha
      [ 23.811374] coreos-installer-service[1859]: Caused by:
      [ 23.811395] coreos-installer-service[1859]: "kpartx" "-u" "-n" "/dev/dm-0" failed with exit code: 1
      Failed to start CoreOS Installer.
      ```

      If we take a look at /etc/resolv.conf without multipath, we have valid config:
      ```
      search lnxne.boe
      nameserver 172.18.0.1
      ```

      But with `rd.multipath=default` it's empty, and systemd had already failed, so to me this does not look like a DNS issue.
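The comparison of the two `/etc/resolv.conf` states above can be expressed as a small check. This is a rough sketch under stated assumptions: the helper names `parse_resolv_conf` and `has_nameserver` are hypothetical, and the parser handles only the `nameserver` and `search` keywords that appear in this report.

```python
def parse_resolv_conf(text: str) -> dict[str, list[str]]:
    """Collect nameserver and search entries from resolv.conf-style text."""
    result: dict[str, list[str]] = {"nameserver": [], "search": []}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":  # skip blank lines and comments
            continue
        fields = line.split()
        if fields[0] == "nameserver" and len(fields) >= 2:
            result["nameserver"].append(fields[1])
        elif fields[0] == "search":
            result["search"] = fields[1:]  # a later search line replaces earlier ones
    return result


def has_nameserver(text: str) -> bool:
    """True if at least one nameserver entry is present."""
    return bool(parse_resolv_conf(text)["nameserver"])


# Content quoted in this report for the working (non-multipath) case.
working = "search lnxne.boe\nnameserver 172.18.0.1\n"
print(parse_resolv_conf(working))
print(has_nameserver(working), has_nameserver(""))
```

On a live system the input would come from reading `/etc/resolv.conf`. An empty file, as in the multipath case, yields no nameserver entries.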

      Installing this way also makes no sense: during firstboot, CoreOS starts without multipath,
      so I don't see any reason for installing CoreOS with `rd.multipath=default` right now.

      I would treat this as not a bug, or not a DNS bug.

      — Additional comment from Jonathan Lebon on 2021-06-28 12:10:41 EDT —

      (In reply to Nikita Dubrovskii (IBM) from comment #9)
      > OK, looks like I've found what's wrong here:
      >
      > 1) with `coreos.inst.install_dev=/dev/mapper/mpatha rd.multipath=default`
      > and `hostname.com/ignition.conf` in the parm file:

      What is that karg? Do you mean `ip=...`? Can you show the full parmfile you used?

      > coreos-installer cannot fetch the Ignition config (DNS), but first, CoreOS tries to
      > propagate `multipath.conf` to `/sysroot`, so we end up with a failure:
      >
      > ```
      > coreos-propagate-multipath-conf[926]: cp: cannot create regular file
      > '/sysroot/etc/multipath.conf': Read-only file system
      > systemd[1]: coreos-propagate-multipath-conf.service: Main process exited,
      > code=exited, status=1/FAILURE
      > ...
      >
      > systemd[1]: Reached target Emergency Mode.
      > ```

      Ouch, good catch. So we continue on to the real root even if the service failed.

      > 2) with `coreos.inst.install_dev=/dev/mapper/mpatha rd.multipath=default`
      > and `1.2.3.4/ignition.conf` in the parm file:
      > coreos-installer can fetch the Ignition config (no DNS needed), but fails with `kpartx`
      > (propagation of `multipath.conf` to `/sysroot` also failed):
      >
      > ```
      > coreos-propagate-multipath-conf[926]: cp: cannot create regular file
      > '/sysroot/etc/multipath.conf': Read-only file system
      > ...
      > systemd[1]: Reached target Emergency Mode.
      > ...
      >
      > [ 23.522376] coreos-installer-service[1859]: device-mapper: resume ioctl
      > on mpatha4 failed: Invalid argument
      > [ 23.522453] coreos-installer-service[1859]: resume failed on mpatha4
      > [ 23.811211] coreos-installer-service[1859]: Error: getting partition
      > table for /dev/mapper/mpatha
      > [ 23.811374] coreos-installer-service[1859]: Caused by:
      > [ 23.811395] coreos-installer-service[1859]: "kpartx" "-u" "-n"
      > "/dev/dm-0" failed with exit code: 1
      > Failed to start CoreOS Installer.
      > ```
      >
      > If we take a look at /etc/resolv.conf without multipath, we have valid
      > config:
      > ```
      > search lnxne.boe
      > nameserver 172.18.0.1
      > ```
      >
      > But with `rd.multipath=default` it's empty, and systemd had already failed, so
      > to me this does not look like a DNS issue.

      OK, so I think there are two issues here:
      1. `coreos-propagate-multipath-conf.service` doesn't have

      ```
      OnFailure=emergency.target
      OnFailureJobMode=isolate
      ```

      2. We have no ordering between `coreos-propagate-multipath-conf.service` and `sysroot-etc.mount`.
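The two gaps listed above can be checked against a unit file's text. The sketch below is illustrative only: the unit content is a hypothetical stand-in, not the real `coreos-propagate-multipath-conf.service`, and the helper names are assumptions. Python's `configparser` only approximates systemd's own parser, and the ordering check does not assume whether the fix uses `After=` or `Before=`.

```python
import configparser

# Settings named in the comment above as missing from the unit.
EXPECTED_UNIT_SETTINGS = {
    "OnFailure": "emergency.target",
    "OnFailureJobMode": "isolate",
}


def load_unit(unit_text: str) -> configparser.ConfigParser:
    """Parse unit-file text. With strict=False a repeated key keeps only its
    last value, which differs from systemd's handling of list settings."""
    parser = configparser.ConfigParser(
        strict=False, interpolation=None, delimiters=("=",)
    )
    parser.optionxform = str  # keep key case; systemd keys are case-sensitive
    parser.read_string(unit_text)
    return parser


def missing_failure_settings(unit_text: str) -> list[str]:
    """List the expected [Unit] settings that are absent or set differently."""
    parser = load_unit(unit_text)
    unit = parser["Unit"] if parser.has_section("Unit") else {}
    return [f"{k}={v}" for k, v in EXPECTED_UNIT_SETTINGS.items() if unit.get(k) != v]


def is_ordered_against(unit_text: str, other_unit: str) -> bool:
    """True if other_unit is named in After= or Before= of the [Unit] section."""
    parser = load_unit(unit_text)
    unit = parser["Unit"] if parser.has_section("Unit") else {}
    ordered = unit.get("After", "").split() + unit.get("Before", "").split()
    return other_unit in ordered


# Hypothetical unit text for illustration; not the real service file.
HYPOTHETICAL_UNIT = """\
[Unit]
Description=Hypothetical unit for illustration
DefaultDependencies=false

[Service]
Type=oneshot
ExecStart=/usr/bin/true
"""
print(missing_failure_settings(HYPOTHETICAL_UNIT))
print(is_ordered_against(HYPOTHETICAL_UNIT, "sysroot-etc.mount"))
```

For a unit that declares both failure settings and an ordering dependency on `sysroot-etc.mount`, the first call returns an empty list and the second returns `True`.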

      <time passes>

      Filed: https://github.com/coreos/fedora-coreos-config/pull/1077

      Can you try that out?

      > Installing this way also makes no sense: during firstboot, CoreOS starts
      > without multipath,
      > so I don't see any reason for installing CoreOS with `rd.multipath=default`
      > right now.

      It's valid to turn on multipath at installation time so that coreos-installer can copy the content on top of the multipath target (for the same reasons as https://github.com/coreos/fedora-coreos-config/pull/1011). coreos-installer should support this already (see e.g. https://github.com/coreos/coreos-installer/pull/499), but if we hit issues with kpartx there, let's work on fixing them.

      — Additional comment from Dan Li on 2021-06-28 14:49:30 EDT —

      Hi Muhammad, do you think this bug will be resolved before the end of this sprint (July 3rd)? If not, can we set "Reviewed-in-Sprint"?

      — Additional comment from on 2021-06-29 03:45:13 EDT —

      Hi Dan, the root cause is still not clear, so please set the reviewed flag.

      — Additional comment from Nikita Dubrovskii (IBM) on 2021-06-29 04:45:21 EDT —

      (In reply to Jonathan Lebon from comment #10)
      > (In reply to Nikita Dubrovskii (IBM) from comment #9)
      > > OK, looks like I've found what's wrong here:
      > >
      > > 1) with `coreos.inst.install_dev=/dev/mapper/mpatha rd.multipath=default`
      > > and `hostname.com/ignition.conf` in the parm file:
      >
      > What is that karg? Do you mean `ip=...`? Can you show the full parmfile you
      > used?

      No, it's not an IP address here, but a hostname:
      ```
      ip=172.18.142.3::172.18.0.1:255.254.0.0:coreos:encbdf0:off nameserver=172.18.0.1 coreos.inst=yes coreos.inst.ignition_url=http://m1314001.lnxne.boe:8080/ignition/ignition.ign
      ```

      > OK, so I think there are two issues here:
      > 1. `coreos-propagate-multipath-conf.service` doesn't have
      >
      > ```
      > OnFailure=emergency.target
      > OnFailureJobMode=isolate
      > ```
      > 2. We have no ordering between `coreos-propagate-multipath-conf.service` and
      > `sysroot-etc.mount`.
      >
      > <time passes>
      >
      > Filed: https://github.com/coreos/fedora-coreos-config/pull/1077
      >
      > Can you try that out?

      Did it, works as expected:

      > with kpartx there, let's work on fixing them.

      Here is the PR for the kpartx issue:
      https://github.com/coreos/coreos-installer/pull/566

      — Additional comment from Dan Li on 2021-07-19 10:38:56 EDT —

      Hi Muhammad, do you think this bug will move past ON_QA by the end of this Sprint? If not, can we add "reviewed-in-sprint" flag?

      — Additional comment from on 2021-07-19 11:02:34 EDT —

      Hi Jonathan, do you know when the fix https://github.com/coreos/fedora-coreos-config/pull/1077 will be picked up by RHCOS?

      — Additional comment from Jonathan Lebon on 2021-07-19 11:46:06 EDT —

      Will try to get it in the next 4.8 bootimage bump.
