Uploaded image for project: 'OpenShift Bugs'
  1. OpenShift Bugs
  2. OCPBUGS-43528

IPI Baremetal - BootstrapVM interface restart impacts pulling image and causes ironic service to fail

XMLWordPrintable

    • None
    • Metal Platform 261, Metal Platform 262
    • 2
    • False
    • Hide

      None

      Show
      None
    • Hide
      *Cause*: In configurations where the Provisioning network is disabled, but the bootstrapProvisioningIP is nevertheless set, the IP address of the external network interface on the bootstrap VM could be reconfigured while other processes are attempting to use the network.
      *Consequence*: If the interface reconfiguration occurs during the process of pulling container images (which it does about half the time), the baremetal provisioning components would fail to start.
      *Fix*: Dependencies were added to ensure that nothing else can be running at the time the interface is being reconfigured.
      *Result*: Reconfiguration always happens when the network is idle, and the baremetal provisioning components should always start normally.
      Show
      *Cause*: In configurations where the Provisioning network is disabled, but the bootstrapProvisioningIP is nevertheless set, the IP address of the external network interface on the bootstrap VM could be reconfigured while other processes are attempting to use the network. *Consequence*: If the interface reconfiguration occurs during the process of pulling container images (which it does about half the time), the baremetal provisioning components would fail to start. *Fix*: Dependencies were added to ensure that nothing else can be running at the time the interface is being reconfigured. *Result*: Reconfiguration always happens when the network is idle, and the baremetal provisioning components should always start normally.
    • Bug Fix
    • In Progress

      This is a clone of issue OCPBUGS-36869. The following is the description of the original issue:

      Description of problem:

      IPI Baremetal - BootstrapVM machineNetwork interface restart impacts pulling image and causes ironic service to fail
          

      Version-Release number of selected component (if applicable):

      4.16.Z but also seen this in 4.15 and 4.17
          

      How reproducible:

      50% of our jobs fail because of this.
          

      Steps to Reproduce:

          1. Prepare an IPI baremetal deployment (we have provisioning network disabled, we are using Virtual Media)
          2. Start a deployment, wait for the bootstrapVM to start running and login via SSH
          3. Run the command: journalctl -f | grep "Dependency failed for Ironic baremetal deployment service"
          4. If the command above returns something, then print around 70 lines before and check for the NetworkManager entries in the log about the interface in the baremetal network getting restarted and an error about pulling an image because DNS is not reachable. 
      
          

      Actual results:

      Deployments fail 50% of the time, bootstrapVM is not able to pull an image because main machineNetwork interface is getting restarted and DNS resolution fails.
          

      Expected results:

      Deployments work 100% of the time, bootstrapVM is able to pull any image because machineNetwork interface is NOT restarted while images are getting pulled. 
          

      Additional info:

      We have a CI system to test OCP 4.12 through 4.17 deployments and this issue started to occurred a few weeks ago. mainly in 4.15, 4.16, and 4.17 
          

      In this log extract of a deployment with OCP 4.16.0-0.nightly-2024-07-07-171226: you can see the image pull error because is not able to resolve the registry name, but in the lines before and after you can see that the machineNetwork interface is getting restarted, causing the lack of DNS resolution.

      Mon 2024-07-08 23:15:15 UTC localhost.localdomain init.scope[1]: Finished Build Ironic environment.
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain init.scope[1]: Starting Extract Machine OS Images...
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain init.scope[1]: Starting Provisioning interface...
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain extract-machine-os.service[3779]: Trying to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1370c041f0ecf4f6590c12f3e1b49111aa35052140e7fdd79964c32db47074c1...
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info>  [1720480515.3899] audit: op="connection-update" uuid="bf7e41e3-f1ea-3eed-98fd-c3d021e35d11" 
      name="Wired connection 1" args="ipv4.addresses" pid=3812 uid=0 result="success"
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <warn>  [1720480515.4008] keyfile: load: "/etc/NetworkManager/system-connections/nmconnection": failed to load connection: invalid connection: connection.type: property is missing
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info>  [1720480515.4018] audit: op="connections-reload" pid=3817 uid=0 result="success"
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info>  [1720480515.4159] agent-manager: agent[543677841603162b,:1.67/nmcli-connect/0]: agent registered
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info>  [1720480515.4164] device (ens3): state change: activated -> deactivating (reason 'new-activation', sys-iface-state: 'managed')
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info>  [1720480515.4170] manager: NetworkManager state is now CONNECTED_LOCAL
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info>  [1720480515.4172] device (ens3): disconnecting for new activation request.
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info>  [1720480515.4172] audit: op="connection-activate" uuid="bf7e41e3-f1ea-3eed-98fd-c3d021e35d11" name="Wired connection 1" pid=3821 uid=0 result="success"
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info>  [1720480515.4200] device (ens3): state change: deactivating -> disconnected (reason 'new-activation', sys-iface-state: 'managed')
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info>  [1720480515.4214] dhcp4 (ens3): canceled DHCP transaction
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info>  [1720480515.4215] dhcp4 (ens3): activation: beginning transaction (timeout in 45 seconds)
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info>  [1720480515.4215] dhcp4 (ens3): state changed no lease
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info>  [1720480515.4216] dhcp6 (ens3): canceled DHCP transaction
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info>  [1720480515.4216] dhcp6 (ens3): activation: beginning transaction (timeout in 45 seconds)
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info>  [1720480515.4216] dhcp6 (ens3): state changed no lease
      
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain extract-machine-os.service[3779]: Error: initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1370c041f0ecf4f6590c12f3e1b49111aa35052140e7fdd79964c32db47074c1: (Mirrors also failed: [registry.dfwt5g.lab:4443/ocp-4.16/4.16.0-0.nightly-2024-07-07-171226@sha256:1370c041f0ecf4f6590c1
      2f3e1b49111aa35052140e7fdd79964c32db47074c1: Get "https://registry.dfwt5g.lab:4443/v2/ocp-4.16/4.16.0-0.nightly-2024-07-07-171226/manifests/sha256:1370c041f0ecf4f6590c12f3e1b49111a
      a35052140e7fdd79964c32db47074c1": dial tcp 192.168.5.9:4443: connect: network is unreachable]): quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1370c041f0ecf4f6590c12f3e1b491
      11aa35052140e7fdd79964c32db47074c1: pinging container registry quay.io: Get "https://quay.io/v2/": dial tcp: lookup quay.io on 192.168.32.8:53: dial udp 192.168.32.8:53: connect: network is unreachable
      
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain init.scope[1]: extract-machine-os.service: Main process exited, code=exited, status=125/n/a
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain chronyd.service[1764]: Source 2607:b500:410:7700::1 offline
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain chronyd.service[1764]: Source 69.10.223.134 offline
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info>  [1720480515.4309] policy: set-hostname: set hostname to 'localhost.localdomain' (no hostname found)
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain chronyd.service[1764]: Source 207.246.65.226 offline
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info>  [1720480515.4309] device (ens3): Activation: starting connection 'Wired connection 1' (bf7e41e3-f1ea-3eed-98fd-c3d021e35d11)
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain chronyd.service[1764]: Source 2001:470:f1c4:1::42 offline
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info>  [1720480515.4315] device (ens3): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain chronyd.service[1764]: Source 2603:c020:0:8369::feeb:dab offline
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info>  [1720480515.4317] manager: NetworkManager state is now CONNECTING
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain chronyd.service[1764]: Source 2600:3c01:e000:7e6::123 offline
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info>  [1720480515.4317] device (ens3): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain chronyd.service[1764]: Source 192.168.32.8 offline
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info>  [1720480515.4322] device (ens3): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain chronyd.service[1764]: Source 69.89.207.99 offline
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info>  [1720480515.4326] dhcp4 (ens3): activation: beginning transaction (timeout in 45 seconds)
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain chronyd.service[1764]: Source 135.148.100.14 offline
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info>  [1720480515.4347] dhcp4 (ens3): state changed new lease, address=192.168.32.28
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info>  [1720480515.4350] policy: set 'Wired connection 1' (ens3) as default for IPv4 routing and DNS
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info>  [1720480515.4385] device (ens3): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'managed')
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain chronyd.service[1764]: Removed source 192.168.32.8
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain chronyd.service[1764]: Source 69.10.223.134 online
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain chronyd.service[1764]: Source 207.246.65.226 online
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain chronyd.service[1764]: Source 69.89.207.99 online
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain chronyd.service[1764]: Source 135.148.100.14 online
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain init.scope[1]: extract-machine-os.service: Failed with result 'exit-code'.
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain init.scope[1]: Failed to start Extract Machine OS Images.
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain init.scope[1]: Dependency failed for Customized Machine OS Image Server.
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain init.scope[1]: Dependency failed for Ironic baremetal deployment service.
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain init.scope[1]: ironic.service: Job ironic.service/start failed with result 'dependency'.
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain init.scope[1]: Dependency failed for Metal3 deployment service.
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain init.scope[1]: metal3-baremetal-operator.service: Job metal3-baremetal-operator.service/start failed with result 'dependency'.
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain init.scope[1]: image-customization.service: Job image-customization.service/start failed with result 'dependency'.
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain init.scope[1]: Starting Ironic ramdisk logger...
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain init.scope[1]: Starting Update master BareMetalHosts with introspection data...
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager-dispatcher.service[3899]: NM local-dns-prepender triggered by ens3 dhcp4-change.
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager-dispatcher.service[3899]: <13>Jul  8 23:15:15 root: NM local-dns-prepender triggered by ens3 dhcp4-change.
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager-dispatcher.service[3901]: NM resolv-prepender: Checking for nameservers in /var/run/NetworkManager/resolv.conf
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager-dispatcher.service[3903]: nameserver 192.168.32.8
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager-dispatcher.service[3905]: Failed to get unit file state for systemd-resolved.service: No such file or directory
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain root[3911]: NM local-dns-prepender: Checking if local DNS IP is the first entry in resolv.conf
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager-dispatcher.service[3911]: <13>Jul  8 23:15:15 root: NM local-dns-prepender: Checking if local DNS IP is the first entry in resolv.conf
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager-dispatcher.service[3917]: NM local-dns-prepender: local DNS IP already is the first entry in resolv.conf
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager-dispatcher.service[3917]: <13>Jul  8 23:15:15 root: NM local-dns-prepender: local DNS IP already is the first entry in resolv.conf
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info>  [1720480515.5372] device (ens3): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'managed')
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain provisioning-interface.service[3821]: Connection successfully activated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveMon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info>  [1720480515.5375] device (ens3): state change: secondaries -> activated (reason 'none', sys-iface-state: 'managed')
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info>  [1720480515.5377] manager: NetworkManager state is now CONNECTED_SITE
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info>  [1720480515.5379] device (ens3): Activation: successful, device activated.
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info>  [1720480515.5383] manager: NetworkManager state is now CONNECTED_GLOBAL
      Mon 2024-07-08 23:15:15 UTC localhost.localdomain init.scope[1]: Finished Provisioning interface.
          

              zabitter Zane Bitter
              openshift-crt-jira-prow OpenShift Prow Bot
              Jad Haj Yahya Jad Haj Yahya
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

                Created:
                Updated: