OpenShift Bugs: OCPBUGS-38507

static IP manager crashloops for a while on pod startup


      In a CI test where the metal3 Pod is deleted and allowed to be recreated on another host, it took 5 attempts to start the new pod because static-ip-manager was crashlooping with the following log:

      + '[' -z 172.22.0.3/24 ']'
      + '[' -z enp1s0 ']'
      + '[' -n enp1s0 ']'
      ++ ip -o addr show dev enp1s0 scope global
      + [[ -n 2: enp1s0    inet 172.22.0.134/24 brd 172.22.0.255 scope global dynamic noprefixroute enp1s0\       valid_lft 3sec preferred_lft 3sec ]]
      + ip -o addr show dev enp1s0 scope global
      + grep -q 172.22.0.3/24
      ERROR: "enp1s0" is already set to ip address belong to different subset than "172.22.0.3/24"
      + echo 'ERROR: "enp1s0" is already set to ip address belong to different subset than "172.22.0.3/24"'
      + exit 1

      The error message is misleading about what is actually checked (quite apart from the subnet/subset typo): the script only greps the interface's global addresses for the Provisioning VIP. It doesn't appear this check could ever pass for IPv4, since we never expect the Provisioning VIP to appear on the interface before we've set it. (With IPv6 this should often work thanks to an appalling and unsafe hack. Not to suggest that grepping for an IPv4 address complete with unescaped .'s in it is safe either.)
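To illustrate why the unescaped grep is unsafe, here is a minimal sketch comparing it with an exact comparison of the address field. The helper names `regex_match` and `exact_match`, the interface `eth0`, and the addresses are made up for the example; none of this is taken from the static-ip-manager script. It assumes the usual `ip -o addr show` layout, where the address with its prefix is the fourth whitespace-separated field.

```shell
#!/bin/bash
# Both helpers read `ip -o addr show`-style lines on stdin and take the
# wanted address (with prefix length) as $1.

# Same style of check as in the trace: the address is used as a regex, so "."
# matches any character and the pattern can match inside a longer address.
regex_match() { grep -q "$1"; }

# Hypothetical alternative: compare the address field ($4) literally.
exact_match() { awk -v want="$1" '$4 == want { found = 1 } END { exit !found }'; }

# Made-up example: the interface holds 10.1.223.4/24, we look for 1.2.3.4/24.
line='2: eth0    inet 10.1.223.4/24 brd 10.1.223.255 scope global eth0'
vip='1.2.3.4/24'

if printf '%s\n' "$line" | regex_match "$vip"; then
    echo "regex: match"      # false positive: "1.223.4/24" fits "1.2.3.4/24"
else
    echo "regex: no match"
fi

if printf '%s\n' "$line" | exact_match "$vip"; then
    echo "exact: match"
else
    echo "exact: no match"   # the address field differs, as expected
fi
```

The regex version reports a match for an address that is not the one requested, while the field comparison does not. `grep -F` would remove the wildcard problem but would still match substrings.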


      Eventually the pod does start up, with this in the log:

      + '[' -z 172.22.0.3/24 ']'
      + '[' -z enp1s0 ']'
      + '[' -n enp1s0 ']'
      ++ ip -o addr show dev enp1s0 scope global
      + [[ -n '' ]]
      + /usr/sbin/ip address flush dev enp1s0 scope global
      + /usr/sbin/ip addr add 172.22.0.3/24 dev enp1s0 valid_lft 300 preferred_lft 300

      So essentially this only worked because there were no IP addresses left on the provisioning interface.

      In the original (error) log the machine's IP 172.22.0.134/24 has a valid lifetime of 3s, so that likely explains why it later disappears. The provisioning network is managed, so the IP address comes from dnsmasq in the former incarnation of the metal3 pod. We effectively prevent the new pod from starting until the DHCP addresses have timed out, even though we will later flush them to ensure no stale ones are left behind.
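The behaviour described above can be summarised with a small reconstruction of the decision as it appears in the two traces. This is a sketch inferred from the xtrace output, not the actual script: the function name `startup_decision` and its stdin-based interface are hypothetical, and the branch where the grep succeeds is assumed to continue to the flush and add steps.

```shell
#!/bin/bash
# Reconstruction of the start-up decision, inferred from the traces only.
#   $1    : provisioning IP with prefix length, e.g. 172.22.0.3/24
#   stdin : output of `ip -o addr show dev <interface> scope global`
startup_decision() {
    local vip="$1" addrs
    addrs="$(cat)"
    if [[ -n "$addrs" ]] && ! grep -q "$vip" <<<"$addrs"; then
        # Some global address is present and the VIP is not among them:
        # the container exits and the pod crashloops.
        echo "exit 1"
    else
        # No global addresses (or, by assumption, the VIP is already there):
        # stale addresses are flushed and the VIP is added.
        echo "flush and add $vip"
    fi
}

# First trace: a DHCP lease from the previous pod's dnsmasq is still present.
printf '%s\n' '2: enp1s0    inet 172.22.0.134/24 brd 172.22.0.255 scope global dynamic noprefixroute enp1s0' \
    | startup_decision 172.22.0.3/24

# Second trace: the lease has expired and the interface has no global address.
printf '' | startup_decision 172.22.0.3/24
```

Since the success path flushes the interface anyway, the check appears only to delay start-up until the old lease expires.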

      The check was originally added by https://github.com/openshift/ironic-static-ip-manager/pull/27 but that only describes what it does and not the reason. There's no linked ticket to indicate what the purpose was.

            zabitter Zane Bitter
            Steeve Goveas Steeve Goveas