- Bug
- Resolution: Unresolved
- Minor
- 4.17
In a CI test where the metal3 Pod is deleted and allowed to recreate on another host, it took 5 attempts to start the new pod because static-ip-manager was crashlooping with the following log:
+ '[' -z 172.22.0.3/24 ']'
+ '[' -z enp1s0 ']'
+ '[' -n enp1s0 ']'
++ ip -o addr show dev enp1s0 scope global
+ [[ -n 2: enp1s0 inet 172.22.0.134/24 brd 172.22.0.255 scope global dynamic noprefixroute enp1s0\ valid_lft 3sec preferred_lft 3sec ]]
+ ip -o addr show dev enp1s0 scope global
+ grep -q 172.22.0.3/24
ERROR: "enp1s0" is already set to ip address belong to different subset than "172.22.0.3/24"
+ echo 'ERROR: "enp1s0" is already set to ip address belong to different subset than "172.22.0.3/24"'
+ exit 1
The error message is misleading about what is actually checked (apart from the whole subnet/subset typo). It doesn't appear this check can ever pass for IPv4, since we never expect the Provisioning VIP to appear on the interface before we've set it. (With IPv6 it should often work thanks to an appalling and unsafe hack. Not to suggest that grepping for an IPv4 address, complete with unescaped '.'s in it, is safe either.)
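Since the trace greps for the literal address with unescaped dots, each '.' is a regex metacharacter that matches any character. A minimal demonstration of the hazard (the mangled address is made up for illustration):

```shell
# In a basic regular expression, '.' matches any character, so grepping
# for an IPv4 address without escaping the dots can match unrelated text.
pattern="172.22.0.3/24"

# False positive: the three dots happily match 'a', 'b' and 'c'.
echo "172a22b0c3/24" | grep -q "$pattern" && echo "regex match"

# A fixed-string match (grep -F) treats the dots literally and avoids this.
echo "172a22b0c3/24" | grep -qF "$pattern" || echo "no fixed-string match"
```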
Eventually the pod does start up, with this in the log:
+ '[' -z 172.22.0.3/24 ']'
+ '[' -z enp1s0 ']'
+ '[' -n enp1s0 ']'
++ ip -o addr show dev enp1s0 scope global
+ [[ -n '' ]]
+ /usr/sbin/ip address flush dev enp1s0 scope global
+ /usr/sbin/ip addr add 172.22.0.3/24 dev enp1s0 valid_lft 300 preferred_lft 300
So essentially this only worked because there were no longer any IP addresses on the provisioning interface.
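The two traces above suggest the script's decision logic looks roughly like the sketch below. This is a reconstruction inferred from the logs, not the actual ironic-static-ip-manager source, and the function name and argument shape are invented:

```shell
#!/bin/bash
# Hypothetical reconstruction of the startup check, inferred from the traces
# above; the real script lives in openshift/ironic-static-ip-manager.
check_provisioning_ip() {
    local expected_ip="$1"    # e.g. 172.22.0.3/24 (the Provisioning VIP)
    local current_addrs="$2"  # stand-in for: ip -o addr show dev "$IFACE" scope global
    if [ -n "$current_addrs" ]; then
        # Crashloop path: a global address exists but the VIP is not among
        # them, so the container exits before ever flushing the stale address.
        echo "$current_addrs" | grep -q "$expected_ip" || return 1
    fi
    # Success path: here the real script flushes the interface and adds the VIP.
    return 0
}

# First trace: the stale DHCP lease 172.22.0.134/24 is still present -> crashloop.
check_provisioning_ip "172.22.0.3/24" "2: enp1s0 inet 172.22.0.134/24" || echo "crashloop"
# Second trace: the lease has expired and no global addresses remain -> pod starts.
check_provisioning_ip "172.22.0.3/24" "" && echo "starts"
```

Note that the non-empty branch never reaches the flush, which is why the pod cannot start until the DHCP lease expires on its own.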
In the original (error) log, the machine's IP 172.22.0.134/24 has a valid lifetime of 3 s, which likely explains why it later disappears. The provisioning network is managed, so the address came from dnsmasq in the previous incarnation of the metal3 pod. We effectively prevent the new pod from starting until the DHCP addresses have timed out, even though we will later flush them anyway to ensure no stale ones are left behind.
The check was originally added by https://github.com/openshift/ironic-static-ip-manager/pull/27, but that PR only describes what the check does, not why. There's no linked ticket to indicate what the purpose was.
- blocks: OCPBUGS-49350 static IP manager crashloops for a while on pod startup (ON_QA)
- is cloned by: OCPBUGS-49350 static IP manager crashloops for a while on pod startup (ON_QA)
- is depended on by: OCPBUGS-48754 [OCP 4.16] static IP manager crashloops - backport of OCPBUGS-38507 to 4.16 (ASSIGNED)
- is duplicated by: OCPBUGS-39314 Excessive Restarts on container/metal3-static-ip-set (Verified)
- links to: RHEA-2024:6122 OpenShift Container Platform 4.18.z bug fix update