Original Problem statement: External service provided by metallb will periodically become unavailable to clients.
Summary of issue: When metal3 provisioning is set to "Managed", it replaces the IP address on the provisioningInterface with the provisioningIP. When the metal3 pod is relocated, it removes the provisioningIP and leaves the OCP node without any IP address on that interface. If the active metallb speaker (L2 mode) is co-located on the same node, the associated service stays broken until an IP address is manually restored or the metallb speaker is moved.
Details and reproducer:
Environment: OCP 4.18 consolidated cluster with RHOSO 18
$ oc get nodes
NAME STATUS ROLES AGE VERSION
rhoso1 Ready control-plane,master,worker 211d v1.31.14
rhoso2 Ready control-plane,master,worker 211d v1.31.14
rhoso3 Ready control-plane,master,worker 211d v1.31.14
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.18.30 True False 61d Cluster version is 4.18.30
$ oc get openstackversion
NAME TARGET VERSION AVAILABLE VERSION DEPLOYED VERSION
openstack-control-plane 18.0.15-20251126.192455 18.0.15-20251126.192455 18.0.15-20251126.192455
The affected OCP service (type LoadBalancer) is dnsmasq, exposed on the "ctlplane" network as configured by the RHOSO operator.
$ oc get service -n openstack -l service=dnsmasq
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
dnsmasq-dns LoadBalancer 172.30.202.45 192.168.122.80 53:32264/UDP,53:32264/TCP 189d
$ oc describe service dnsmasq-dns
Name:         dnsmasq-dns
Namespace:    openstack
Labels:       service=dnsmasq
Annotations:  core.openstack.org/ingress_create: false
              metallb.io/ip-allocated-from-pool: ctlplane
              metallb.universe.tf/address-pool: ctlplane
              metallb.universe.tf/allow-shared-ip: ctlplane
              metallb.universe.tf/loadBalancerIPs: 192.168.122.80
Selector:     service=dnsmasq
[...]
MetalLB L2Advertisement:
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: ctlplane
  namespace: metallb-system
spec:
  ipAddressPools:
  - ctlplane
  interfaces:
  - enp2s0
NNCP example for ctlplane network:
apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: nncp-rhoso1
spec:
  desiredState:
    interfaces:
    - description: Configuring enp2s0
      ipv4:
        address:
        - ip: 192.168.122.10
          prefix-length: 24
        enabled: true
        dhcp: false
      ipv6:
        enabled: false
      mtu: 1500
      name: enp2s0
      state: up
      type: ethernet
Original ctlplane IPs for each node:
$ oc get nns -o yaml |grep 192.168.122
- ip: 192.168.122.10
- ip: 192.168.122.11
- ip: 192.168.122.12
The DNS client on an external compute node works fine:
[root@compute18-node2 ~]# dig +noall +answer +additional ovsdbserver-sb.openstack.svc @192.168.122.80
ovsdbserver-sb.openstack.svc. 0 IN A 172.17.0.33
ovsdbserver-sb.openstack.svc. 0 IN A 172.17.0.36
ovsdbserver-sb.openstack.svc. 0 IN A 172.17.0.37
Original metal3 Provisioning CR:
apiVersion: v1
items:
- apiVersion: metal3.io/v1alpha1
  kind: Provisioning
  metadata:
    finalizers:
    - provisioning.metal3.io
    name: provisioning-configuration
  spec:
    preProvisioningOSDownloadURLs: {}
    provisioningMacAddresses:
    - 52:54:00:67:f0:e8
    - 52:54:00:73:f7:6b
    - 52:54:00:55:37:34
    provisioningNetwork: Disabled
    virtualMediaViaExternalNetwork: true
    watchAllNamespaces: true
kind: List
metadata:
  resourceVersion: ""
Apply the following config to change metal3 to 'Managed':
$ cat provisioning-managed.yaml
apiVersion: v1
items:
- apiVersion: metal3.io/v1alpha1
  kind: Provisioning
  metadata:
    finalizers:
    - provisioning.metal3.io
    name: provisioning-configuration
  spec:
    preProvisioningOSDownloadURLs: {}
    provisioningMacAddresses:
    - 52:54:00:67:f0:e8
    - 52:54:00:73:f7:6b
    - 52:54:00:55:37:34
    provisioningNetwork: Managed
    provisioningInterface: enp2s0
    provisioningIP: 192.168.122.2
    provisioningNetworkCIDR: 192.168.122.0/24
    provisioningDHCPRange: 192.168.122.211,192.168.122.230
    virtualMediaViaExternalNetwork: false
    watchAllNamespaces: true
kind: List
metadata:
  resourceVersion: ""
$ oc apply -f provisioning-managed.yaml
provisioning.metal3.io/provisioning-configuration configured
Afterwards, the ctlplane IP of one of the OCP nodes changes to the provisioningIP:
$ oc get nns -o yaml |grep 192.168.122
- ip: 192.168.122.2
- ip: 192.168.122.11
- ip: 192.168.122.12
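The before/after comparison above can be scripted. The following is only a sketch: diff_ips is a hypothetical helper (not part of oc), and it assumes the two address lists were captured as newline-separated text.

```shell
# diff_ips BEFORE AFTER
# Hypothetical helper: print "- ip" for addresses that disappeared and
# "+ ip" for addresses that appeared between two newline-separated lists.
diff_ips() {
  for ip in $1; do
    printf '%s\n' "$2" | grep -qxF "$ip" || echo "- $ip"
  done
  for ip in $2; do
    printf '%s\n' "$1" | grep -qxF "$ip" || echo "+ $ip"
  done
}

# Example capture (assumes the ctlplane subnet used in this report):
#   before=$(oc get nns -o yaml | grep -oE '192\.168\.122\.[0-9]+')
#   ... apply provisioning-managed.yaml ...
#   after=$(oc get nns -o yaml | grep -oE '192\.168\.122\.[0-9]+')
#   diff_ips "$before" "$after"
```

With the two listings shown in this report, this would flag 192.168.122.10 as removed and 192.168.122.2 as added.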
Service is still fine:
[root@compute18-node2 ~]# sudo dig +noall +answer +additional ovsdbserver-sb.openstack.svc
ovsdbserver-sb.openstack.svc. 0 IN A 172.17.0.36
ovsdbserver-sb.openstack.svc. 0 IN A 172.17.0.37
ovsdbserver-sb.openstack.svc. 0 IN A 172.17.0.33
Check the MAC address of the active speaker:
[root@compute18-node2 ~]# arping -I enp1s0 192.168.122.80 -c 1
ARPING 192.168.122.80 from 192.168.122.121 enp1s0
Unicast reply from 192.168.122.80 [52:54:00:67:F0:E8] 0.859ms
The metal3 pod is on node rhoso1:
$ oc get pods -n openshift-machine-api -l k8s-app=metal3 -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
metal3-84c4d8d884-9lkvm 5/5 Running 0 3m48s 10.28.1.11 rhoso1 <none> <none>
rhoso1 has the same MAC address, so it hosts the active speaker (it should also be possible to put the metallb speaker in debug mode and see the ARP responses):
$ oc exec -ti -n metallb-system -c speaker speaker-lhjjh -- ip link ls dev enp2s0|grep ether
link/ether 52:54:00:67:f0:e8 brd ff:ff:ff:ff:ff:ff
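The MAC comparison above is done by eye and the two tools print different letter case. A minimal sketch of the same check, assuming the arping output format shown in this report; mac_from_arping and same_mac are hypothetical helper names:

```shell
# mac_from_arping: read arping output on stdin and print the first MAC
# address found inside square brackets.
mac_from_arping() {
  sed -n 's/.*\[\([0-9A-Fa-f:]*\)\].*/\1/p' | head -n 1
}

# same_mac A B: succeed when the two MAC addresses match, ignoring case.
same_mac() {
  a=$(printf '%s' "$1" | tr 'A-F' 'a-f')
  b=$(printf '%s' "$2" | tr 'A-F' 'a-f')
  [ "$a" = "$b" ]
}

# Example (run on the compute node; node MAC taken from the speaker pod):
#   vip_mac=$(arping -I enp1s0 192.168.122.80 -c 1 | mac_from_arping)
#   same_mac "$vip_mac" "52:54:00:67:f0:e8" && echo "rhoso1 answers for the VIP"
```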
To reproduce the issue we will force metal3 to move:
$ oc delete pod -n openshift-machine-api metal3-84c4d8d884-9lkvm
pod "metal3-84c4d8d884-9lkvm" deleted
The metal3 pod is now on rhoso2:
$ oc get pods -n openshift-machine-api -l k8s-app=metal3 -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
metal3-84c4d8d884-h8qwh 5/5 Running 0 45s 10.28.1.12 rhoso2 <none> <none>
DNS client is now broken:
[root@compute18-node2 ~]# sudo dig +noall +answer +additional ovsdbserver-sb.openstack.svc @192.168.122.80
;; connection timed out; no servers could be reached
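To catch this failure without reading the dig output manually, the result can be classified by whether any A record came back. This is a sketch only; dns_answer_ok is a hypothetical helper that reads dig output on stdin.

```shell
# dns_answer_ok: succeed only if the dig output on stdin contains at
# least one IPv4 A record; a timeout message yields a non-zero status.
dns_answer_ok() {
  grep -Eq '[[:space:]]IN[[:space:]]+A[[:space:]]+[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+'
}

# Example (same query as in this report):
#   dig +noall +answer +additional ovsdbserver-sb.openstack.svc @192.168.122.80 \
#     | dns_answer_ok || echo "dnsmasq service on 192.168.122.80 is unreachable"
```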
The 192.168.122.x IP is now missing from the rhoso1 node; only two addresses remain:
$ oc get nns -o yaml |grep 192.168.122
- ip: 192.168.122.2
- ip: 192.168.122.12
$ oc debug node/rhoso1 -- ip a ls dev enp2s0
Starting pod/rhoso1-debug-4zl4v ...
To use host binaries, run `chroot /host`
3: enp2s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 52:54:00:67:f0:e8 brd ff:ff:ff:ff:ff:ff
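The missing address can also be detected mechanically from the `ip a` output. A minimal sketch; has_ipv4 is a hypothetical helper that only looks for an "inet" line and ignores "inet6":

```shell
# has_ipv4: read `ip a ls dev <iface>` output on stdin and succeed only
# if an IPv4 ("inet") address line is present.
has_ipv4() {
  grep -Eq '^[[:space:]]*inet[[:space:]]'
}

# Example (node and interface names as used in this report):
#   oc debug node/rhoso1 -- ip a ls dev enp2s0 2>/dev/null \
#     | has_ipv4 || echo "rhoso1: no IPv4 address on enp2s0"
```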
To fix this, we force the NNCP to re-run and set up the original IP for the rhoso1 node:
$ oc edit nncp/nncp-rhoso1
nodenetworkconfigurationpolicy.nmstate.io/nncp-rhoso1 edited
$ oc get nncp nncp-rhoso1
NAME STATUS REASON
nncp-rhoso1 Available ConfigurationProgressing
The IP is restored and the metallb service works again:
$ oc debug node/rhoso1 -- ip a ls dev enp2s0
Starting pod/rhoso1-debug-59vhq ...
To use host binaries, run `chroot /host`
3: enp2s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 52:54:00:67:f0:e8 brd ff:ff:ff:ff:ff:ff
inet 192.168.122.10/24 brd 192.168.122.255 scope global noprefixroute enp2s0
valid_lft forever preferred_lft forever
[root@compute18-node2 ~]# sudo dig +noall +answer +additional ovsdbserver-sb.openstack.svc @192.168.122.80
ovsdbserver-sb.openstack.svc. 0 IN A 172.17.0.33
ovsdbserver-sb.openstack.svc. 0 IN A 172.17.0.36
ovsdbserver-sb.openstack.svc. 0 IN A 172.17.0.37