
Type: Bug
Resolution: Unresolved
Priority: Undefined
Fix Version/s: None
Affects Version/s: 4.18
Component/s: Bare Metal Hardware Provisioning / cluster-baremetal-operator
Labels:
None

Activity Type:
None
Blocked:
False
Blocked Reason:

None
Story Points:
None
Severity:
None
Regression:
None

Target Backport Versions:
None
Target Version:
None
Release Blocker:
None
Sprint:
CNF Network Sprint 285
sprint_count:
1

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Impact Score:

Release Note Status:
None
Release Note Type:
None
Release Note Text:
None

Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

Original problem statement: An external service provided by MetalLB periodically becomes unavailable to clients.

Summary of issue: When metal3 provisioning is set to "Managed", it changes the IP address of the provisioningInterface to the provisioningIP. When the metal3 pod is relocated, it removes the provisioningIP and leaves the OCP node without any IP address on that interface. If the active MetalLB speaker (L2 mode) is co-located on the same node, the associated service breaks until an IP address is manually restored or the MetalLB speaker is moved.

Details and reproducer:

Environment: OCP 4.18 consolidated cluster with RHOSO 18

$ oc get nodes
NAME     STATUS   ROLES                         AGE    VERSION
rhoso1   Ready    control-plane,master,worker   211d   v1.31.14
rhoso2   Ready    control-plane,master,worker   211d   v1.31.14
rhoso3   Ready    control-plane,master,worker   211d   v1.31.14


$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.18.30   True        False         61d     Cluster version is 4.18.30

$ oc get openstackversion
NAME                      TARGET VERSION            AVAILABLE VERSION         DEPLOYED VERSION
openstack-control-plane   18.0.15-20251126.192455   18.0.15-20251126.192455   18.0.15-20251126.192455


The affected OCP service (type LoadBalancer) is dnsmasq, exposed on the "ctlplane" network as configured by the RHOSO operator.

$ oc get service -n openstack -l service=dnsmasq
NAME          TYPE           CLUSTER-IP      EXTERNAL-IP      PORT(S)                     AGE
dnsmasq-dns   LoadBalancer   172.30.202.45   192.168.122.80   53:32264/UDP,53:32264/TCP   189d

$ oc describe service dnsmasq-dns 
Name:                     dnsmasq-dns
Namespace:                openstack
Labels:                   service=dnsmasq
Annotations:              core.openstack.org/ingress_create: false
                          metallb.io/ip-allocated-from-pool: ctlplane
                          metallb.universe.tf/address-pool: ctlplane
                          metallb.universe.tf/allow-shared-ip: ctlplane
                          metallb.universe.tf/loadBalancerIPs: 192.168.122.80
Selector:                 service=dnsmasq
[...]


MetalLB L2 advertisement:

apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: ctlplane
  namespace: metallb-system
spec:
  ipAddressPools:
  - ctlplane
  interfaces:
  - enp2s0


NNCP example for ctlplane network:

apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: nncp-rhoso1
spec:
  desiredState:
    interfaces:
    - description: Configuring enp2s0
      ipv4:
        address:
        - ip: 192.168.122.10
          prefix-length: 24
        enabled: true
        dhcp: false
      ipv6:
        enabled: false
      mtu: 1500
      name: enp2s0
      state: up
      type: ethernet

Original ctlplane IPs for each node:

$ oc get nns -o yaml |grep 192.168.122
          - ip: 192.168.122.10
          - ip: 192.168.122.11
          - ip: 192.168.122.12
          
The DNS client on an external compute node works fine:

[root@compute18-node2 ~]# dig +noall +answer +additional ovsdbserver-sb.openstack.svc @192.168.122.80
ovsdbserver-sb.openstack.svc. 0 IN      A       172.17.0.33
ovsdbserver-sb.openstack.svc. 0 IN      A       172.17.0.36
ovsdbserver-sb.openstack.svc. 0 IN      A       172.17.0.37

Original metal3 Provisioning CR:

apiVersion: v1
items:
- apiVersion: metal3.io/v1alpha1
  kind: Provisioning
  metadata:
    finalizers:
    - provisioning.metal3.io
    name: provisioning-configuration
  spec:
    preProvisioningOSDownloadURLs: {}
    provisioningMacAddresses:
    - 52:54:00:67:f0:e8
    - 52:54:00:73:f7:6b
    - 52:54:00:55:37:34
    provisioningNetwork: Disabled
    virtualMediaViaExternalNetwork: true
    watchAllNamespaces: true
kind: List
metadata:
  resourceVersion: ""

Apply the following config to change metal3 to 'Managed':

$ cat provisioning-managed.yaml 
apiVersion: v1
items:
- apiVersion: metal3.io/v1alpha1
  kind: Provisioning
  metadata:
    finalizers:
    - provisioning.metal3.io
    name: provisioning-configuration
  spec:
    preProvisioningOSDownloadURLs: {}
    provisioningMacAddresses:
    - 52:54:00:67:f0:e8
    - 52:54:00:73:f7:6b
    - 52:54:00:55:37:34
    provisioningNetwork: Managed
    provisioningInterface: enp2s0
    provisioningIP: 192.168.122.2
    provisioningNetworkCIDR: 192.168.122.0/24
    provisioningDHCPRange: 192.168.122.211,192.168.122.230
    virtualMediaViaExternalNetwork: false
    watchAllNamespaces: true
kind: List
metadata:
  resourceVersion: ""

$ oc apply -f provisioning-managed.yaml
provisioning.metal3.io/provisioning-configuration configured

Afterwards, the ctlplane IP of one of the OCP nodes changes to the provisioningIP:

$ oc get nns -o yaml |grep 192.168.122
       - ip: 192.168.122.2
       - ip: 192.168.122.11
       - ip: 192.168.122.12
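
The replaced address can also be spotted mechanically. The following is a minimal sketch, assuming the output of `oc get nns -o yaml` is piped on stdin; the function name `missing_ips` is hypothetical and not part of any tool mentioned in this report.

```shell
# missing_ips EXPECTED...: print every expected IP that does not appear
# as a whole word in the text read from stdin (hypothetical helper).
missing_ips() {
    input=$(cat)
    for ip in "$@"; do
        # -F: fixed string, -w: whole word, so 192.168.122.2 does not
        # match inside 192.168.122.211
        printf '%s\n' "$input" | grep -Fqw -- "$ip" || printf '%s\n' "$ip"
    done
}

# Example with the original node IPs from this report (needs cluster access):
# oc get nns -o yaml | missing_ips 192.168.122.10 192.168.122.11 192.168.122.12
```

With the output shown above, this would report 192.168.122.10 as missing, since that address was replaced by the provisioningIP.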

The service still works at this point:

[root@compute18-node2 ~]# sudo dig +noall +answer +additional ovsdbserver-sb.openstack.svc
ovsdbserver-sb.openstack.svc. 0 IN      A       172.17.0.36
ovsdbserver-sb.openstack.svc. 0 IN      A       172.17.0.37
ovsdbserver-sb.openstack.svc. 0 IN      A       172.17.0.33

Check the MAC address of the active speaker:
[root@compute18-node2 ~]# arping -I enp1s0 192.168.122.80 -c 1
ARPING 192.168.122.80 from 192.168.122.121 enp1s0
Unicast reply from 192.168.122.80 [52:54:00:67:F0:E8]  0.859ms

The metal3 pod is on node rhoso1:

$ oc get pods -n openshift-machine-api  -l k8s-app=metal3  -o wide
NAME                                          READY   STATUS    RESTARTS   AGE     IP             NODE     NOMINATED NODE   READINESS GATES
metal3-84c4d8d884-9lkvm                       5/5     Running   0          3m48s   10.28.1.11     rhoso1   <none>           <none>

The enp2s0 interface on rhoso1 has the same MAC address, so rhoso1 hosts the active speaker (I think you can also put the MetalLB speaker in debug mode and see the ARP responses):

$ oc exec -ti -n metallb-system -c speaker speaker-lhjjh  -- ip link ls dev enp2s0|grep ether
    link/ether 52:54:00:67:f0:e8 brd ff:ff:ff:ff:ff:ff
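
Note that arping prints the MAC in upper case while `ip link` prints it in lower case, so a direct string comparison fails. A minimal sketch of a case-insensitive comparison follows; the function name `same_mac` is hypothetical, and the two values are the ones shown above.

```shell
# same_mac A B: succeed if two MAC addresses are equal, ignoring letter case
# (hypothetical helper).
same_mac() {
    a=$(printf '%s' "$1" | tr 'A-F' 'a-f')
    b=$(printf '%s' "$2" | tr 'A-F' 'a-f')
    [ "$a" = "$b" ]
}

# MAC from the arping reply vs. MAC of enp2s0 as seen from the speaker pod
if same_mac "52:54:00:67:F0:E8" "52:54:00:67:f0:e8"; then
    echo "active speaker is on this node"
fi
```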
    
To reproduce the issue, force the metal3 pod to move by deleting it:

$ oc delete pod -n openshift-machine-api metal3-84c4d8d884-9lkvm
pod "metal3-84c4d8d884-9lkvm" deleted

The metal3 pod is now on rhoso2:

$ oc get pods -n openshift-machine-api  -l k8s-app=metal3  -o wide
NAME                                          READY   STATUS    RESTARTS   AGE   IP             NODE     NOMINATED NODE   READINESS GATES
metal3-84c4d8d884-h8qwh                       5/5     Running   0          45s   10.28.1.12     rhoso2   <none>           <none>

The DNS client is now broken:

[root@compute18-node2 ~]# sudo dig +noall +answer +additional ovsdbserver-sb.openstack.svc @192.168.122.80
;; connection timed out; no servers could be reached

The 192.168.122.x IP is now missing from the rhoso1 node:

$ oc get nns -o yaml |grep 192.168.122
          - ip: 192.168.122.2
          - ip: 192.168.122.12

$ oc debug node/rhoso1 -- ip  a ls dev enp2s0
Starting pod/rhoso1-debug-4zl4v ...
To use host binaries, run `chroot /host`
3: enp2s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:67:f0:e8 brd ff:ff:ff:ff:ff:ff
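
The missing address can be checked for in a script as well. This is a minimal sketch, assuming the text of `ip a ls dev <iface>` is piped on stdin; the function name `has_ipv4` is hypothetical.

```shell
# has_ipv4: succeed if the `ip addr` output on stdin contains at least one
# IPv4 ("inet a.b.c.d/nn") line; prints nothing (hypothetical helper).
has_ipv4() {
    grep -Eq '^[[:space:]]*inet [0-9]{1,3}(\.[0-9]{1,3}){3}/[0-9]{1,2}'
}

# Example (needs cluster access, so it is left commented out):
# oc debug node/rhoso1 -- ip a ls dev enp2s0 2>/dev/null | has_ipv4 \
#     || echo "enp2s0 on rhoso1 has no IPv4 address"
```

The pattern requires a space after `inet`, so `inet6` lines are not counted as IPv4 addresses.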

To work around this, we force the NNCP to be re-applied, which sets up the original IP on the rhoso1 node again.

$ oc edit nncp/nncp-rhoso1
nodenetworkconfigurationpolicy.nmstate.io/nncp-rhoso1 edited

$ oc get nncp nncp-rhoso1
NAME          STATUS      REASON
nncp-rhoso1   Available   ConfigurationProgressing

The IP is restored and the MetalLB service works again:

$ oc debug node/rhoso1 -- ip  a ls dev enp2s0
Starting pod/rhoso1-debug-59vhq ...
To use host binaries, run `chroot /host`
3: enp2s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:67:f0:e8 brd ff:ff:ff:ff:ff:ff
    inet 192.168.122.10/24 brd 192.168.122.255 scope global noprefixroute enp2s0
       valid_lft forever preferred_lft forever

[root@compute18-node2 ~]# sudo dig +noall +answer +additional ovsdbserver-sb.openstack.svc @192.168.122.80
ovsdbserver-sb.openstack.svc. 0 IN      A       172.17.0.33
ovsdbserver-sb.openstack.svc. 0 IN      A       172.17.0.36
ovsdbserver-sb.openstack.svc. 0 IN      A       172.17.0.37
