OpenShift Bugs: OCPBUGS-76597

metal3 provisioningInterface cannot be shared with metallb L2Advertisement interface

    • CNF Network Sprint 285

      Original problem statement: An external service provided by MetalLB periodically becomes unavailable to clients.
      
      Summary of issue: When metal3 provisioning is set to "Managed", it changes the IP address of the provisioningInterface to the provisioningIP. When the metal3 pod is relocated, it removes the provisioningIP and leaves the OCP node without an IP on the interface. If the active MetalLB speaker (L2 mode) is co-located on the same node, the associated service breaks until an IP address is manually restored or the MetalLB speaker is moved.
      
      Details and reproducer:
      
      Environment: OCP 4.18 consolidated cluster with RHOSO 18
      
      $ oc get nodes
      NAME     STATUS   ROLES                         AGE    VERSION
      rhoso1   Ready    control-plane,master,worker   211d   v1.31.14
      rhoso2   Ready    control-plane,master,worker   211d   v1.31.14
      rhoso3   Ready    control-plane,master,worker   211d   v1.31.14
      
      
      $ oc get clusterversion
      NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.18.30   True        False         61d     Cluster version is 4.18.30
      
      $ oc get openstackversion
      NAME                      TARGET VERSION            AVAILABLE VERSION         DEPLOYED VERSION
      openstack-control-plane   18.0.15-20251126.192455   18.0.15-20251126.192455   18.0.15-20251126.192455
      
      
      The affected OCP service (type LoadBalancer) is dnsmasq, exposed on the "ctlplane" network as configured by the RHOSO operator.
      
      $ oc get service -n openstack -l service=dnsmasq
      NAME          TYPE           CLUSTER-IP      EXTERNAL-IP      PORT(S)                     AGE
      dnsmasq-dns   LoadBalancer   172.30.202.45   192.168.122.80   53:32264/UDP,53:32264/TCP   189d
      
      $ oc describe service dnsmasq-dns 
      Name:                     dnsmasq-dns
      Namespace:                openstack
      Labels:                   service=dnsmasq
      Annotations:              core.openstack.org/ingress_create: false
                                metallb.io/ip-allocated-from-pool: ctlplane
                                metallb.universe.tf/address-pool: ctlplane
                                metallb.universe.tf/allow-shared-ip: ctlplane
                                metallb.universe.tf/loadBalancerIPs: 192.168.122.80
      Selector:                 service=dnsmasq
      [...]
      
      
      MetalLB L2Advertisement:
      
      apiVersion: metallb.io/v1beta1
      kind: L2Advertisement
      metadata:
        name: ctlplane
        namespace: metallb-system
      spec:
        ipAddressPools:
        - ctlplane
        interfaces:
        - enp2s0
      
      
      NNCP example for ctlplane network:
      
      apiVersion: nmstate.io/v1
      kind: NodeNetworkConfigurationPolicy
      metadata:
        name: nncp-rhoso1
      spec:
        desiredState:
          interfaces:
          - description: Configuring enp2s0
            ipv4:
              address:
              - ip: 192.168.122.10
                prefix-length: 24
              enabled: true
              dhcp: false
            ipv6:
              enabled: false
            mtu: 1500
            name: enp2s0
            state: up
            type: ethernet
      
      Original ctlplane IPs for each node:
      
      $ oc get nns -o yaml |grep 192.168.122
                - ip: 192.168.122.10
                - ip: 192.168.122.11
                - ip: 192.168.122.12
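
      The `oc get nns` check used in this report can be turned into a small helper that names the expected ctlplane IPs that are no longer present. This is a minimal sketch, not part of the original report: the function name missing_ips is hypothetical, and it assumes the NodeNetworkState YAML lists addresses as "- ip: <addr>" entries, as in the output above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every expected IP (given as arguments) that does
# not appear as an "- ip: <addr>" entry in the YAML read from stdin.
missing_ips() {
    local observed ip
    # Collect the addresses found in the input, one per line.
    observed=$(sed -n 's/^[[:space:]]*- ip:[[:space:]]*//p')
    for ip in "$@"; do
        # -x matches the whole line and -F disables regex matching,
        # so 192.168.122.1 is not confused with 192.168.122.10.
        printf '%s\n' "$observed" | grep -qxF "$ip" || printf '%s\n' "$ip"
    done
}

# Possible usage against a live cluster:
#   oc get nns -o yaml | missing_ips 192.168.122.10 192.168.122.11 192.168.122.12
```

      Empty output means that every expected address is still configured on some node.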
                
      The DNS client on an external compute node works:
      
      [root@compute18-node2 ~]# dig +noall +answer +additional ovsdbserver-sb.openstack.svc @192.168.122.80
      ovsdbserver-sb.openstack.svc. 0 IN      A       172.17.0.33
      ovsdbserver-sb.openstack.svc. 0 IN      A       172.17.0.36
      ovsdbserver-sb.openstack.svc. 0 IN      A       172.17.0.37
      
      Original metal3 Provisioning CR:
      
      apiVersion: v1
      items:
      - apiVersion: metal3.io/v1alpha1
        kind: Provisioning
        metadata:
          finalizers:
          - provisioning.metal3.io
          name: provisioning-configuration
        spec:
          preProvisioningOSDownloadURLs: {}
          provisioningMacAddresses:
          - 52:54:00:67:f0:e8
          - 52:54:00:73:f7:6b
          - 52:54:00:55:37:34
          provisioningNetwork: Disabled
          virtualMediaViaExternalNetwork: true
          watchAllNamespaces: true
      kind: List
      metadata:
        resourceVersion: ""
      
      Apply the following config to change metal3 to 'Managed':
      
      $ cat provisioning-managed.yaml 
      apiVersion: v1
      items:
      - apiVersion: metal3.io/v1alpha1
        kind: Provisioning
        metadata:
          finalizers:
          - provisioning.metal3.io
          name: provisioning-configuration
        spec:
          preProvisioningOSDownloadURLs: {}
          provisioningMacAddresses:
          - 52:54:00:67:f0:e8
          - 52:54:00:73:f7:6b
          - 52:54:00:55:37:34
          provisioningNetwork: Managed
          provisioningInterface: enp2s0
          provisioningIP: 192.168.122.2
          provisioningNetworkCIDR: 192.168.122.0/24
          provisioningDHCPRange: 192.168.122.211,192.168.122.230
          virtualMediaViaExternalNetwork: false
          watchAllNamespaces: true
      kind: List
      metadata:
        resourceVersion: ""
      
      $ oc apply -f provisioning-managed.yaml
      provisioning.metal3.io/provisioning-configuration configured
      
      Afterwards, the ctlplane IP on one of the OCP nodes changes to the provisioningIP:
      
      $ oc get nns -o yaml |grep 192.168.122
             - ip: 192.168.122.2
             - ip: 192.168.122.11
             - ip: 192.168.122.12
      
      The service still works:
      
      [root@compute18-node2 ~]# sudo dig +noall +answer +additional ovsdbserver-sb.openstack.svc
      ovsdbserver-sb.openstack.svc. 0 IN      A       172.17.0.36
      ovsdbserver-sb.openstack.svc. 0 IN      A       172.17.0.37
      ovsdbserver-sb.openstack.svc. 0 IN      A       172.17.0.33
      
      Check the MAC address of the active speaker:
      [root@compute18-node2 ~]# arping -I enp1s0 192.168.122.80 -c 1
      ARPING 192.168.122.80 from 192.168.122.121 enp1s0
      Unicast reply from 192.168.122.80 [52:54:00:67:F0:E8]  0.859ms
      
      The metal3 pod is on node rhoso1:
      
      $ oc get pods -n openshift-machine-api  -l k8s-app=metal3  -o wide
      NAME                                          READY   STATUS    RESTARTS   AGE     IP             NODE     NOMINATED NODE   READINESS GATES
      metal3-84c4d8d884-9lkvm                       5/5     Running   0          3m48s   10.28.1.11     rhoso1   <none>           <none>
      
      rhoso1 has the same MAC address, so it is the active speaker (I think you can also enable debug logging on the MetalLB speaker to see the ARP responses):
      
      $ oc exec -ti -n metallb-system -c speaker speaker-lhjjh  -- ip link ls dev enp2s0|grep ether
          link/ether 52:54:00:67:f0:e8 brd ff:ff:ff:ff:ff:ff
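
      The MAC comparison above is done by eye, and the two tools print the address in different letter case. A minimal sketch of normalizing both values for comparison follows; the function names arping_mac and link_mac are hypothetical, and the parsing assumes the arping and `ip link` output formats shown in this report.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the first MAC found in [brackets] in arping
# output read from stdin, lowercased.
arping_mac() {
    awk '!found && match($0, /\[[0-9A-Fa-f:]+\]/) {
        # Strip the surrounding brackets from the matched text.
        print tolower(substr($0, RSTART + 1, RLENGTH - 2)); found = 1
    }'
}

# Hypothetical helper: print the first MAC that follows "link/ether" in
# `ip link` output read from stdin, lowercased.
link_mac() {
    awk '$1 == "link/ether" && !found { print tolower($2); found = 1 }'
}

# Possible usage (commands taken from this report):
#   vip_mac=$(arping -I enp1s0 192.168.122.80 -c 1 | arping_mac)
#   node_mac=$(oc exec -n metallb-system -c speaker speaker-lhjjh -- ip link ls dev enp2s0 | link_mac)
#   [ "$vip_mac" = "$node_mac" ] && echo "this node hosts the active speaker"
```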
          
      To reproduce the issue we will force metal3 to move:
      
      $ oc delete pod -n openshift-machine-api metal3-84c4d8d884-9lkvm
      pod "metal3-84c4d8d884-9lkvm" deleted
      
      The metal3 pod is now on rhoso2:
      
      $ oc get pods -n openshift-machine-api  -l k8s-app=metal3  -o wide
      NAME                                          READY   STATUS    RESTARTS   AGE   IP             NODE     NOMINATED NODE   READINESS GATES
      metal3-84c4d8d884-h8qwh                       5/5     Running   0          45s   10.28.1.12     rhoso2   <none>           <none>
      
      DNS client is now broken:
      
      [root@compute18-node2 ~]# sudo dig +noall +answer +additional ovsdbserver-sb.openstack.svc @192.168.122.80
      ;; connection timed out; no servers could be reached
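
      Because the outage is intermittent, a client-side probe can help to timestamp it. The sketch below only classifies dig output as healthy or not; the function name has_a_record is hypothetical, and it assumes the answer format shown in this report (name, TTL, class, type, address).

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the dig output on stdin contains at
# least one "IN A" answer record.
has_a_record() {
    awk '$3 == "IN" && $4 == "A" { ok = 1 } END { exit ok ? 0 : 1 }'
}

# Possible usage from the compute node (+time and +tries keep the probe short):
#   dig +noall +answer +time=2 +tries=1 ovsdbserver-sb.openstack.svc @192.168.122.80 \
#       | has_a_record || echo "$(date -u) dnsmasq VIP unreachable"
```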
      
      The 192.168.122.x IP is missing from the rhoso1 node:
      
      $ oc get nns -o yaml |grep 192.168.122
                - ip: 192.168.122.2
                - ip: 192.168.122.12
      
      $ oc debug node/rhoso1 -- ip  a ls dev enp2s0
      Starting pod/rhoso1-debug-4zl4v ...
      To use host binaries, run `chroot /host`
      3: enp2s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
          link/ether 52:54:00:67:f0:e8 brd ff:ff:ff:ff:ff:ff
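
      The failure condition seen here, an interface that is up but has no IPv4 address, can be checked mechanically. This is a minimal sketch: the function name has_ipv4 is hypothetical, and it assumes the `ip a ls dev <interface>` output format shown in this report.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the `ip a` output on stdin contains at least
# one IPv4 ("inet") address line. "inet6" lines do not match because the
# pattern requires a space and a digit directly after "inet".
has_ipv4() {
    grep -Eq '^[[:space:]]*inet [0-9]'
}

# Possible usage (command taken from this report):
#   oc debug node/rhoso1 -- ip a ls dev enp2s0 | has_ipv4 || echo "enp2s0 on rhoso1 has no IPv4 address"
```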
      
      To fix this, we force the NNCP to re-run and set up the original IP on the rhoso1 node.
      
      $ oc edit nncp/nncp-rhoso1
      nodenetworkconfigurationpolicy.nmstate.io/nncp-rhoso1 edited
      
      $ oc get nncp nncp-rhoso1
      NAME          STATUS      REASON
      nncp-rhoso1   Available   ConfigurationProgressing
      
      The IP is restored and the MetalLB service works again:
      
      $ oc debug node/rhoso1 -- ip  a ls dev enp2s0
      Starting pod/rhoso1-debug-59vhq ...
      To use host binaries, run `chroot /host`
      3: enp2s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
          link/ether 52:54:00:67:f0:e8 brd ff:ff:ff:ff:ff:ff
          inet 192.168.122.10/24 brd 192.168.122.255 scope global noprefixroute enp2s0
             valid_lft forever preferred_lft forever
      
      [root@compute18-node2 ~]# sudo dig +noall +answer +additional ovsdbserver-sb.openstack.svc @192.168.122.80
      ovsdbserver-sb.openstack.svc. 0 IN      A       172.17.0.33
      ovsdbserver-sb.openstack.svc. 0 IN      A       172.17.0.36
      ovsdbserver-sb.openstack.svc. 0 IN      A       172.17.0.37
      

              eterrell@redhat.com Eduardo Otubo
              mflusche@redhat.com Mathew Flusche
              Jad Haj Yahya Jad Haj Yahya