OpenShift Bugs / OCPBUGS-3712

IPI deployment w/ Provisioning Network & static ip via NMstate fails to deploy



      Description of problem:

      Attempting to perform an IPI baremetal deployment using NMstate to pre-configure the external and provisioning NICs on the cluster nodes. The master nodes deploy, but the worker nodes never do. To get a completed deployment that could be inspected, I moved the networking services to the masters and made them double as worker nodes; with that workaround the IPI deployment finished.
      
      Looking at the deployment, the workers appear to be failing because the provisioning IP never gets configured on the master nodes:
      
      [kni@gvs01 clusterconfigs]$ oc -n openshift-machine-api get pods
      NAME                                           READY   STATUS                  RESTARTS       AGE
      cluster-autoscaler-operator-66466545bd-n9q2j   2/2     Running                 0              29m
      cluster-baremetal-operator-6966c498d7-j6qmq    2/2     Running                 0              29m
      machine-api-controllers-777cc7c6d5-lmsmd       7/7     Running                 0              5m51s
      machine-api-operator-6d8cf76747-t454f          2/2     Running                 0              29m
      metal3-6647f79d64-5rvt8                        0/7     Init:CrashLoopBackOff   5 (2m1s ago)   5m25s
      metal3-image-cache-5h8q6                       1/1     Running                 0              5m4s
      metal3-image-cache-t5jd4                       1/1     Running                 0              5m4s
      metal3-image-cache-tkkf8                       1/1     Running                 0              5m4s
      metal3-image-customization-6c54bdcd96-cg295    1/1     Running                 0              4m33s
      [kni@gvs01 clusterconfigs]$ oc -n openshift-machine-api logs metal3-6647f79d64-5rvt8
      Defaulted container "metal3-baremetal-operator" out of: metal3-baremetal-operator, metal3-httpd, metal3-ironic, metal3-ramdisk-logs, metal3-ironic-inspector, metal3-static-ip-manager, metal3-dnsmasq, metal3-static-ip-set (init), machine-os-images (init)
      Error from server (BadRequest): container "metal3-baremetal-operator" in pod "metal3-6647f79d64-5rvt8" is waiting to start: PodInitializing
      [kni@gvs01 clusterconfigs]$ oc -n openshift-machine-api logs metal3-6647f79d64-5rvt8 -c metal3-static-ip-set
      + '[' -z 172.22.10.3/24 ']'
      + '[' -z eno12399 ']'
      + '[' -n eno12399 ']'
      ++ ip -o addr show dev eno12399 scope global
      + [[ -n 4: eno12399    inet 172.22.10.21/24 brd 172.22.10.255 scope global noprefixroute eno12399\       valid_lft forever preferred_lft forever ]]
      + ip -o addr show dev eno12399 scope global
      + grep -q 172.22.10.3/24
      + echo 'ERROR: "eno12399" is already set to ip address belong to different subset than "172.22.10.3/24"'
      ERROR: "eno12399" is already set to ip address belong to different subset than "172.22.10.3/24"
      + exit 1
      [kni@gvs01 clusterconfigs]$
      
      I think the ERROR message does not match the test that is being done. The test only greps the interface's addresses for the literal PROVISIONING_IP string, so any other address in the same subnet still fails the check:
      
      if ! ip -o addr show dev "${PROVISIONING_INTERFACE}" scope global | grep -q "${PROVISIONING_IP//::*/}" ; then
            echo "ERROR: \"$PROVISIONING_INTERFACE\" is already set to ip address belong to different subset than \"$PROVISIONING_IP\""
            exit 1
      fi
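      
      As a point of comparison, here is a minimal sketch of a check that tests subnet membership instead of grepping for the literal PROVISIONING_IP string (hypothetical, IPv4 only, not the shipped script; the variable values mirror the log above). With a check like this, the existing 172.22.10.21/24 address would satisfy 172.22.10.3/24:
      
      # Hypothetical subnet-aware check (IPv4 only); not the shipped metal3 script.
      PROVISIONING_IP="172.22.10.3/24"
      PROVISIONING_INTERFACE="eno12399"
      
      # Convert a dotted-quad address to a 32-bit integer.
      ip_to_int() {
          local IFS=. ; read -r a b c d <<< "$1"
          echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
      }
      
      prefix_len="${PROVISIONING_IP#*/}"
      mask=$(( (0xffffffff << (32 - prefix_len)) & 0xffffffff ))
      want_net=$(( $(ip_to_int "${PROVISIONING_IP%/*}") & mask ))
      
      same_subnet=false
      # Compare the network of every global IPv4 address on the interface
      # against the network of PROVISIONING_IP.
      while read -r _ _ _ addr _; do
          have_net=$(( $(ip_to_int "${addr%/*}") & mask ))
          [ "$have_net" -eq "$want_net" ] && same_subnet=true
      done < <(ip -o -4 addr show dev "$PROVISIONING_INTERFACE" scope global)
      
      if ! $same_subnet; then
          echo "ERROR: \"$PROVISIONING_INTERFACE\" has no address in the subnet of \"$PROVISIONING_IP\""
          exit 1
      fi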
      
      Logging into a master node shows the NIC was configured properly using NMstate:
      
      [core@openshift-master-0 ~]$ ip a s eno12399
      4: eno12399: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
          link/ether b4:96:91:cc:07:9c brd ff:ff:ff:ff:ff:ff
          inet 172.22.10.20/24 brd 172.22.10.255 scope global noprefixroute eno12399
             valid_lft forever preferred_lft forever
      [core@openshift-master-0 ~]$
      
      And functional:
      [root@openshift-master-0 ~]# ping -I eno12399 172.22.10.21
      PING 172.22.10.21 (172.22.10.21) from 172.22.10.20 eno12399: 56(84) bytes of data.
      64 bytes from 172.22.10.21: icmp_seq=1 ttl=64 time=0.146 ms
      64 bytes from 172.22.10.21: icmp_seq=2 ttl=64 time=0.152 ms
      ^C
      --- 172.22.10.21 ping statistics ---
      2 packets transmitted, 2 received, 0% packet loss, time 1036ms
      rtt min/avg/max/mdev = 0.146/0.149/0.152/0.003 ms
      [root@openshift-master-0 ~]#
      
      And routing looks right:
      
      [core@openshift-master-0 ~]$ ip r
      default via 192.168.66.1 dev br-ex proto static metric 48
      10.128.0.0/23 dev ovn-k8s-mp0 proto kernel scope link src 10.128.0.2
      10.128.0.0/14 via 10.128.0.1 dev ovn-k8s-mp0
      169.254.169.0/30 via 192.168.66.1 dev br-ex
      169.254.169.3 via 10.128.0.1 dev ovn-k8s-mp0
      172.22.10.0/24 dev eno12399 proto kernel scope link src 172.22.10.20 metric 100
      172.30.0.0/16 via 192.168.66.1 dev br-ex mtu 1400
      192.168.66.0/25 dev br-ex proto kernel scope link src 192.168.66.20 metric 48
      [core@openshift-master-0 ~]$
      
      And as you can see, there is already a connected route covering the PROVISIONING_IP subnet (172.22.10.0/24 on eno12399).
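      
      Along the same lines, a simpler sketch of a check that relies on the routing table instead of a literal string match (again hypothetical, not the shipped script; values mirror the outputs above):
      
      # Hypothetical route-based check; values mirror the outputs above.
      PROVISIONING_IP="172.22.10.3/24"
      PROVISIONING_INTERFACE="eno12399"
      
      # "ip route get" resolves the address against the routing table; if the
      # resulting route leaves via the provisioning interface, the connected
      # 172.22.10.0/24 route shown above already covers PROVISIONING_IP.
      if ! ip route get "${PROVISIONING_IP%/*}" 2>/dev/null \
              | grep -q "dev ${PROVISIONING_INTERFACE} "; then
          echo "ERROR: \"$PROVISIONING_INTERFACE\" has no route covering \"$PROVISIONING_IP\""
          exit 1
      fi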
      
      

      Version-Release number of selected component (if applicable):

       

      How reproducible:

      100% of the time

      Steps to Reproduce:

      1. Create an install-config.yaml for an IPI deployment using a provisioning network (an illustrative fragment is sketched below, after these steps)
      2. Use NMstate config to configure both the baremetal and provisioning NICs on the cluster nodes
      3. Provide provisioningNetworkInterface in the install-config.yaml even though bootMACAddress is provided for the nodes (if it is omitted, the deployment fails with a different error, "can't find suitable interfaces", which may be another issue entirely)
      4. Complete the steps detailed in 14.3.8.3 "Optional: Configuring network components to run on the control plane"; otherwise the IPI deployment does not complete at all
      5. Deploy
      
      When the deployment is complete, investigate the metal3-* pods.
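      
      For reference, a fragment of the kind of install-config.yaml used here might look like the following. This is illustrative only: the field layout is the standard baremetal IPI schema, but the host name, MAC, and addresses are taken loosely from the master-0 output above and are assumptions, not the actual cluster config.
      
      platform:
        baremetal:
          provisioningNetworkInterface: eno12399
          provisioningNetworkCIDR: 172.22.10.0/24
          hosts:
            - name: openshift-master-0
              role: master
              bootMACAddress: b4:96:91:cc:07:9c   # MAC of the NIC that boots on the provisioning network
              networkConfig:                      # NMstate applied at deploy time
                interfaces:
                  - name: eno12399
                    type: ethernet
                    state: up
                    ipv4:
                      enabled: true
                      dhcp: false
                      address:
                        - ip: 172.22.10.20
                          prefix-length: 24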

      Actual results:

      Worker nodes never deploy.

      Expected results:

      Worker nodes deploy.

      Additional info:

       

      Attachment: mg.20221116.tgz (12.41 MB, uploaded by Darin Sorrentino)
