OpenShift Bugs / OCPBUGS-3612

[IPI] Baremetal configure-ovs.sh script fails to start secondary bridge br-ex1

    • Important
    • None
    • CNF Network Sprint 230
    • 1
    • Rejected
    • False

      Description of problem:

      OCP 4.12 deployments that use a secondary bridge (br-ex1) for CNI fail to start the ovs-configuration service on some nodes, with multiple errors in the service log.
      

      Version-Release number of selected component (if applicable):

      OpenShift 4.12.0-rc.0 (2022-11-10)
      

      How reproducible:

      So far, every deployment has had at least one of the four workers fail. The failing node is not always the same, and sometimes several nodes fail.
      

      Steps to Reproduce:

      1. Prepare the provisioning node for the IPI installation
         - RHEL 8 (haproxy, named, mirror registry, rhcos_cache_server, ...)
      
      2. Configure install-config.yaml (attached)
         - provisioningNetwork: enabled
         - machine network: single stack ipv4
         - disconnected installation
         - ovn-kubernetes with hybrid-networking setup
         - LACP bonding setup using MC manifests at day1
           * bond0 -> baremetal 192.168.32.0/24 (br-ex)
           * bond0.662  -> interface for secondary bridge (br-ex1) 192.168.66.128/26
         - secondary bridge defined in /etc/ovnk/extra_bridge using MC Manifest
         
      3. Deploy the cluster
         - The deployment usually completes
         - Nodes show Ready status, but on some nodes ovs-configuration fails
         - Subsequent MachineConfig changes fail because the MCP cannot roll out configurations to the nodes with the failure
      
      NOTE: This impacts testing for our partners Verizon and F5, because we are validating their CNFs before the OCP 4.12 release and we need a secondary bridge for CNI.
      

      Actual results:

      br-ex1 and all of its related OVS ports and interfaces fail to activate, and the ovs-configuration service fails.
      

      Expected results:

      br-ex1 and all of its related OVS ports and interfaces activate successfully, and the ovs-configuration service starts.
      

      Additional info:
      1. Nodes and MCP info

      $ oc get nodes
      NAME       STATUS   ROLES                  AGE     VERSION
      master-0   Ready    control-plane,master   7h59m   v1.25.2+f33d98e
      master-1   Ready    control-plane,master   7h59m   v1.25.2+f33d98e
      master-2   Ready    control-plane,master   8h      v1.25.2+f33d98e
      worker-0   Ready    worker                 7h26m   v1.25.2+f33d98e
      worker-1   Ready    worker                 7h25m   v1.25.2+f33d98e
      worker-2   Ready    worker                 7h25m   v1.25.2+f33d98e
      worker-3   Ready    worker                 7h25m   v1.25.2+f33d98e
      $ oc get mcp
      NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE                         
      master   rendered-master-210a69a0b40162b2f349ea3a5b5819e5   True      False      False      3              3                   3                     0                      7h57m                       
      worker   rendered-worker-e8a62c86ce16e98e45e3166847484cf0   False     True       True       4              2                   2                     1                      7h57m 
      

      2. When logging in to the nodes via SSH, the login banner lists ovs-configuration.service among the failed units, and the ovs-configuration service log shows the following errors (full log attached as worker-0-ovs-configuration.log):

      $ ssh core@worker-0
      ---
      Last login: Sat Nov 12 21:33:58 2022 from 192.168.62.10
      [systemd]
      Failed Units: 3
        NetworkManager-wait-online.service
        ovs-configuration.service
        stalld.service
      
      [core@worker-0 ~]$ sudo journalctl -u ovs-configuration | less
      ...
      Nov 12 15:27:54 worker-0 configure-ovs.sh[8237]: Error: invalid field 'connection.type'; allowed fields: NAME,UUID,TYPE,TIMESTAMP,TIMESTAMP-REAL,AUTOCONNECT,AUTOCONNECT-PRIORITY,READONLY,DBUS-PATH,ACT>
      Nov 12 15:27:54 worker-0 configure-ovs.sh[5576]: + '[' == vlan ']'
      Nov 12 15:27:54 worker-0 configure-ovs.sh[5576]: /usr/local/bin/configure-ovs.sh: line 178: [: ==: unary operator expected
      Nov 12 15:27:54 worker-0 configure-ovs.sh[8241]: ++ nmcli --get-values connection.type conn show
      Nov 12 15:27:54 worker-0 configure-ovs.sh[8241]: Error: invalid field 'connection.type'; allowed fields: NAME,UUID,TYPE,TIMESTAMP,TIMESTAMP-REAL,AUTOCONNECT,AUTOCONNECT-PRIORITY,READONLY,DBUS-PATH,ACT>
      Nov 12 15:27:54 worker-0 configure-ovs.sh[5576]: + '[' == bond ']'
      Nov 12 15:27:54 worker-0 configure-ovs.sh[5576]: /usr/local/bin/configure-ovs.sh: line 191: [: ==: unary operator expected
      Nov 12 15:27:54 worker-0 configure-ovs.sh[8245]: ++ nmcli --get-values connection.type conn show
      Nov 12 15:27:54 worker-0 configure-ovs.sh[8245]: Error: invalid field 'connection.type'; allowed fields: NAME,UUID,TYPE,TIMESTAMP,TIMESTAMP-REAL,AUTOCONNECT,AUTOCONNECT-PRIORITY,READONLY,DBUS-PATH,ACT>
      Nov 12 15:27:54 worker-0 configure-ovs.sh[5576]: + '[' == team ']'
      Nov 12 15:27:54 worker-0 configure-ovs.sh[5576]: /usr/local/bin/configure-ovs.sh: line 203: [: ==: unary operator expected
      Nov 12 15:27:54 worker-0 configure-ovs.sh[5576]: + iface_type=802-3-ethernet
      Nov 12 15:27:54 worker-0 configure-ovs.sh[5576]: + '[' '!' '' = 0 ']'
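
The trace above shows `nmcli --get-values connection.type conn show` being called without a connection name, and the `[` tests that follow it receive an empty left operand. The sketch below reproduces that failure mode in isolation and shows that quoting the operand keeps the test well-formed. It is an illustration only: `conn_type` and `is_type` are hypothetical names, not taken from configure-ovs.sh, and quoting would only silence the shell error, not explain why the connection name is empty.

```shell
#!/usr/bin/env bash
# Illustration only; names here are hypothetical, not from configure-ovs.sh.

# Return 0 when the connection type equals the expected value.
# Quoting keeps the test valid even when the actual value is empty.
is_type() {
  [ "$1" = "$2" ]
}

conn_type=""   # simulates the empty value implied by the journal trace

# Unquoted: the empty variable vanishes, leaving `[ == vlan ]`,
# which the shell rejects as a malformed test (status greater than 1).
if [ $conn_type == vlan ] 2>/dev/null; then
  unquoted_status=0
else
  unquoted_status=$?
fi

# Quoted: the test is well-formed and simply evaluates to false (status 1).
if is_type "$conn_type" vlan; then
  quoted_status=0
else
  quoted_status=$?
fi

echo "unquoted status: $unquoted_status, quoted status: $quoted_status"
```

The distinction matters when reading the journal: a status of 1 means "not this type", while a status above 1 means the test itself could not be evaluated.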
      

      3. On the failed node (worker-0), the ovs-if-phys1 connection has type ethernet, whereas on a working node (worker-1) the same connection has type vlan and includes the [vlan] section:

      [core@worker-0 ~]$ sudo cat /etc/NetworkManager/system-connections/ovs-if-phys1.nmconnection                                                                                                            
      [connection]
      id=ovs-if-phys1
      uuid=aea14dc9-2d0c-4320-9c13-ddf3e64747bf
      type=ethernet
      autoconnect=false
      autoconnect-priority=100
      autoconnect-slaves=1
      interface-name=bond0.662
      master=e61c56f7-f3ba-40f7-a1c1-37921fc6c815
      slave-type=ovs-port
      
      [ethernet]
      cloned-mac-address=B8:83:03:91:C5:2C
      mtu=1500
      
      [ovs-interface]
      type=system
      
      [core@worker-1 ~]$ sudo cat /etc/NetworkManager/system-connections/ovs-if-phys1.nmconnection
      [connection]
      id=ovs-if-phys1
      uuid=9a019885-3cc1-4961-9dfa-6b7f996556c4
      type=vlan
      autoconnect-priority=100
      autoconnect-slaves=1
      interface-name=bond0.662
      master=877acf53-87d7-4cdf-a078-000af4f962c3
      slave-type=ovs-port
      timestamp=1668265640
      
      [ethernet]
      cloned-mac-address=B8:83:03:91:C5:E8
      mtu=9000
      
      [ovs-interface]
      type=system
      
      [vlan]
      flags=1
      id=662
      parent=bond0
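
To spot this difference without reading each profile by hand, one could extract the `type` key from the `[connection]` section of the keyfile on each node. A minimal sketch, assuming the simple `key=value` layout shown above; `keyfile_get` is a hypothetical helper, and the sample file is an abbreviated copy of the worker-0 profile:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the value of KEY inside [SECTION] of a
# NetworkManager keyfile. Only handles plain `key=value` lines.
# Usage: keyfile_get FILE SECTION KEY
keyfile_get() {
  awk -F= -v section="[$2]" -v key="$3" '
    /^\[/ { in_section = ($0 == section); next }
    in_section && $1 == key { print substr($0, length(key) + 2); exit }
  ' "$1"
}

# Abbreviated copy of the worker-0 profile, used as sample input.
sample=$(mktemp)
cat > "$sample" <<'EOF'
[connection]
id=ovs-if-phys1
type=ethernet
interface-name=bond0.662

[ovs-interface]
type=system
EOF

phys1_type=$(keyfile_get "$sample" connection type)
rm -f "$sample"

# The working node shows type=vlan for this profile, so anything else is suspect.
if [ "$phys1_type" != "vlan" ]; then
  echo "ovs-if-phys1 has type '$phys1_type', expected 'vlan'"
fi
```

On a node, the same function could be pointed at /etc/NetworkManager/system-connections/ovs-if-phys1.nmconnection, which needs root to read. The section check matters because `type=` also appears under `[ovs-interface]`.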
      

      4. Another problem we observe is that we explicitly disable IPv6 in the bond0.662 connection, but the generated connection for br-ex1 has IPv6 method=auto when it should be disabled.

      [core@worker-0 ~]$ sudo cat /etc/NetworkManager/system-connections/bond0.662.nmconnection 
      [connection]
      id=bond0.662
      type=vlan
      interface-name=bond0.662
      autoconnect=true
      autoconnect-priority=99
      
      [vlan]
      parent=bond0
      id=662
      
      [ethernet]
      mtu=9000
      
      [ipv4]
      method=auto
      dhcp-timeout=2147483647
      never-default=true
      
      [ipv6]
      method=disabled
      never-default=true
      
      [core@worker-0 ~]$ sudo cat /etc/NetworkManager/system-connections/br-ex1.nmconnection
      [connection]
      id=br-ex1
      uuid=df67dcd9-4263-4707-9abc-eda16e75ea0d
      type=ovs-bridge
      autoconnect=false
      autoconnect-slaves=1
      interface-name=br-ex1
      
      [ethernet]
      mtu=1500
      
      [ovs-bridge]
      
      [ipv4]
      method=auto
      
      [ipv6]
      addr-gen-mode=stable-privacy
      method=auto
      
      [proxy]
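
A small sketch of how the IPv6 method of the two profiles could be compared. `ipv6_method` is a hypothetical helper, and the two sample files are abbreviated copies of the profiles shown above:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the `method` value of the [ipv6] section
# of a NetworkManager keyfile.
ipv6_method() {
  local line in_ipv6=0
  while IFS= read -r line; do
    case "$line" in
      "[ipv6]") in_ipv6=1 ;;
      "["*)     in_ipv6=0 ;;   # any other section header ends [ipv6]
      method=*)
        if [ "$in_ipv6" -eq 1 ]; then
          printf '%s\n' "${line#method=}"
        fi
        ;;
    esac
  done < "$1"
}

workdir=$(mktemp -d)

# Abbreviated copy of the source VLAN profile.
cat > "$workdir/bond0.662.nmconnection" <<'EOF'
[connection]
id=bond0.662
type=vlan

[ipv4]
method=auto

[ipv6]
method=disabled
never-default=true
EOF

# Abbreviated copy of the generated bridge profile.
cat > "$workdir/br-ex1.nmconnection" <<'EOF'
[connection]
id=br-ex1
type=ovs-bridge

[ipv4]
method=auto

[ipv6]
addr-gen-mode=stable-privacy
method=auto
EOF

source_method=$(ipv6_method "$workdir/bond0.662.nmconnection")
bridge_method=$(ipv6_method "$workdir/br-ex1.nmconnection")
rm -rf "$workdir"

if [ "$source_method" != "$bridge_method" ]; then
  echo "ipv6.method mismatch: bond0.662=$source_method br-ex1=$bridge_method"
fi
```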
      

      5. All journals, the must-gather, and some deployment files are available in our CI console (log in with Red Hat SSO): https://www.distributed-ci.io/jobs/46459571-900f-43df-8798-d36b322d26f4/files
      We have also attached some of the logs to make triage easier. The worker-0 files are from the node with the OVS issues; the worker-1 files are from a healthy worker, for comparison.

      11_master-bonding.yaml
      11_worker-bonding.yaml
      install-config.yaml
      journal-worker-0.log
      journal-worker-1.log
      must_gather.tar.gz
      sosreport-worker-0-2022-11-12-csbyqfe.tar.xz
      sosreport-worker-1-2022-11-12-ubltjdn.tar.xz
      worker-0-ip-nmcli-info.log
      worker-0-ovs-configuration.log
      worker-1-ip-nmcli-info.log
      worker-1-ovs-configuration.log
      

      Please let us know if you need any additional information.

        Attachments:
        1. 11_master-bonding.yaml (6 kB)
        2. 11_worker-bonding.yaml (6 kB)
        3. install-config.yaml (4 kB)
        4. journal-worker-0.log (41.50 MB)
        5. journal-worker-1.log (3.06 MB)
        6. must_gather.tar.gz (34.94 MB)
        7. ocp4.10-journal-worker-0.log (7.48 MB)
        8. ocp4.10-journal-worker-1.log (7.53 MB)
        9. sosreport-worker-0-2022-11-12-csbyqfe.tar.xz (22.47 MB)
        10. sosreport-worker-0-OCPBUGS-3612-2022-12-19-tyfysrp.tar.xz (17.03 MB)
        11. sosreport-worker-1-2022-11-12-ubltjdn.tar.xz (17.79 MB)
        12. worker-0-ip-nmcli-info.log (12 kB)
        13. worker-0-ovs-configuration.log (432 kB)
        14. worker-1-ip-nmcli-info.log (19 kB)
        15. worker-1-ovs-configuration.log (195 kB)

            apanatto@redhat.com Andrea Panattoni
            rhn-gps-manrodri Manuel Rodriguez
            Zhanqi Zhao Zhanqi Zhao