OpenShift Bugs / OCPBUGS-55228

nodeip-configuration.service must run after nmstate

    • Bug
    • Resolution: Done
    • Undefined
    • 4.19
    • 4.13, 4.12, 4.14, 4.15, 4.16, 4.17, 4.18
    • None
    • Quality / Stability / Reliability
    • False
    • 5
    • Moderate
    • OSDOCS Sprint 270
    • 1

      Description of problem:

      nodeip-configuration.service is not configured to run after the nmstate service, so it can run before host networking that is to be configured via the nmstate service has been applied (e.g. as per https://docs.redhat.com/en/documentation/openshift_container_platform/4.18/html/installing_on_bare_metal/installer-provisioned-infrastructure#ipi-install-establishing-communication-between-subnets_ipi-install-installation-workflow).
      
      This can be problematic in bond scenarios, where the following can happen:
      - The node gets some networking configuration via NetworkManager before nmstate can set up the connections properly. Because of that, some interface temporarily has a wrong IP, call it IP1.
      - nodeip-configuration service runs and wrongly deems IP1 to be the right node IP.
      - Then nmstate runs and sets up the right networking. For example, there is now a bond whose IP is IP2. IP2 is the right IP for the node, not IP1.
      - However, because nodeip-configuration service ran before nmstate and stored the wrong IP1 in /run/nodeip-configuration/primary-ip, wait-for-primary-ip service blocks forever trying to use the wrong IP1 instead of the right IP2.
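
The sequence above can be illustrated with a toy model. This is only a sketch of the ordering problem, not the real services: the step names and the `simulate_boot` function are hypothetical, and "IP1"/"IP2" are the placeholders used in this report.

```python
def simulate_boot(order):
    """Replay boot steps in the given order.

    Returns (stored_ip, final_ip): the IP that the nodeip-configuration
    step recorded, and the IP the node actually ends up with.
    """
    current_ip = None  # IP visible on the node at this point in the boot
    stored_ip = None   # what would be written to the primary-ip file
    for step in order:
        if step == "networkmanager-dhcp":
            # A bond member temporarily gets an address via DHCP.
            current_ip = "IP1"
        elif step == "nmstate":
            # nmstate builds the bond, which gets the definitive address.
            current_ip = "IP2"
        elif step == "nodeip-configuration":
            # The node IP is chosen from whatever is configured right now.
            stored_ip = current_ip
        else:
            raise ValueError(f"unknown step: {step}")
    return stored_ip, current_ip


racy = simulate_boot(["networkmanager-dhcp", "nodeip-configuration", "nmstate"])
ordered = simulate_boot(["networkmanager-dhcp", "nmstate", "nodeip-configuration"])
print(racy)     # ('IP1', 'IP2'): stored IP no longer exists on the node
print(ordered)  # ('IP2', 'IP2'): stored IP matches the final configuration
```

In the racy ordering the stored IP and the final IP differ, which is the state in which wait-for-primary-ip never succeeds.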
      
      Possible workarounds:
      - Reboot the node
      - Write the right IP in /run/nodeip-configuration/primary-ip and manually restart wait-for-primary-ip service
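
The second workaround could be scripted roughly as follows. This is a minimal sketch under assumptions: the helper name `fix_primary_ip` is hypothetical, the file is assumed to contain just the bare IP address, and the unit is assumed to be named `wait-for-primary-ip.service`. It would need to run as root on the affected node.

```python
import ipaddress
import subprocess
from pathlib import Path


def fix_primary_ip(correct_ip,
                   ip_file="/run/nodeip-configuration/primary-ip",
                   restart=True):
    """Overwrite the stored primary IP and restart the waiting service.

    Assumption: the file holds only the IP address as text.
    Returns the restart command so callers can log or inspect it.
    """
    # Reject typos early; raises ValueError for a malformed address.
    ipaddress.ip_address(correct_ip)

    Path(ip_file).write_text(correct_ip)

    cmd = ["systemctl", "restart", "wait-for-primary-ip.service"]
    if restart:
        # check=True raises CalledProcessError if systemctl fails.
        subprocess.run(cmd, check=True)
    return cmd
```

For example, `fix_primary_ip("192.0.2.10")` would store a documentation-range address; substitute the bond's real IP (IP2 in the scenario above). Passing `restart=False` only rewrites the file, which is useful for trying the helper against a scratch path.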
      
      Proposed solution: Make nodeip-configuration.service run after the nmstate service, to ensure that the nmstate configuration has been applied before the node IP is chosen.
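
One way to express such an ordering in systemd is an `After=` directive in a drop-in for the unit. The sketch below only generates that drop-in text. It assumes the nmstate unit is called `nmstate.service`; the drop-in file name and the helper names are hypothetical, and the actual fix in the product may be delivered differently (for example, through the machine-config templates).

```python
from pathlib import Path


def render_ordering_dropin(after_unit="nmstate.service"):
    """Return drop-in text that orders a unit after `after_unit`."""
    return f"[Unit]\nAfter={after_unit}\n"


def write_dropin(root,
                 unit="nodeip-configuration.service",
                 after_unit="nmstate.service",
                 name="10-after-nmstate.conf"):
    """Write the drop-in under <root>/etc/systemd/system/<unit>.d/.

    `root` lets the function be tried against a scratch directory
    instead of the live filesystem. Returns the path that was written.
    """
    dropin_dir = Path(root) / "etc" / "systemd" / "system" / f"{unit}.d"
    dropin_dir.mkdir(parents=True, exist_ok=True)
    path = dropin_dir / name
    path.write_text(render_ordering_dropin(after_unit))
    return path
```

Note that `After=` only affects ordering when both units are started in the same transaction; it does not by itself cause the nmstate unit to be started.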
      

      Version-Release number of selected component (if applicable):

      4.18.7
      

      How reproducible:

      Often (it is a race condition), as long as the conditions in the steps to reproduce are met.
      

      Steps to Reproduce:

      1. Start with a cluster that has a bond interface defined via host-level nmstate in machine-config, as per https://docs.redhat.com/en/documentation/openshift_container_platform/4.18/html/installing_on_bare_metal/installer-provisioned-infrastructure#ipi-install-establishing-communication-between-subnets_ipi-install-installation-workflow . The environment uses DHCP, and the DHCP server might assign IPs to the individual interfaces that are part of the bond before the bond is configured via nmstate.
      2. Add a new node.
      

      Actual results:

      Quite often, nodes end up stuck forever in wait-for-primary-ip service because nodeip-configuration ran too early, so it wrote one of the wrong IPs that the node interfaces temporarily had.
      

      Expected results:

      nodeip-configuration runs after the nmstate service, so that it always sees the definitive network configuration.
      

      Additional info:

      I found this while doing an internal reproducer, but I wouldn't be surprised if some production user eventually finds this too.
      

              dfitzmau@redhat.com Darragh Fitzmaurice
              rhn-support-palonsor Pablo Alonso Rodriguez
              Ross Brattain Ross Brattain