OpenShift Bugs / OCPBUGS-15362

keepalived multicast to unicast migration may be halted if a node has more than one role


    • Type: Bug
    • Resolution: Duplicate
    • Priority: Undefined
    • None
    • 4.11
    • None
    • Important
    • No
    • Proposed
    • False

      Description of problem:

      Suppose we start with a 4.10 cluster where some nodes have both the infra and worker roles (or the master and worker roles) while others have only the worker role. The infra nodes are managed by their own infra MCP, which inherits the default worker MachineConfigs (in the supported way).
      
      If we upgrade such a cluster to 4.11, the upgrade completes but keepalived-monitor never performs the multicast to unicast migration, because it mistakenly believes that the upgrade has not finished. This can be confirmed by log messages like the following:
      
      time="2023-06-23T14:40:00Z" level=info msg="Failed to retrieve upgrade status or Upgrade still running" err="<nil>" upgradeRunning=true
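
      To spot this condition in collected keepalived-monitor logs, a line like the one above can be split into its key=value fields. The following is only a minimal sketch: the helper names `parse_logfmt` and `reports_upgrade_running` are hypothetical, and it assumes the line uses the quoted key=value layout shown in this report.

```python
import shlex


def parse_logfmt(line: str) -> dict[str, str]:
    """Split a key=value log line into a dict, honouring double-quoted values."""
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # ignore tokens that carry no '='
            fields[key] = value
    return fields


def reports_upgrade_running(line: str) -> bool:
    """True when the line carries upgradeRunning=true."""
    return parse_logfmt(line).get("upgradeRunning") == "true"


# Log line copied from this report.
SAMPLE = (
    'time="2023-06-23T14:40:00Z" level=info '
    'msg="Failed to retrieve upgrade status or Upgrade still running" '
    'err="<nil>" upgradeRunning=true'
)

if __name__ == "__main__":
    print(reports_upgrade_running(SAMPLE))  # prints True
```

      If this keeps returning True long after the upgrade has completed, the cluster is likely in the state described here.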
      
      If we temporarily remove one of the roles, so that each node has one role only, the migration eventually happens and ends successfully.
      
      

      Version-Release number of selected component (if applicable):

      Any 4.11, tested on the current latest 4.11.43.
      

      How reproducible:

      Always
      

      Steps to Reproduce:

      1. Install a 4.10 cluster on a platform that uses keepalived.
      2. Label one worker with the "node-role.kubernetes.io/infra" label, so that it is both infra and worker (it has both the "node-role.kubernetes.io/infra" and "node-role.kubernetes.io/worker" labels).
      3. Create an infra MCP that adopts nodes with the "node-role.kubernetes.io/infra" label.
      4. Update to 4.11.
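
      For step 3, the infra MCP could look roughly like the manifest generated below. This is a sketch under assumptions: the report does not include the actual manifest, so the layout follows the commonly documented custom-pool pattern (a pool that selects both the worker and infra MachineConfigs), and `build_infra_mcp` is a hypothetical helper name.

```python
import json


def build_infra_mcp(name: str = "infra") -> dict:
    """Build a MachineConfigPool manifest for a custom pool that inherits worker configs."""
    role_label = f"node-role.kubernetes.io/{name}"
    return {
        "apiVersion": "machineconfiguration.openshift.io/v1",
        "kind": "MachineConfigPool",
        "metadata": {"name": name},
        "spec": {
            # Select MachineConfigs of both the worker role and the custom role,
            # so the pool inherits the default worker configuration.
            "machineConfigSelector": {
                "matchExpressions": [
                    {
                        "key": "machineconfiguration.openshift.io/role",
                        "operator": "In",
                        "values": ["worker", name],
                    }
                ]
            },
            # Adopt nodes that carry the custom role label (assumed empty value).
            "nodeSelector": {"matchLabels": {role_label: ""}},
        },
    }


if __name__ == "__main__":
    # JSON output can be saved to a file and applied like any other manifest.
    print(json.dumps(build_infra_mcp(), indent=2))
```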
      

      Actual results:

      After the upgrade completes, the unicast migration never happens because the keepalived-monitor pods believe that the upgrade has not finished. As a workaround, we have to remove roles from nodes until each node has only one role, so that keepalived-monitor detects the end of the upgrade.
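
      To find which nodes are affected, one can look for nodes that carry more than one "node-role.kubernetes.io/" label. The sketch below works on a plain mapping of node name to labels; the node names and the helper names `roles_of` and `multi_role_nodes` are hypothetical, and how the labels are collected from the cluster is left out.

```python
ROLE_PREFIX = "node-role.kubernetes.io/"


def roles_of(labels: dict[str, str]) -> list[str]:
    """Return the sorted role names encoded in a node's labels."""
    return sorted(
        key[len(ROLE_PREFIX):] for key in labels if key.startswith(ROLE_PREFIX)
    )


def multi_role_nodes(nodes: dict[str, dict[str, str]]) -> dict[str, list[str]]:
    """Map each node that has more than one role to its list of roles."""
    result = {}
    for name, labels in nodes.items():
        roles = roles_of(labels)
        if len(roles) > 1:
            result[name] = roles
    return result


# Hypothetical label data mirroring the scenario in this report.
SAMPLE_NODES = {
    "worker-0": {
        "node-role.kubernetes.io/infra": "",
        "node-role.kubernetes.io/worker": "",
    },
    "worker-1": {"node-role.kubernetes.io/worker": ""},
    "master-0": {"node-role.kubernetes.io/master": ""},
}

if __name__ == "__main__":
    print(multi_role_nodes(SAMPLE_NODES))  # {'worker-0': ['infra', 'worker']}
```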
      

      Expected results:

      The unicast migration happens once the upgrade has finished, even if some nodes have more than one role.
      

      Additional info:

      A source code analysis of why I believe this happens will follow in a bug comment.
      

            bnemec@redhat.com Benjamin Nemec
            rhn-support-palonsor Pablo Alonso Rodriguez
            Zhanqi Zhao Zhanqi Zhao