OpenShift Bugs: OCPBUGS-15362

keepalived multicast to unicast migration may be halted if a node has more than one role


    • Type: Bug
    • Resolution: Duplicate
    • Priority: Undefined
    • None
    • 4.11
    • None
    • Important
    • No
    • Proposed
    • False



      Description of problem:

      Let's say we start with a 4.10 cluster where some nodes have both the infra and worker roles (or the master and worker roles) while others have only the worker role. The infra nodes are managed by their own infra MCP, which inherits the default worker machineconfigs (in the supported way).
      In such a cluster, if we upgrade to 4.11, the upgrade completes but keepalived-monitor never performs the migration, because it mistakenly believes that the upgrade has never ended. We can confirm this with log messages like the following:
      time="2023-06-23T14:40:00Z" level=info msg="Failed to retrieve upgrade status or Upgrade still running" err="<nil>" upgradeRunning=true
      If we temporarily remove one of the roles, so that each node has one role only, the migration eventually happens and ends successfully.
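
To check many keepalived-monitor log lines for this symptom, the key=value format of the message above can be parsed mechanically. The following is a minimal Python sketch, not part of keepalived-monitor itself; the helper names `parse_logfmt` and `reports_upgrade_running` are hypothetical, and it assumes every line follows the same quoting style as the sample above.

```python
import shlex


def parse_logfmt(line):
    """Split a key=value log line into a dict, honoring double-quoted values."""
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # ignore tokens that are not key=value pairs
            fields[key] = value
    return fields


def reports_upgrade_running(line):
    """True if the line carries upgradeRunning=true, as in the message above."""
    return parse_logfmt(line).get("upgradeRunning") == "true"


# The log line quoted in this report.
LINE = (
    'time="2023-06-23T14:40:00Z" level=info '
    'msg="Failed to retrieve upgrade status or Upgrade still running" '
    'err="<nil>" upgradeRunning=true'
)

if __name__ == "__main__":
    print(reports_upgrade_running(LINE))
```

If this keeps returning true for recent lines long after the upgrade has completed, the cluster is likely showing the behavior described here.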

      Version-Release number of selected component (if applicable):

      Any 4.11, tested on the current latest 4.11.43.

      How reproducible:


      Steps to Reproduce:

      1. Install a cluster in 4.10 on a platform that uses keepalived.
      2. Label one worker with "node-role.kubernetes.io/infra" label, so it is both infra and worker (it has both "node-role.kubernetes.io/infra" and "node-role.kubernetes.io/worker" labels).
      3. Create an infra MCP that adopts nodes with the "node-role.kubernetes.io/infra" label.
      4. Update to 4.11.
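
To find which nodes in an existing cluster match the condition set up in step 2, the node labels can be inspected for more than one "node-role.kubernetes.io/" entry. The following is a minimal Python sketch under stated assumptions: the input is shaped like the JSON node list printed by `oc get nodes -o json`, the function names are hypothetical, and the sample data is invented for illustration.

```python
ROLE_PREFIX = "node-role.kubernetes.io/"


def node_roles(labels):
    """Return the sorted role names encoded in a node's label keys."""
    return sorted(
        key[len(ROLE_PREFIX):]
        for key in labels
        if key.startswith(ROLE_PREFIX) and len(key) > len(ROLE_PREFIX)
    )


def multi_role_nodes(node_list):
    """Map node name -> roles for every node that has more than one role."""
    result = {}
    for item in node_list.get("items", []):
        meta = item.get("metadata", {})
        roles = node_roles(meta.get("labels", {}))
        if len(roles) > 1:
            result[meta.get("name", "<unnamed>")] = roles
    return result


# Hypothetical sample; real input would come from the cluster's node list.
SAMPLE = {
    "items": [
        {"metadata": {"name": "worker-0", "labels": {
            "node-role.kubernetes.io/worker": "",
            "node-role.kubernetes.io/infra": "",
        }}},
        {"metadata": {"name": "worker-1", "labels": {
            "node-role.kubernetes.io/worker": "",
            "kubernetes.io/hostname": "worker-1",
        }}},
    ]
}

if __name__ == "__main__":
    for name, roles in multi_role_nodes(SAMPLE).items():
        print(name, ",".join(roles))
```

With real cluster data, the dictionary can be loaded with `json.load` from a saved node list and passed to `multi_role_nodes` in place of `SAMPLE`.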

      Actual results:

      After the upgrade completes, the unicast migration never happens because the keepalived-monitor pods believe that the upgrade hasn't finished. As a workaround, we have to remove roles until each node has only one role; only then does keepalived-monitor detect the end of the upgrade.

      Expected results:

      The unicast migration should happen once the upgrade has finished, even if some nodes have more than one role.

      Additional info:

      A source code analysis explaining why I believe this happens will follow in a bug comment.

            bnemec@redhat.com Benjamin Nemec
            rhn-support-palonsor Pablo Alonso Rodriguez
            Zhanqi Zhao