OpenShift Request For Enhancement / RFE-8310

Ability to change Cluster/Machine Network MTU while minimizing workload disruption


    • Feature Request
    • Resolution: Unresolved
    • Telco Core
    • Product / Portfolio Work

      1. Proposed title of this feature request
      Ability to change cluster MTU while minimizing workload disruption

      2. What is the nature and description of the request?
      Currently, changing the cluster MTU requires a minimum of two reboots and is executed on all nodes of the cluster at the same time.

      3. Why does the customer need this? (List the business requirements here)

      The customer requires a procedure to change the cluster network MTU and the machine network MTU that attempts to minimize workload disruption (e.g., potentially using Machine Config Pools). The main advantages of using a per-Machine Config Pool (MCP) deployment strategy for critical operations such as day-2 MTU updates are:

      • Granular and phased day-2 MTU updates. Machine Config Pools enable the logical subdivision of OpenShift nodes into different groups based on customer planning parameters (e.g., workload requirements).
      • Minimize service interruption and risk. This controlled deployment method is crucial for minimizing interruption of sensitive workloads, such as telco-grade services, during the necessary node reboots.

      Procedure 1: Not used by the partner but proposed by Engineering as potentially doable. Let's say you have MCP A (for master nodes), two MCPs for workloads (MCP B for appworkers and MCP C for other appworkers), MCP D for worker nodes of type gateway, and MCP E for storage nodes. A procedure like this should be doable and documented in order to be supported:

      • Pause MCP A, MCP B, MCP C, MCP D, and MCP E.
      • Prepare the cluster for cluster network and potential machine network MTU migration (setting the MTU migration configuration option)
      • Unpause MCP A, it will be updated (1 reboot of nodes in MCP A)
      • Unpause MCP B, it will be updated (1 reboot of nodes in MCP B)
      • Unpause MCP C, it will be updated (1 reboot of nodes in MCP C)
      • Unpause MCP D, it will be updated (1 reboot of nodes in MCP D)
      • Unpause MCP E, it will be updated (1 reboot of nodes in MCP E)
      • All nodes of the cluster have now completed step 1 of the MTU procedure
      • Pause MCP A, B, C, D, and E again (if needed)
      • Reconfigure the cluster for the new machine network MTU (Nokia MC is working)
      • Unpause MCP A, it will be updated (1 reboot)
      • Unpause MCP B, it will be updated (1 reboot)
      • Unpause MCP C, it will be updated (1 reboot)
      • Unpause MCP D, it will be updated (1 reboot)
      • Unpause MCP E, it will be updated (1 reboot)
      • All nodes of the cluster have now completed step 2 of the MTU procedure (an optional step, needed only if the machine network MTU must be updated)
      • Pause MCP A, MCP B, MCP C, MCP D, and MCP E
      • Take the cluster out of MTU migration (unsetting the MTU migration configuration option to finalize the migration)
      • Unpause MCP A, it will be updated
      • Unpause MCP B, it will be updated
      • Unpause MCP C, it will be updated
      • Unpause MCP D, it will be updated
      • Unpause MCP E, it will be updated
      • All nodes of the cluster have now completed step 3 of the MTU procedure; the migration is done
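
The three phases above can be sketched as a generated command plan. The Python below is a minimal sketch, not a supported procedure: it only builds and prints `oc` commands rather than running them, the pool names `mcp-a` through `mcp-e` are hypothetical stand-ins for MCP A to MCP E, the 30-minute timeout is arbitrary, and the cluster-level change in each phase is left as a placeholder comment.

```python
import json

# Hypothetical pool names standing in for MCP A..E from the procedure.
POOLS = ["mcp-a", "mcp-b", "mcp-c", "mcp-d", "mcp-e"]


def pause_cmd(pool: str, paused: bool) -> str:
    """Build the oc command that sets spec.paused on one MachineConfigPool."""
    patch = json.dumps({"spec": {"paused": paused}})
    return f"oc patch mcp {pool} --type merge --patch '{patch}'"


def wait_cmd(pool: str, timeout: str = "30m") -> str:
    """Build the oc command that blocks until the pool reports Updated."""
    return f"oc wait mcp/{pool} --for=condition=Updated --timeout={timeout}"


def phase(pools: list[str], cluster_change: str) -> list[str]:
    """One phase: pause every pool, note the change, roll out pool by pool."""
    plan = [pause_cmd(p, True) for p in pools]
    plan.append(f"# {cluster_change}")  # placeholder, not a real command
    for p in pools:
        plan.append(pause_cmd(p, False))
        plan.append(wait_cmd(p))
    return plan


def procedure_1(pools: list[str]) -> list[str]:
    """Chain the three phases of procedure 1 into one ordered plan."""
    changes = [
        "step 1: set the MTU migration configuration option",
        "step 2 (optional): reconfigure the machine network MTU",
        "step 3: unset the MTU migration option to finalize",
    ]
    plan: list[str] = []
    for change in changes:
        plan.extend(phase(pools, change))
    return plan


print("\n".join(procedure_1(POOLS)))
```

A real script would likely also need to confirm that a pool has started updating before waiting on `Updated`, since a pool can still report the condition from its previous state immediately after being unpaused.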

      The ask here is to assess the feasibility of procedure 1.

       

      Both procedures below (procedure 2 and procedure 3) have been proposed and requested by the partner. 

      Procedure 2 (DISCARDED): This procedure, requested by the partner, proposes to minimize the number of reboots by executing the MTU migration steps while the MCPs are paused. For workload clusters, the partner is interested in changing the MachineNetwork MTU (9000 to 9100) but not the ClusterNetwork MTU (though the strategy applied in the procedure can be used for both).

      For the case of the spokes, these are the MCPs:

      • MCP A for master nodes
      • MCP B for the first group of worker nodes
      • MCP C for the second group of worker nodes
        ....
      • MCP N for the Nth group of worker nodes

      A procedure like this would aim to set the machine network MTU in only two reboots instead of three. Would this be doable?

      • Pause all MCPs: MCP A (one MCP for all the master nodes), MCP B, MCP C, ..., MCP N
      • Prepare the cluster for MTU migration (setting the MTU migration configuration option, patching the CNO)
      • Reconfigure the cluster for the new MTU using a MachineConfig (in the partner scenario it contains a oneshot service)
      • Unpause MCP A (master nodes) so that the machine config changes of steps 2 and 3 are rendered together and applied to MCP A (only 1 reboot instead of 2, to reduce workload impact)
      • Wait until all nodes in MCP A are updated
      • Unpause MCP B so that the machine config changes of steps 2 and 3 are rendered together and applied to MCP B (only 1 reboot instead of 2, to reduce workload impact)
      • Wait until all nodes in MCP B are updated
      • Repeat for the rest of the MCPs
      • Pause all MCPs
      • Take the cluster out of MTU migration (unsetting the MTU migration configuration option, patching the CNO)
      • Apply a machine config to disable the service that configures the MTU, so that it does not run again at boot time
      • Unpause MCP A, it will be updated
      • Unpause MCP B, it will be updated
      • Unpause .....
      • Unpause MCP N
      • All nodes have now completed step 3 of the MTU procedure; the migration is done
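
The reboot saving that procedure 2 aims for can be illustrated with a toy model. The sketch below is not Machine Config Operator code. It rests on a simplifying assumption, labeled in the code, that changes queued while a pool is paused are folded into a single rollout when the pool is unpaused; whether the real operators behave this way for MTU migration is part of what this request asks to be assessed. The `Pool` class and the change labels are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Pool:
    """Toy stand-in for a MachineConfigPool; not real MCO behaviour."""
    name: str
    paused: bool = False
    pending: list[str] = field(default_factory=list)
    rollouts: int = 0  # each rollout means one reboot per node in the pool

    def queue(self, change: str) -> None:
        """Record a change; an unpaused pool rolls it out immediately."""
        self.pending.append(change)
        if not self.paused:
            self._roll_out()

    def unpause(self) -> None:
        """Unpause the pool and roll out whatever was queued meanwhile."""
        self.paused = False
        if self.pending:
            self._roll_out()

    def _roll_out(self) -> None:
        # ASSUMPTION: all queued changes land in a single rollout.
        self.pending.clear()
        self.rollouts += 1


def rollouts_for(pause_first: bool) -> int:
    """Count rollouts for the first two migration steps on one pool."""
    pool = Pool("mcp-b", paused=pause_first)
    pool.queue("enable MTU migration")
    pool.queue("apply MachineConfig with the new machine MTU")
    pool.unpause()
    return pool.rollouts


print(rollouts_for(pause_first=False))  # 2
print(rollouts_for(pause_first=True))   # 1
```

Under that assumption, pausing before both changes halves the rollouts for these two steps, which matches the "1 reboot instead of 2" noted in the steps above.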

      Procedure 3: For the Hub, the procedure below requires 4 reboots, and the interest is in updating both the ClusterNetwork MTU and the MachineNetwork MTU:

      • Decrease MachineNetwork MTU (from 9126 to 9100)
      • Decrease ClusterNetwork MTU (from 9026 to 8900)

      Sometimes we have 1 MCP (all masters) and sometimes 2 MCPs (but the partner is not pausing anything). Steps follow:

      • Prepare the cluster for MTU migration (setting the MTU migration configuration option, patching the CNO) --> one reboot
      • Reconfigure the cluster for the new MTU using a MachineConfig (in the partner scenario the MachineConfig is based on a oneshot service) --> one reboot
      • Take the cluster out of MTU migration (unsetting the MTU migration configuration option, patching the CNO) --> one reboot
      • Apply a machine config to disable the service that configures the MTU, so that it does not run again at boot time --> one reboot
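
The two CNO patching steps above can be sketched as merge-patch payloads using the Hub values from procedure 3. This is a minimal sketch that assumes OVN-Kubernetes and the `spec.migration.mtu` layout used in the OpenShift MTU-change documentation; the helper names are hypothetical, the commands are only printed, and the payloads should be checked against the documentation for the cluster's version before use.

```python
import json


def start_migration_patch(cluster_from: int, cluster_to: int, machine_to: int) -> str:
    """Payload that puts the cluster network operator into MTU migration."""
    patch = {
        "spec": {
            "migration": {
                "mtu": {
                    "network": {"from": cluster_from, "to": cluster_to},
                    "machine": {"to": machine_to},
                }
            }
        }
    }
    return json.dumps(patch)


def finalize_migration_patch(cluster_to: int) -> str:
    """Payload that clears the migration and records the new cluster MTU.

    ASSUMPTION: OVN-Kubernetes is the network plugin.
    """
    patch = {
        "spec": {
            "migration": None,  # serialized as JSON null, which unsets the field
            "defaultNetwork": {"ovnKubernetesConfig": {"mtu": cluster_to}},
        }
    }
    return json.dumps(patch)


def oc_patch(payload: str) -> str:
    """Wrap a payload in the oc command that patches the operator config."""
    return (
        "oc patch Network.operator.openshift.io cluster "
        f"--type=merge --patch '{payload}'"
    )


# Hub values from procedure 3: ClusterNetwork 9026 -> 8900, MachineNetwork -> 9100.
print(oc_patch(start_migration_patch(9026, 8900, 9100)))
print(oc_patch(finalize_migration_patch(8900)))
```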

      The ask here is to assess the feasibility of procedure 2 and procedure 3 (or similar ones) that minimize workload service interruption, and of having them supported by Red Hat in official documentation.

      4. List any affected packages or components.

      None.

              fbaudin@redhat.com Franck Baudin
              jnunez@redhat.com Jose Nuñez