  1. Red Hat OpenStack Services on OpenShift
  2. OSPRH-25424

Watcher zone_migration strategy experiences significant performance degradation with large action plans.


    • Bug
    • Resolution: Unresolved
    • Undefined
    • None
    • rhos-18.0.14 FR 4
    • openstack-watcher
    • None
    • Important

      We are performing a post-adoption RHEL minor upgrade (9.2 to 9.4) on an OpenStack environment with 250+ nodes and more than 10,000 VMs. To do this, we are using Watcher's zone_migration strategy for planned hardware maintenance.

      We intend to use the zone_migration strategy to:

      • Evacuate compute nodes in batches (50 nodes per batch).
      • Migrate ~2500 VMs per batch to buffer nodes (using multiple audits).
      • Upgrade RHEL and migrate the VMs back.

      While testing the zone_migration strategy for live migration of VMs during compute node evacuation, we observed a significant performance drop that may become a bottleneck for large-scale migrations. Our test with 52 VMs in a single audit/action plan took approximately 4 hours, whereas the same migrations are expected to take ~30-60 min with the Nova CLI when performed serially (with concurrency, Nova would take much less time).

      Command used to create the audit:

      // Audit creation
      time openstack optimize audit create -a WatcherBufferEvacuation -t ONESHOT -p compute_nodes='[{"src_node":"computer660-0.redhat.local", "dst_node":"computer660-66.redhat.local"}]' -p parallel_total=2500 -p parallel_per_node=100 -p with_attached_volume=true --name trial-0-to-buffer-66-test
      
      // Note: Although large values (2500 and 100) are passed to parallel_total and parallel_per_node, I do not believe they are the source of this performance degradation. They only set hard limits on the number of actions that can be part of an action plan.
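Since the plan is to run multiple audits per batch, the compute_nodes parameter could be generated rather than typed by hand. The following is a minimal sketch, assuming the parameter accepts a JSON list of src_node/dst_node pairs in the format used in the command above; build_compute_nodes is a hypothetical helper, not part of the Watcher CLI.

```shell
# Hypothetical helper: turn "src:dst" arguments into the JSON list
# format passed to -p compute_nodes in the audit command above.
build_compute_nodes() {
  out="["
  sep=""
  for pair in "$@"; do
    # Text before the first ':' is the source node, the rest is the destination.
    out="${out}${sep}{\"src_node\":\"${pair%%:*}\", \"dst_node\":\"${pair#*:}\"}"
    sep=", "
  done
  echo "${out}]"
}

# Reproduces the single pair used in this report.
build_compute_nodes "computer660-0.redhat.local:computer660-66.redhat.local"
```

The output can then be passed as -p compute_nodes="$(build_compute_nodes ...)" when creating an audit.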
      

       

      Action plan:

      sh-5.1$ openstack optimize actionplan list | grep -i 53cbeda6-c4c4-47e0-8394-256a0549bde1
      | 53cbeda6-c4c4-47e0-8394-256a0549bde1 | b256429f-f2fe-457b-a164-6d529a7f3cd4 | SUCCEEDED | 2026-01-17T12:19:35+00:00 | Live_instance_migrate_ratio: 100.00 % | 
      sh-5.1$ 

       

      Actions list:

      sh-5.1$ openstack optimize action list --action-plan 53cbeda6-c4c4-47e0-8394-256a0549bde1 | head -7
      +--------------------------------------+----------------------------------------------------------------------------------+-----------+--------------------------------------+---------+
      | UUID                                 | Parents                                                                          | State     | Action Plan                          | Action  |
      +--------------------------------------+----------------------------------------------------------------------------------+-----------+--------------------------------------+---------+
      | 4949bdf3-4cdd-44da-95e5-8ffe2fd1427b | []                                                                               | SUCCEEDED | 53cbeda6-c4c4-47e0-8394-256a0549bde1 | migrate |
      | 3973b9a4-f3f7-45a7-84fe-500972e1b6c5 | []                                                                               | SUCCEEDED | 53cbeda6-c4c4-47e0-8394-256a0549bde1 | migrate |
      | 98cd7236-1b62-4934-89b4-0c218af880d0 | ['4949bdf3-4cdd-44da-95e5-8ffe2fd1427b', '3973b9a4-f3f7-45a7-84fe-500972e1b6c5'] | SUCCEEDED | 53cbeda6-c4c4-47e0-8394-256a0549bde1 | migrate |
      | f3e1680e-0645-46bc-a974-85bcb2dfa72a | ['4949bdf3-4cdd-44da-95e5-8ffe2fd1427b', '3973b9a4-f3f7-45a7-84fe-500972e1b6c5'] | SUCCEEDED | 53cbeda6-c4c4-47e0-8394-256a0549bde1 | migrate |
      
      
      sh-5.1$ openstack optimize action list --action-plan 53cbeda6-c4c4-47e0-8394-256a0549bde1 | grep -i succeeded | wc -l
      52
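For larger action plans, counting every state in one pass may be more useful than grepping for a single state. A minimal sketch follows; tally_states is a hypothetical helper, and the usage line assumes the standard -f value -c State output options of the openstack client with the State column shown in the table above.

```shell
# Hypothetical helper: read one state per line on stdin and
# print "STATE COUNT" for each distinct state, sorted by state name.
tally_states() {
  sort | uniq -c | awk '{ print $2, $1 }'
}

# Usage against a live deployment (not run here):
#   openstack optimize action list --action-plan 53cbeda6-c4c4-47e0-8394-256a0549bde1 \
#     -f value -c State | tally_states

# Offline demonstration with made-up input:
printf 'SUCCEEDED\nSUCCEEDED\nFAILED\n' | tally_states
```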

      In this audit/action plan, VMs are migrated from the source compute node "computer660-0.redhat.local" to the destination "computer660-66.redhat.local".
      Number of VMs migrated: 52.
      Time taken to complete all actions in the action plan: ~14457 seconds.

      [root@e18-h18-000-r660 ~]# PLAN_UUID=53cbeda6-c4c4-47e0-8394-256a0549bde1
      [root@e18-h18-000-r660 ~]# START=$(oc exec -n openstack openstackclient -- openstack optimize actionplan show $PLAN_UUID -f value -c "Created At")
      [root@e18-h18-000-r660 ~]# END=$(oc exec -n openstack openstackclient -- openstack optimize actionplan show $PLAN_UUID -f value -c "Updated At")
      [root@e18-h18-000-r660 ~]# ELAPSED=$(( $(date -d "$END" +%s) - $(date -d "$START" +%s) ))
      [root@e18-h18-000-r660 ~]# echo $ELAPSED
      14457 
      
      // A separate script that monitors this action plan reported 14256 seconds.
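To put the elapsed time in per-migration terms, the arithmetic can be sketched as follows. The figures are taken from this report; avg_per_vm and fmt_hms are hypothetical helper names, and integer division rounds down.

```shell
# Figures from this report.
elapsed=14457   # seconds between "Created At" and "Updated At"
vms=52          # actions that reached SUCCEEDED

# Hypothetical helper: average seconds per migration (integer division).
avg_per_vm() { echo $(( $1 / $2 )); }

# Hypothetical helper: format a number of seconds as HH:MM:SS.
fmt_hms() {
  printf '%02d:%02d:%02d\n' $(( $1 / 3600 )) $(( $1 % 3600 / 60 )) $(( $1 % 60 ))
}

echo "total:    $(fmt_hms "$elapsed")"
echo "observed: $(avg_per_vm "$elapsed" "$vms") s per VM"
# 1800 s and 3600 s are the 30-60 min serial Nova CLI estimate.
echo "expected: $(avg_per_vm 1800 "$vms")-$(avg_per_vm 3600 "$vms") s per VM"
```

That works out to roughly 278 s per migration against an expected 34 to 69 s, about 4 to 8 times slower than the serial Nova CLI estimate.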

       

      CPU and memory usage of Watcher during and after the migration:

      // During migration, watcher-applier uses ~1 CPU core
      [root@e18-h18-000-r660 ~]# oc adm top pod | grep -i watcher
      watcher-api-0                   14m          292Mi           
      watcher-applier-0               997m         251Mi           
      watcher-decision-engine-0       71m          237Mi           
      
      // After migration
      [root@e18-h18-000-r660 ~]# oc adm top pod | grep -i watcher
      watcher-api-0                   3m           292Mi           
      watcher-applier-0               2m           251Mi           
      watcher-decision-engine-0       1m           235Mi 

      RHOSO version:

      [root@e18-h18-000-r660 ~]# oc get openstackversion
      NAME        TARGET VERSION            AVAILABLE VERSION         DEPLOYED VERSION
      openstack   18.0.15-20251126.192455   18.0.15-20251126.192455   18.0.15-20251126.192455 

      Watcher status:

      [root@e18-h18-000-r660 ~]# oc get watcher
      NAME      STATUS   MESSAGE
      watcher   True     Setup complete 

      I’ll share additional information and logs in the comments to keep this description concise.

       

       

       

              amoralej1@redhat.com Alfredo Moralejo Alonso
              rpulapak@redhat.com Rajesh Pulapakula
              rhos-workloads-evolution