-
Epic
-
Resolution: Unresolved
[COLLABORATION] Stabilize Multi-Arch Live Migration Testing
-
Critical
-
rhel-virt-core-libvirt-1
We are requesting cross-team collaboration to address significant instability in our live migration test suites on non-x86 architectures. Currently, s390x tests are highly brittle. While our primary data concerns s390x, initial observations of ARM test jobs (specifically the migration_modular_X series) suggest they may face the same issues.
The goal of this Epic is to partner with the teams that are responsible for live migration and hardware enablement to move away from timing-dependent tests toward a robust, event-driven framework that works reliably across all platforms.
Problem Statement
The libvirt CI test jobs for migration (priority_2, priority_3) were recently removed from the libvirt QE automation scope. To keep what we currently believe is the needed coverage of live migration features on s390x, our team (s390x) had to enable the migration_modular_X jobs (X = 1..5; 558 test cases as of today). This has revealed many timing issues: tests fail that would pass if their parameters were tweaked. However, tweaking these parameters doesn't seem appropriate, because it leads to repeated readjustments that are driven not by actual expectations on the product but by environmental conditions that are currently uncontrolled or hard to control. Each run takes several minutes, making manual "tweak-and-confirm" workflows inefficient for the team.
Note: It seems priority_2 and priority_3 were removed because the migration feature area in libvirt QE moved to a new test plan and test implementation as a result of the Modular test case design project (more material in the drive folder).
Potential Areas for Investigation
These are not prescribed solutions, but rather starting points for a joint technical discussion:
Environment Standardization: Could the standardized environments used for x86_64 (e.g., ci-shared-datas) be adapted for s390x and ARM to provide a consistent baseline? Does it make sense to monitor network passthrough and other parameters that might have an impact?
Guest-Specific Scaling: Do certain guest characteristics (RAM size, etc.) require architecture-specific handling in the test framework? Should we normalize our VM characteristics (domain XML, storage size)?
Beyond Static Timing: With at least 106 files using hardcoded timeouts and 26 using sleep, we could investigate whether these can be replaced with "wait_for" patterns or event-driven triggers (quick and dirty statistics from an LLM query on the tp-libvirt/migration test suite). For example, the framework uses static sleep times that were thought to have been fixed in this PR, but another CI run showed that the timing can still differ. Another example is this PR.
Dynamic Bandwidth Management: Adjusting the --bandwidth parameter, such as in this PR, has been our approach in the past, but it is unreliable, for example due to:
- CI load variance: during CTC the s390x test environments are likely heavily overcommitted; also, the load that a full migration test job puts on the host system doesn't seem to be the same as the load of executing a single test in the same setup.
- Conflicts with specific test scenario settings: some tests need the --bandwidth parameter to control certain scenarios, such as the switch to postcopy, so the bandwidth tweak can't be applied globally.
Strategy Adjustments: Is the execution of all those jobs required, or can this problem be solved differently?
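As a starting point for the "Beyond Static Timing" discussion, the difference between a fixed sleep and a "wait_for" pattern can be sketched as below. This is a minimal, standard-library-only illustration and not the framework's actual helper: the `wait_for` function, the `FakeMigration` class, and all timing values are hypothetical stand-ins. The idea is that the timeout becomes an upper bound that only matters on failure, while a fast environment returns as soon as the condition holds.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def wait_for(condition: Callable[[], Optional[T]],
             timeout: float,
             step: float = 0.5) -> Optional[T]:
    """Poll condition() until it returns a truthy value or timeout expires.

    Returns the truthy value, or None if the deadline passed first.
    Hypothetical helper written for this sketch only.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return None
        # Never sleep past the deadline.
        time.sleep(min(step, remaining))


class FakeMigration:
    """Stand-in for a migration job whose duration varies per environment."""

    def __init__(self, duration: float) -> None:
        self._done_at = time.monotonic() + duration

    def is_complete(self) -> bool:
        return time.monotonic() >= self._done_at


if __name__ == "__main__":
    job = FakeMigration(duration=0.2)
    # Instead of time.sleep(30) followed by a check, wait only as long as needed.
    finished = wait_for(job.is_complete, timeout=5.0, step=0.05)
    print("migration finished" if finished else "timed out")
```

With this shape, a slower host only makes the test take longer; it does not require anyone to readjust a sleep value.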
Current Statistics (Pain Points)
To illustrate the scope of the refactoring challenge, a search of the codebase reveals the following (quick and dirty statistics from an LLM query on the tp-libvirt/migration test suite):
108 unique files contain at least one hardcoded timing or bandwidth parameter.
44+ files use migrate_speed or other bandwidth limiters.
106+ files rely on timeout parameters.
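Because the figures above come from an LLM query, a small script could make them reproducible. The following is a rough sketch under assumptions, not the method used to obtain the numbers: the regular expressions, the file suffixes (`.py`, `.cfg`), and the function name `count_files` are all illustrative choices, and different patterns will yield different counts.

```python
import re
from pathlib import Path

# Assumed patterns; adjust to whatever the team agrees counts as a
# "hardcoded timing or bandwidth parameter".
PATTERNS = {
    "sleep": re.compile(r"\bsleep\b"),
    "timeout": re.compile(r"timeout", re.IGNORECASE),
    "bandwidth": re.compile(r"migrate_speed|bandwidth", re.IGNORECASE),
}


def count_files(root, suffixes=(".py", ".cfg")):
    """Count files under root that match each pattern at least once.

    The "any" key holds the number of unique files matching any pattern.
    """
    hits = {name: set() for name in PATTERNS}
    for path in Path(root).rglob("*"):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for name, pattern in PATTERNS.items():
            if pattern.search(text):
                hits[name].add(path)
    counts = {name: len(paths) for name, paths in hits.items()}
    counts["any"] = len(set().union(*hits.values()))
    return counts
```

Calling `count_files()` with the path to a local checkout of the migration test directory returns a dictionary of per-pattern file counts, which could be tracked over time as tests are refactored.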
Desired Collaborative Outcomes
- Shared Knowledge: Understanding why non-x86 architectures behave differently during the memory/networking migration phases.
- Universal CI Jobs: Stable execution of migration_modular_1 through 5 across x86_64, s390x, and ARM.
- Modernized Test Patterns: A path forward to remove brittle sleep, timeout, and bandwidth dependencies.
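One possible shape for an event-driven test pattern is sketched below, purely to seed the discussion. It uses only the Python standard library; `MigrationWatcher`, `on_job_completed`, and `simulate_migration` are hypothetical names, and the timer stands in for whatever completion notification the real framework or hypervisor would deliver. The test blocks on an event rather than sleeping for a guessed duration, and the ceiling passed to `wait` only acts as a safety net.

```python
import threading


class MigrationWatcher:
    """Collects a completion notification and lets a test block on it."""

    def __init__(self) -> None:
        self._finished = threading.Event()
        self.outcome = None

    def on_job_completed(self, outcome: str) -> None:
        # Hypothetical callback: in a real test this would be invoked by
        # the framework's event source when the migration job ends.
        self.outcome = outcome
        self._finished.set()

    def wait(self, ceiling: float) -> bool:
        """Return True if the event arrived within ceiling seconds."""
        return self._finished.wait(ceiling)


def simulate_migration(watcher: MigrationWatcher, duration: float) -> threading.Timer:
    """Stand-in for a real migration: fires the callback after duration."""
    timer = threading.Timer(duration, watcher.on_job_completed, args=("completed",))
    timer.start()
    return timer


if __name__ == "__main__":
    watcher = MigrationWatcher()
    timer = simulate_migration(watcher, duration=0.2)
    if watcher.wait(ceiling=10.0):
        print("migration", watcher.outcome)
    else:
        print("no completion event within the ceiling")
    timer.join()
```

Compared with polling, this approach reacts immediately to completion and needs no step interval, at the cost of requiring a reliable event source on every architecture.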
- is related to: RHEL-75344 Migration with ssh transport sometimes fails early with Input/output error (New)