Epic
Testing: Validate that network datapath is not affected during RHOSO 17->18 adoption
During the RHOSO 18 adoption process, workloads stay in place on RHEL nodes while the control plane is moved to OpenShift pods. It is important that network connectivity for the workloads is not affected during this procedure.
We lack automated test coverage for this; this Epic is going to address it.
The general plan is to use test-operator and its tobiko framework integration for the job. The plan comprises:
- Initial state: OSP17.
- Deploy OpenShift.
- Deploy test-operator. Configure it for tobiko. Make sure that no OpenStack CP services are required by it!
- Execute the "first" part of the tobiko scenarios, which will prepare the necessary OSP resources and start any needed commands in VMs or elsewhere. These commands will monitor the datapath during the whole process.
- Proceed with adoption.
- Complete the test-operator tobiko execution. This includes reaching back to the previously prepared resources and commands, collecting the logs they produced, and making sure that the results are in line with what we promise to customers (no downtime).
Tobiko allows splitting scenarios into two phases (in contrast to tempest), which is why this framework is being considered here: it can start test resources and processes before adoption is triggered and collect the results afterwards.
While the framework allows for such test scenarios, I don't think we ever added any that would do that. Some work may be needed in the framework to make the idea actually work.
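To illustrate the two-phase split described above, here is a minimal sketch of how a pre-adoption phase could persist state for a post-adoption phase to verify. This is not tobiko API code; all function names, the state-file location, and the log path are hypothetical placeholders.

```python
import json
import os
import tempfile
import time

# Hypothetical location for handing state between the two phases.
STATE_FILE = os.path.join(tempfile.gettempdir(), "adoption_test_state.json")

def phase_pre_adoption():
    """Phase 1: create test resources and record what was started.

    In a real tobiko scenario this would boot VMs and launch background
    monitoring commands; here we only persist enough state for phase 2
    to find them again after adoption completes.
    """
    state = {
        "started_at": time.time(),
        "monitor_log": "/var/tmp/datapath_monitor.log",  # hypothetical path
        "vm_ids": ["vm-a", "vm-b"],                      # placeholder IDs
    }
    with open(STATE_FILE, "w") as f:
        json.dump(state, f)
    return state

def phase_post_adoption():
    """Phase 2: reload the recorded state and verify the monitors' results."""
    with open(STATE_FILE) as f:
        state = json.load(f)
    # A real check would parse the monitor log for packet loss; here we
    # only confirm the state survived between the two phases.
    assert state["vm_ids"], "no VMs were recorded by the pre-adoption phase"
    return state
```

The key design point is that nothing in phase 2 depends on in-process state from phase 1: everything it needs is re-discovered from persisted data, which is exactly what the adoption window in between requires.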
There are two major classes of tests we would like to execute:
- East-West connectivity. This implies starting multiple VMs and running monitoring commands inside them through SSH or otherwise.
- North-South connectivity. This implies connecting to one or several VMs from outside the cluster. It probably means that tobiko will have to run a monitoring process / protocol peer command outside the OpenStack cluster. (It may make sense to do that as a pod in the OpenShift cluster, but that may be subject to discussion.)
For the former, the existing tobiko implementation should be enough (only a matter of adding new scenarios). For the latter, tobiko will have to be expanded to support running background processes in pods. We may also have to deal with building an image that contains the test tools we'd like to use.
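Whichever side the monitor runs on (inside a VM for E-W, in an external pod for N-S), the post-adoption check reduces to the same question: what was the longest continuous outage the probe observed? A small sketch of that computation, assuming the monitor emits one timestamped success/failure sample per probe:

```python
def longest_outage(samples):
    """Return the longest continuous span of failed probes, in seconds.

    `samples` is a list of (timestamp, ok) tuples produced by a
    monitoring loop (e.g. one ping per second); timestamps are
    monotonic seconds. A trailing run of failures counts as an outage
    lasting until the final sample.
    """
    worst = 0.0
    outage_start = None
    for ts, ok in samples:
        if not ok and outage_start is None:
            outage_start = ts          # outage begins
        elif ok and outage_start is not None:
            worst = max(worst, ts - outage_start)  # outage ends
            outage_start = None
    if outage_start is not None and samples:
        worst = max(worst, samples[-1][0] - outage_start)
    return worst
```

For example, `longest_outage([(0, True), (1, False), (2, False), (3, True)])` reports a 2-second outage; a "no downtime" scenario would assert the result stays at (or very near) zero.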
Some ideas for tests or areas of concern that could be covered (copied from a Slack discussion; can be expanded during refinement):
- highload long running connections (iperf3 udp/tcp) - both N-S and E-W
- conntrack behavior
- whether new connections are accepted
- whether established connections are not dropped
- the latter may be implicitly covered by the iperf test
- DNS/DHCP
- IPv6 Neighbour Advertisements not affected
- point of concern - implemented as controller() action upcalling to ovn-controller
- metadata is working
- curl in a loop from inside the VM
- (ipv6 too?)
- Octavia LB health checks
- unsure, need to reach out to VANS folks for guidance and to check if they are interested
Some info can be found in this Slack thread: https://redhat-internal.slack.com/archives/C046JULBVJ7/p1725977625811099?thread_ts=1725970634.193629&cid=C046JULBVJ7
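For the iperf3-based items above, one concrete pass/fail signal is whether any reporting interval saw zero throughput. A sketch of that check, assuming the standard `iperf3 --json` report layout (an `intervals` array whose entries carry a `sum` object with a `bytes` counter); the embedded report is illustrative, not real iperf3 output:

```python
import json

def zero_throughput_intervals(report_json):
    """Count iperf3 report intervals in which no bytes were transferred.

    Assumes the standard `iperf3 --json` layout: each entry of the
    `intervals` array has a `sum` object with a `bytes` counter.
    """
    report = json.loads(report_json)
    return sum(1 for i in report["intervals"] if i["sum"]["bytes"] == 0)

# Illustrative report fragment, not captured from a real run:
sample = json.dumps({
    "intervals": [
        {"sum": {"bytes": 120000, "seconds": 1.0}},
        {"sum": {"bytes": 0, "seconds": 1.0}},       # a stalled second
        {"sum": {"bytes": 118000, "seconds": 1.0}},
    ]
})
print(zero_throughput_intervals(sample))  # → 1
```

A strict "no downtime" assertion would require the count to be zero for both the N-S and E-W runs; a relaxed SLA could tolerate a bounded number of stalled intervals.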
UPD: another consideration is the effect of a router failover, which may be triggered by an OVS/OVN restart, on N-S connectivity for VMs. This consideration is not unique to the adoption / update flow, since router failovers may happen for any number of reasons. We have some whitebox neutron plugin tests that trigger a failover and then validate that the network path is updated, but these tests do not enforce any SLA on how quickly the switchover happens. See: https://opendev.org/x/whitebox-neutron-tempest-plugin/src/commit/ea8a27a475ab9a216fe2708ceb5c99c2535a0a2c/whitebox_neutron_tempest_plugin/tests/scenario/test_l3ha_ovn.py The test case may need to be expanded to enforce a reasonable downtime SLA.
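Enforcing a downtime SLA on failover could look roughly like the sketch below. It is not based on the whitebox plugin's API: the `trigger` and `probe` callables are hypothetical injection points (e.g. stopping the active gateway chassis, and a single ping) so the measurement logic stays independent of the environment.

```python
import time

def measure_failover_downtime(trigger, probe, sla_seconds,
                              timeout=60.0, interval=0.1):
    """Trigger a failover and measure how long connectivity is lost.

    `trigger` initiates the failover; `probe` returns True while the
    datapath works. Raises AssertionError if recovery takes longer
    than `sla_seconds`, or if connectivity never returns within
    `timeout`.
    """
    trigger()
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        if probe():
            downtime = time.monotonic() - start
            assert downtime <= sla_seconds, (
                f"failover downtime {downtime:.1f}s exceeds SLA "
                f"{sla_seconds}s")
            return downtime
        time.sleep(interval)
    raise AssertionError("datapath never recovered within timeout")
```

Picking the actual `sla_seconds` value is the open question flagged above; the mechanism only makes whatever number we agree on enforceable.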