Type: Epic
Resolution: Unresolved
Priority: Major
Fix Version/s: None
Affects Version/s: None
Component/s: test-operator
Labels:
- adoption

Epic Name:
Testing: Validate that network datapath is not affected during RHOSO 17->18 adoption
Blocked:
False
Blocked Reason:

None

Ready:
False
Color Status:
Not Selected
Dev Approval:
Proposed
Docs Approval:
No Docs Impact
Epic Status:
To Do
PM Approval:
Proposed
QE Approval:
Proposed
Hierarchy Progress Bar:

100% To Do, 0% In Progress, 0% Done
Intelligence Requested:
Market:

Target Version:

rhos-18.0 Feature Release 2 (Mar 2025)

SFDC Cases Links:
SFDC Cases Counter:
SFDC Cases Open:

During the RHOSO 18 adoption process, workloads stay in place on RHEL nodes while the control plane is moved to OpenShift pods. It's important that network connectivity for the workloads is not affected during this procedure.

We currently lack automated test coverage for this. This Epic is going to address that gap.

The general plan here is to use test-operator and its tobiko framework integration for the job. The plan would consist of:

1. Initial state: OSP17.
2. Deploy OpenShift.
3. Deploy test-operator and configure it for tobiko. Make sure that it does not require any OpenStack CP services!
4. Execute the "first" part of the tobiko scenarios, which prepares the necessary OSP resources and starts any needed commands in VMs or elsewhere. These commands monitor the datapath during the whole process.
5. Proceed with adoption.
6. Complete the test-operator tobiko execution. This includes going back to the previously prepared resources and commands, collecting their logs, and making sure that the results are in line with what we promise to customers. (No downtime.)
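The plan above relies on monitoring commands that keep running after the step that started them has returned. A rough sketch of how that could look on a POSIX host, using only the Python standard library. This is not tobiko code; `start_monitor` and `stop_monitor` are hypothetical helper names made up for illustration.

```python
import os
import signal
import subprocess
from pathlib import Path


def start_monitor(cmd: list[str], log_path: Path) -> int:
    """Start a long-running monitoring command detached from the caller.

    Output is appended to log_path so it can be inspected after adoption.
    Returns the PID so that a later step can stop the command.
    """
    with open(log_path, "ab") as log:
        proc = subprocess.Popen(
            cmd,
            stdout=log,
            stderr=subprocess.STDOUT,
            start_new_session=True,  # POSIX only: detach from our session
        )
    return proc.pid


def stop_monitor(pid: int) -> None:
    """Ask the monitoring command to terminate; ignore it if already gone."""
    try:
        os.kill(pid, signal.SIGTERM)
    except ProcessLookupError:
        pass
```

In a real run the PID and log path would have to be stored somewhere that survives until the final step, since that step runs in a different process.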

Unlike tempest, tobiko allows a scenario to be split into two parts. That's why this framework is being considered here: test resources and processes can be started before adoption is triggered, and the results collected afterwards.
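To illustrate the two-part idea only (this is not how tobiko implements it), here is a minimal sketch in which the first part writes a small JSON state file and the second part reads it back. The function names and the state file layout are assumptions made up for this example.

```python
import json
from pathlib import Path


def prepare_phase(state_file: Path, resources: dict) -> None:
    """Part 1 (before adoption): record what was created and started."""
    state_file.write_text(json.dumps(resources))


def verify_phase(state_file: Path) -> dict:
    """Part 2 (after adoption): reload the state recorded by part 1."""
    if not state_file.exists():
        raise RuntimeError("prepare phase did not run; nothing to verify")
    return json.loads(state_file.read_text())
```

The second part would then use the recorded identifiers (VM names, log locations and so on) to find the resources again and evaluate the collected logs.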

While the framework allows for these test scenarios, I don't think we ever added any that would do that. Some work may be needed in the framework to make the idea actually work.

There are two major classes of tests we would like to execute:

- East-West connectivity. This implies starting multiple VMs and running monitoring commands inside them through ssh or otherwise.
- North-South connectivity. This implies connecting to one or several VMs from outside the cluster. It probably means that tobiko will have to run a monitoring process / protocol peer command outside the OpenStack cluster. (It may make sense to run it as a pod in the OpenShift cluster, but that is open to discussion.)

For the former, the existing tobiko implementation should be enough. (It is only a matter of adding new scenarios.) For the latter, tobiko will have to be expanded to support running background processes in pods. We may also have to build an image that contains the test tools we'd like to use.
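For either class, one simple monitoring command is a continuous `ping` whose output is checked afterwards. A minimal sketch of that check, assuming Linux iputils-style output with `icmp_seq=N` in each reply line; `missing_sequences` is a hypothetical helper, not an existing tobiko function.

```python
import re

# Matches the sequence number in lines such as:
#   64 bytes from 10.0.0.5: icmp_seq=3 ttl=64 time=0.41 ms
_SEQ_RE = re.compile(r"icmp_seq=(\d+)")


def missing_sequences(ping_output: str) -> list[int]:
    """Return sequence numbers absent between the first and last reply seen."""
    seen = {int(seq) for seq in _SEQ_RE.findall(ping_output)}
    if not seen:
        return []
    return [n for n in range(min(seen), max(seen) + 1) if n not in seen]
```

With a fixed ping interval, the number of consecutive missing sequence numbers gives a rough estimate of the outage length. Replies lost after the last one received are not detected, and sequence number wrap-around is ignored in this sketch.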

Some ideas for tests or areas of concern that could be covered (copied from a Slack discussion; can be expanded during refinement):

- high-load long-running connections (iperf3 udp/tcp) - both N-S and E-W
- conntrack behavior
  - whether new connections are accepted
  - whether established connections are not dropped
    - the latter may be implicitly covered with iperf test
- DNS/DHCP
- IPv6 Neighbour Advertisements not affected
  - point of concern - implemented as controller() action upcalling to ovn-controller
- metadata is working
  - curl in a loop from inside the VM
  - (ipv6 too?)
- Octavia LB health checks
  - unsure, need to reach out to VANS folks for guidance and to check if they are interested
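For the iperf3 item, one option is to run the client with JSON output and look for reporting intervals with no throughput. A sketch under the assumption that the report has the usual `intervals[].sum` layout with `start`, `end` and `bits_per_second` fields; `stalled_intervals` is a hypothetical helper name.

```python
def stalled_intervals(report: dict, min_bps: float = 1.0) -> list[tuple[float, float]]:
    """Return (start, end) times of intervals whose throughput fell below min_bps.

    `report` is the parsed JSON document produced by the iperf3 client.
    """
    stalled = []
    for interval in report.get("intervals", []):
        summary = interval["sum"]
        if summary["bits_per_second"] < min_bps:
            stalled.append((summary["start"], summary["end"]))
    return stalled
```

An empty result over a run that spans the whole adoption would suggest that the established connection was never stalled for a full reporting interval. Shorter stalls would not be visible this way.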

Some info can be found in this Slack thread: https://redhat-internal.slack.com/archives/C046JULBVJ7/p1725977625811099?thread_ts=1725970634.193629&cid=C046JULBVJ7

UPD: another consideration is the effect on N-S connectivity for VMs of a router failover that may be triggered by an OVS/OVN restart. This consideration is not unique to the adoption / update flow, since router failovers may happen for any number of reasons. We have some whitebox neutron plugin tests that trigger a failover and then validate that the network path is updated, but these tests do not enforce any SLA on how quickly the switch happens. See: https://opendev.org/x/whitebox-neutron-tempest-plugin/src/commit/ea8a27a475ab9a216fe2708ceb5c99c2535a0a2c/whitebox_neutron_tempest_plugin/tests/scenario/test_l3ha_ovn.py The test case may need to be expanded to enforce some reasonable downtime SLA.
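Enforcing a downtime SLA could be as simple as computing the longest outage from timestamped probe results and comparing it to a threshold. A sketch, not taken from the whitebox plugin; the helper names and the sample format `(timestamp_seconds, probe_succeeded)` are assumptions for illustration, and the actual threshold is still to be decided.

```python
def longest_outage(samples: list[tuple[float, bool]]) -> float:
    """Return the longest time between a first failed probe and the next success.

    `samples` must be ordered by timestamp. An outage still in progress at the
    end is measured up to the last sample, so it is only a lower bound.
    """
    longest = 0.0
    outage_start = None
    for timestamp, ok in samples:
        if not ok and outage_start is None:
            outage_start = timestamp
        elif ok and outage_start is not None:
            longest = max(longest, timestamp - outage_start)
            outage_start = None
    if outage_start is not None:
        longest = max(longest, samples[-1][0] - outage_start)
    return longest


def assert_downtime_sla(samples: list[tuple[float, bool]], max_outage: float) -> None:
    """Fail if the longest observed outage exceeds max_outage seconds."""
    outage = longest_outage(samples)
    if outage > max_outage:
        raise AssertionError(f"outage of {outage:.1f}s exceeds SLA of {max_outage:.1f}s")
```

The measurement resolution is bounded by the probe interval, so the probes need to run noticeably more often than the SLA value being enforced.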

Assignee: Ihar Hrachyshka

Reporter: Ihar Hrachyshka

Team: rhos-dfg-networking-squad-neutron

Votes: 0

Watchers: 3

Created: 2024/09/25 5:50 PM

Updated: 2024/11/20 2:57 PM
