-
Bug
-
Resolution: Not a Bug
-
Normal
-
None
-
openshift-4.14.z
Description of problem:
While setting up bidirectional traffic with TRex from a bare-metal host to the TestPMD server, which runs in a pod on an SRIOV worker VM hosted on a Compstack SRIOV compute node, we hit an issue: despite proper configuration, the TestPMD application in the pod did not process any traffic. The TestPMD port statistics show zero RX and TX packets on both ports:
2024-01-31 09:35:57.755716 UTC|Port statistics ====================================
2024-01-31 09:35:57.755729 UTC| ######################## NIC statistics for port 0 ########################
2024-01-31 09:35:57.755738 UTC| RX-packets: 0 RX-missed: 0 RX-bytes: 0
2024-01-31 09:35:57.755747 UTC| RX-errors: 0
2024-01-31 09:35:57.755755 UTC| RX-nombuf: 0
2024-01-31 09:35:57.755764 UTC| TX-packets: 0 TX-errors: 0 TX-bytes: 0
2024-01-31 09:35:57.755772 UTC|
2024-01-31 09:35:57.755780 UTC| Throughput (since last show)
2024-01-31 09:35:57.755788 UTC| Rx-pps: 0 Rx-bps: 0
2024-01-31 09:35:57.755796 UTC| Tx-pps: 0 Tx-bps: 0
2024-01-31 09:35:57.755846 UTC| ############################################################################
2024-01-31 09:35:57.755856 UTC|
2024-01-31 09:35:57.755864 UTC| ######################## NIC statistics for port 1 ########################
2024-01-31 09:35:57.755872 UTC| RX-packets: 0 RX-missed: 0 RX-bytes: 0
2024-01-31 09:35:57.755880 UTC| RX-errors: 0
2024-01-31 09:35:57.755889 UTC| RX-nombuf: 0
2024-01-31 09:35:57.755897 UTC| TX-packets: 0 TX-errors: 0 TX-bytes: 0
2024-01-31 09:35:57.755905 UTC|
2024-01-31 09:35:57.755913 UTC| Throughput (since last show)
2024-01-31 09:35:57.755922 UTC| Rx-pps: 0 Rx-bps: 0
2024-01-31 09:35:57.755930 UTC| Tx-pps: 0 Tx-bps: 0
2024-01-31 09:35:57.755938 UTC| ############################################################################
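When checking longer TestPMD logs for any sign of traffic, a small filter can help. The following is a minimal sketch, assuming the log uses the "NAME: value" counter layout shown in the statistics above; the function name `nonzero_counters` is hypothetical and not part of TestPMD or Crucible.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read TestPMD port statistics on stdin and print only
# the RX-*/TX-* counters whose value is non-zero. Prints nothing when every
# counter is zero, as in the statistics captured in this report.
nonzero_counters() {
  # Extract each "RX-xxx: N" / "Tx-xxx: N" pair onto its own line,
  # then keep the pairs whose numeric value is not 0.
  { grep -oiE '[rt]x-[a-z]+: *[0-9]+' || true; } | awk '$NF + 0 != 0'
}
```

For example, piping the TestPMD output of the server pod through `nonzero_counters` would print nothing for the run described here, and would list counters such as `RX-packets` once traffic reaches the pod.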
Version-Release number of selected component (if applicable):
OCP 4.14.0
RHOSP 17.1.1
Network Topology: https://docs.google.com/drawings/d/1c1pqYlI3odemKN3JCoTVsTUJC9dx-272jJc80_rqDH8/edit?usp=sharing
How reproducible: Fully reproducible in Perf NFV lab and Perf Scale lab.
Steps to Reproduce:
1. Deployed OpenStack 17.1 with SR-IOV compute nodes and validated the normal VM testpmd scenario, which passed the sanity test.
2. Deployed an OCP 4.14.0 cluster that includes an SRIOV worker VM.
$ openstack server list --all
+--------------------------------------+----------------------------------------+--------+--------------------------------------------------------------------------------+-------------------------------+------------+
| ID                                   | Name                                   | Status | Networks                                                                       | Image                         | Flavor     |
+--------------------------------------+----------------------------------------+--------+--------------------------------------------------------------------------------+-------------------------------+------------+
| 31cbd348-93b3-4a3c-b005-f39d0fee2757 | x7g3jgblorhocpnfv-rqwzh-worker-0-bt8cv | ACTIVE | management=192.168.0.63; provider-1=192.168.177.79; provider-2=192.168.178.249 | x7g3jgblorhocpnfv-rqwzh-rhcos | sos-worker |
| 00d9fd9f-316b-41f4-94d0-fa004444215f | x7g3jgblorhocpnfv-rqwzh-master-2       | ACTIVE | management=192.168.0.68                                                        | x7g3jgblorhocpnfv-rqwzh-rhcos | sos-master |
| 19999101-9a6a-4f18-949b-81fea28d9505 | x7g3jgblorhocpnfv-rqwzh-master-1       | ACTIVE | management=192.168.0.91                                                        | x7g3jgblorhocpnfv-rqwzh-rhcos | sos-master |
| b552cc4b-e4ff-4621-bbd1-9a4125739fa5 | x7g3jgblorhocpnfv-rqwzh-master-0       | ACTIVE | management=192.168.0.65                                                        | x7g3jgblorhocpnfv-rqwzh-rhcos | sos-master |
+--------------------------------------+----------------------------------------+--------+--------------------------------------------------------------------------------+-------------------------------+------------+
3. Port security and the security groups on the SRIOV ports for networks provider-1 and provider-2 were manually disabled. Both are enabled by default when the worker VM is provisioned. With them left enabled, the testpmd pods did not process any traffic, so we decided to disable them.
$ openstack port list --server 31cbd348-93b3-4a3c-b005-f39d0fee2757
+--------------------------------------+--------------------------------------------------+-------------------+--------------------------------------------------------------------------------+--------+
| ID                                   | Name                                             | MAC Address       | Fixed IP Addresses                                                             | Status |
+--------------------------------------+--------------------------------------------------+-------------------+--------------------------------------------------------------------------------+--------+
| 06a11995-3e72-456a-af66-568049a7d619 | x7g3jgblorhocpnfv-rqwzh-worker-0-bt8cv-provider1 | fa:16:3e:e2:37:4f | ip_address='192.168.177.79', subnet_id='fa92ca2c-c4d6-4cd3-a736-ca1572c7cc7a'  | ACTIVE |
| 082e613f-64e9-40da-9622-956d6bd9309c | x7g3jgblorhocpnfv-rqwzh-worker-0-bt8cv-provider2 | fa:16:3e:11:3b:0c | ip_address='192.168.178.249', subnet_id='b28950e5-575f-470c-ab41-d8e998a7f19b' | ACTIVE |
| bfd789b6-273f-45f9-afbe-2006242a2142 | x7g3jgblorhocpnfv-rqwzh-worker-0-bt8cv-0         | fa:16:3e:5b:a4:e8 | ip_address='192.168.0.63', subnet_id='df7eebe1-89e7-4dd5-aaef-6d56ab513e4c'    | ACTIVE |
+--------------------------------------+--------------------------------------------------+-------------------+--------------------------------------------------------------------------------+--------+

$ openstack port show --fit-width 06a11995-3e72-456a-af66-568049a7d619
+-------------------------+---------------------------------------------------------------------------------------------------------------------------+
| Field                   | Value                                                                                                                     |
+-------------------------+---------------------------------------------------------------------------------------------------------------------------+
| admin_state_up          | UP                                                                                                                        |
| allowed_address_pairs   |                                                                                                                           |
| binding_host_id         | compute-1.redhat.local                                                                                                    |
| binding_profile         | pci_slot='0000:98:00.3', pci_vendor_info='15b3:101e', physical_network='provider1'                                        |
| binding_vif_details     | connectivity='l2', port_filter='False', vlan='177'                                                                        |
| binding_vif_type        | hw_veb                                                                                                                    |
| binding_vnic_type       | direct                                                                                                                    |
| created_at              | 2024-01-23T04:26:32Z                                                                                                      |
| data_plane_status       | None                                                                                                                      |
| description             | Created by cluster-api-provider-openstack cluster openshift-machine-api-x7g3jgblorhocpnfv-rqwzh                           |
| device_id               | 31cbd348-93b3-4a3c-b005-f39d0fee2757                                                                                      |
| device_owner            | compute:worker                                                                                                            |
| device_profile          | None                                                                                                                      |
| dns_assignment          | fqdn='host-192-168-177-79.openstacklocal.', hostname='host-192-168-177-79', ip_address='192.168.177.79'                   |
| dns_domain              |                                                                                                                           |
| dns_name                |                                                                                                                           |
| extra_dhcp_opts         |                                                                                                                           |
| fixed_ips               | ip_address='192.168.177.79', subnet_id='fa92ca2c-c4d6-4cd3-a736-ca1572c7cc7a'                                             |
| id                      | 06a11995-3e72-456a-af66-568049a7d619                                                                                      |
| ip_allocation           | None                                                                                                                      |
| mac_address             | fa:16:3e:e2:37:4f                                                                                                         |
| name                    | x7g3jgblorhocpnfv-rqwzh-worker-0-bt8cv-provider1                                                                          |
| network_id              | 3296c1bf-6600-4b57-87cf-79d133326417                                                                                      |
| numa_affinity_policy    | None                                                                                                                      |
| port_security_enabled   | False                                                                                                                     |
| project_id              | ee9d56de530b4cc283b6a6ee5b645e56                                                                                          |
| propagate_uplink_status | None                                                                                                                      |
| qos_network_policy_id   | None                                                                                                                      |
| qos_policy_id           | None                                                                                                                      |
| resource_request        | None                                                                                                                      |
| revision_number         | 9                                                                                                                         |
| security_group_ids      |                                                                                                                           |
| status                  | ACTIVE                                                                                                                    |
| tags                    | cluster-api-provider-openstack, openshift-machine-api-x7g3jgblorhocpnfv-rqwzh, openshiftClusterID=x7g3jgblorhocpnfv-rqwzh |
| trunk_details           | None                                                                                                                      |
| updated_at              | 2024-01-24T06:02:27Z                                                                                                      |
+-------------------------+---------------------------------------------------------------------------------------------------------------------------+

$ openstack port show --fit-width 082e613f-64e9-40da-9622-956d6bd9309c
+-------------------------+---------------------------------------------------------------------------------------------------------------------------+
| Field                   | Value                                                                                                                     |
+-------------------------+---------------------------------------------------------------------------------------------------------------------------+
| admin_state_up          | UP                                                                                                                        |
| allowed_address_pairs   |                                                                                                                           |
| binding_host_id         | compute-1.redhat.local                                                                                                    |
| binding_profile         | pci_slot='0000:98:01.4', pci_vendor_info='15b3:101e', physical_network='provider2'                                        |
| binding_vif_details     | connectivity='l2', port_filter='False', vlan='178'                                                                        |
| binding_vif_type        | hw_veb                                                                                                                    |
| binding_vnic_type       | direct                                                                                                                    |
| created_at              | 2024-01-23T04:26:33Z                                                                                                      |
| data_plane_status       | None                                                                                                                      |
| description             | Created by cluster-api-provider-openstack cluster openshift-machine-api-x7g3jgblorhocpnfv-rqwzh                           |
| device_id               | 31cbd348-93b3-4a3c-b005-f39d0fee2757                                                                                      |
| device_owner            | compute:worker                                                                                                            |
| device_profile          | None                                                                                                                      |
| dns_assignment          | fqdn='host-192-168-178-249.openstacklocal.', hostname='host-192-168-178-249', ip_address='192.168.178.249'                |
| dns_domain              |                                                                                                                           |
| dns_name                |                                                                                                                           |
| extra_dhcp_opts         |                                                                                                                           |
| fixed_ips               | ip_address='192.168.178.249', subnet_id='b28950e5-575f-470c-ab41-d8e998a7f19b'                                            |
| id                      | 082e613f-64e9-40da-9622-956d6bd9309c                                                                                      |
| ip_allocation           | None                                                                                                                      |
| mac_address             | fa:16:3e:11:3b:0c                                                                                                         |
| name                    | x7g3jgblorhocpnfv-rqwzh-worker-0-bt8cv-provider2                                                                          |
| network_id              | 3782f274-2806-48db-bcf5-09baee292a6e                                                                                      |
| numa_affinity_policy    | None                                                                                                                      |
| port_security_enabled   | False                                                                                                                     |
| project_id              | ee9d56de530b4cc283b6a6ee5b645e56                                                                                          |
| propagate_uplink_status | None                                                                                                                      |
| qos_network_policy_id   | None                                                                                                                      |
| qos_policy_id           | None                                                                                                                      |
| resource_request        | None                                                                                                                      |
| revision_number         | 9                                                                                                                         |
| security_group_ids      |                                                                                                                           |
| status                  | ACTIVE                                                                                                                    |
| tags                    | cluster-api-provider-openstack, openshift-machine-api-x7g3jgblorhocpnfv-rqwzh, openshiftClusterID=x7g3jgblorhocpnfv-rqwzh |
| trunk_details           | None                                                                                                                      |
| updated_at              | 2024-01-24T06:03:09Z                                                                                                      |
+-------------------------+---------------------------------------------------------------------------------------------------------------------------+
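The report does not record the exact commands used in step 3 to disable port security. A minimal sketch of one way to do it with the OpenStack CLI follows, assuming the port IDs from the port list above. The `OPENSTACK` variable and the `disable_port_security` function are hypothetical names introduced here for illustration. Security groups are removed in the same call, because Neutron does not allow port security to be disabled on a port that still has security groups attached.

```shell
#!/usr/bin/env bash
# Assumption: the OpenStack CLI is installed and credentials are loaded.
# OPENSTACK can be overridden (for example with "echo") for a dry run.
OPENSTACK="${OPENSTACK:-openstack}"

# Hypothetical helper: remove all security groups from each given port and
# disable port security on it.
disable_port_security() {
  local port
  for port in "$@"; do
    "$OPENSTACK" port set --no-security-group --disable-port-security "$port"
  done
}

# Example invocation for the two SR-IOV ports of the worker VM (not executed here):
#   disable_port_security 06a11995-3e72-456a-af66-568049a7d619 \
#                         082e613f-64e9-40da-9622-956d6bd9309c
# The result can be checked with:
#   openstack port show -c port_security_enabled -c security_group_ids <port-id>
```

Setting `OPENSTACK=echo` before calling the function prints the commands instead of running them, which is a quick way to review what would change.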
4. Details of the OCP nodes, node policy, and network attachments, for reference:
$ oc get nodes
NAME                                     STATUS   ROLES                  AGE   VERSION
x7g3jgblorhocpnfv-rqwzh-master-0         Ready    control-plane,master   9d    v1.27.6+f67aeb3
x7g3jgblorhocpnfv-rqwzh-master-1         Ready    control-plane,master   9d    v1.27.6+f67aeb3
x7g3jgblorhocpnfv-rqwzh-master-2         Ready    control-plane,master   9d    v1.27.6+f67aeb3
x7g3jgblorhocpnfv-rqwzh-worker-0-bt8cv   Ready    worker                 8d    v1.27.6+f67aeb3
Performance Profile
$ oc get PerformanceProfile cnf-performanceprofile -o yaml
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  creationTimestamp: "2024-01-23T04:20:02Z"
  finalizers:
  - foreground-deletion
  generation: 1
  name: cnf-performanceprofile
  resourceVersion: "428030"
  uid: bad0d613-cefd-445d-bfa3-08ef005a1df8
spec:
  additionalKernelArgs:
  - nmi_watchdog=0
  - audit=0
  - mce=off
  - processor.max_cstate=1
  - idle=poll
  - intel_idle.max_cstate=0
  - amd_iommu=on
  cpu:
    isolated: 10-39
    reserved: 0-9
  globallyDisableIrqLoadBalancing: true
  hugepages:
    defaultHugepagesSize: 1G
    pages:
    - count: 20
      node: 0
      size: 1G
  nodeSelector:
    node-role.kubernetes.io/worker: ""
  realTimeKernel:
    enabled: false
Machine config pool:
$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-77318084f5f57ad3f8d371a4f4a11243   True      False      False      3              3                   3                     0                      9d
worker   rendered-worker-392b92e66bf543d3d07f262962c2beac   True      False      False      1              1                   1                     0                      9d
Pods in the openshift-sriov-network-operator namespace:
$ oc get pods --no-headers -n openshift-sriov-network-operator
network-resources-injector-68nct         1/1   Running   0   7d4h
network-resources-injector-9z6wf         1/1   Running   0   7d4h
network-resources-injector-n2grc         1/1   Running   0   7d4h
operator-webhook-22vsq                   1/1   Running   0   7d4h
operator-webhook-lgjnn                   1/1   Running   0   7d4h
operator-webhook-ppnfp                   1/1   Running   0   7d4h
sriov-device-plugin-5kts9                1/1   Running   0   7d3h
sriov-network-config-daemon-mznhr        1/1   Running   0   7d4h
sriov-network-operator-fbb787845-tj7k4   1/1   Running   0   7d4h
Details of SriovNetworkNodePolicy
$ oc get SriovNetworkNodePolicy -n openshift-sriov-network-operator
NAME        AGE
default     7d4h
provider1   7d3h
provider2   7d3h

$ oc get SriovNetworkNodePolicy -n openshift-sriov-network-operator provider1 -o yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  creationTimestamp: "2024-01-24T06:38:47Z"
  generation: 1
  name: provider1
  namespace: openshift-sriov-network-operator
  resourceVersion: "931467"
  uid: e9fa1569-ee45-461f-b0b1-3a96aa3a72d4
spec:
  deviceType: netdevice
  isRdma: true
  nicSelector:
    netFilter: openstack/NetworkID:3296c1bf-6600-4b57-87cf-79d133326417
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  numVfs: 1
  priority: 99
  resourceName: mlx_provider1

$ oc get SriovNetworkNodePolicy -n openshift-sriov-network-operator provider2 -o yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  creationTimestamp: "2024-01-24T06:38:46Z"
  generation: 1
  name: provider2
  namespace: openshift-sriov-network-operator
  resourceVersion: "931468"
  uid: 5b6aa998-797d-4c2f-8cc9-f031b42fb339
spec:
  deviceType: netdevice
  isRdma: true
  nicSelector:
    netFilter: openstack/NetworkID:3782f274-2806-48db-bcf5-09baee292a6e
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  numVfs: 1
  priority: 99
  resourceName: mlx_provider2
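One check that may be useful alongside the policies is whether the worker node actually advertises the resources they name, since the SR-IOV device plugin exposes each `resourceName` as `openshift.io/<resourceName>` in the node's allocatable resources. The sketch below assumes a logged-in `oc` client; the `OC` variable and the `node_allocatable` function are hypothetical names used only for illustration.

```shell
#!/usr/bin/env bash
# Assumption: "oc" is installed and logged in to the cluster.
# OC can be overridden (for example with "echo") for a dry run.
OC="${OC:-oc}"

# Hypothetical helper: print the allocatable resources of one node.
# Entries such as openshift.io/mlx_provider1 and openshift.io/mlx_provider2
# would be expected in the output if the policies were applied.
node_allocatable() {
  local node="$1"
  "$OC" get node "$node" -o jsonpath='{.status.allocatable}'
}

# Example invocation (not executed here):
#   node_allocatable x7g3jgblorhocpnfv-rqwzh-worker-0-bt8cv
```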
5. Before using the Crucible K8s endpoint, a predefined namespace must be created. The network attachments must be created in this namespace so that Crucible can identify its designated namespace and create the required resources there. When the run completes, the namespace is deleted automatically.
$ oc get network-attachment-definitions -n crucible-rickshaw
NAME                  AGE
sriov-provider1-net   3m42s
sriov-provider2-net   3m42s

$ oc get network-attachment-definitions -n crucible-rickshaw sriov-provider1-net -o yaml
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  annotations:
    k8s.v1.cni.cncf.io/resourceName: openshift.io/mlx_provider1
  creationTimestamp: "2024-01-31T10:39:48Z"
  generation: 1
  name: sriov-provider1-net
  namespace: crucible-rickshaw
  resourceVersion: "4090038"
  uid: 983d5156-5255-4887-8c89-0d68271874ae
spec:
  config: '{ "type": "host-device", "cniVersion": "0.3.1", "name": "sriov-provider1", "pciBusId": "0000:05:00.0", "ipam": {} }'

$ oc get network-attachment-definitions -n crucible-rickshaw sriov-provider2-net -o yaml
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  annotations:
    k8s.v1.cni.cncf.io/resourceName: openshift.io/mlx_provider2
  creationTimestamp: "2024-01-31T10:39:48Z"
  generation: 1
  name: sriov-provider2-net
  namespace: crucible-rickshaw
  resourceVersion: "4090040"
  uid: 72628caf-3c2d-4568-8c18-f66e72f5f7ea
spec:
  config: '{ "type": "host-device", "cniVersion": "0.3.1", "name": "sriov-provider2", "pciBusId": "0000:06:00.0", "ipam": {} }'
6. Crucible logs created during the test:
The Crucible benchmark main failure log events: http://storage.scalelab.redhat.com/psahoo/PerfTaskLog/debuging/trafficgen--2024-01-31_09%3A33%3A06_UTC--14b94215-ec4a-41da-88fe-fba4afc237ab/crucible.log.xz
Profile configuration details between the TRex client and the TestPMD server:
http://storage.scalelab.redhat.com/psahoo/PerfTaskLog/debuging/trafficgen--2024-01-31_09%3A33%3A06_UTC--14b94215-ec4a-41da-88fe-fba4afc237ab/config/trafficgen-mv-params.json
7. Details of the TestPMD server pod created by the Crucible Kubernetes (K8s) endpoint in the SRIOV worker VM:
http://storage.scalelab.redhat.com/psahoo/PerfTaskLog/debuging/trafficgen--2024-01-31_09%3A33%3A06_UTC--14b94215-ec4a-41da-88fe-fba4afc237ab/run/endpoint/k8s-1/kubectl-get-pod-server-1.json
8. Crucible engine logs from preparing the TestPMD execution in the server pod:
http://storage.scalelab.redhat.com/psahoo/PerfTaskLog/debuging/trafficgen--2024-01-31_09%3A33%3A06_UTC--14b94215-ec4a-41da-88fe-fba4afc237ab/run/iterations/iteration-1-fail/sample-1-fail-1/server/1/trafficgen-server-start-stderrout.txt
10. TRex binary-search log, which reports the Rx drop:
http://storage.scalelab.redhat.com/psahoo/PerfTaskLog/debuging/trafficgen--2024-01-31_09%3A33%3A06_UTC--14b94215-ec4a-41da-88fe-fba4afc237ab/run/iterations/iteration-1-fail/sample-1-fail-1/client/1/binary-search.txt.xz
[2024-01-31 09:39:02.895989][BSO] (Evaluating trial)
[2024-01-31 09:39:02.895994][BSO] (critical requirement failure, no packets were received between device pair: 0 -> 1, trial result: abort)
[2024-01-31 09:39:02.895999][BSO] (critical requirement failure, individual stream 100% RX packet loss , device pair: 0 -> 1, pg_ids: [1,64], trial result: abort)
[2024-01-31 09:39:02.896002][BSO] (critical requirement failure, no packets were received between device pair: 1 -> 0, trial result: abort)
[2024-01-31 09:39:02.896004][BSO] (critical requirement failure, individual stream 100% RX packet loss , device pair: 1 -> 0, pg_ids: [128,191], trial result: abort)
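To pull the failing device pairs out of a full binary-search log, a short filter such as the following can be used. This is a minimal sketch that assumes the log keeps the "critical requirement failure ... device pair: A -> B" wording shown in the excerpt above; the function name `failed_device_pairs` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a binary-search log on stdin and print each
# distinct device pair that appears in a "critical requirement failure" line.
failed_device_pairs() {
  { grep -F 'critical requirement failure' || true; } |
    { grep -oE 'device pair: [0-9]+ -> [0-9]+' || true; } |
    sort -u
}

# Example invocation on the compressed log (not executed here):
#   xzcat binary-search.txt.xz | failed_device_pairs
```

For the excerpt above this would list both directions, `0 -> 1` and `1 -> 0`, which matches the 100% RX loss reported for bidirectional traffic.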
Actual results: 100% RX packet loss
Expected results: With the ports configured correctly, TestPMD in the pod processes the traffic and TRex receives packets on both device pairs without packet drops.
Additional info: The ShiftOnStack cluster environment is accessible for real-time troubleshooting of packet drop issues.