-
Bug
-
Resolution: Unresolved
-
Normal
-
None
-
4.15.z
Description of problem:
Pod-to-pod communication times out on only one node of the cluster.
The issue was first seen while setting up the nvidia-driver-daemonset.
Not all pods are affected: the "openshift-network-diagnostics" pods running on that host seem to work, but others are failing.
All failing pods report the same error:
dial tcp 172.30.0.1:443: i/o timeout
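To compare affected and unaffected pods or nodes, the same dial can be repeated with a small probe. The following is a minimal sketch only: the function name `check_tcp` and the 5-second default timeout are illustrative assumptions, and the address in the usage comment is the one from the error above.

```python
import socket
import time


def check_tcp(host, port, timeout=5.0):
    """Attempt a TCP connect; return (ok, detail, elapsed_seconds).

    Hypothetical helper for illustration, not part of any OpenShift tooling.
    """
    start = time.monotonic()
    try:
        # create_connection raises socket.timeout if the handshake does not
        # complete within `timeout` seconds.
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected", time.monotonic() - start
    except socket.timeout:
        return False, "i/o timeout", time.monotonic() - start
    except OSError as exc:
        # Covers refused connections, unreachable networks, and similar errors.
        return False, f"error: {exc}", time.monotonic() - start


# Example call from inside a pod on the affected node (address from the report):
# ok, detail, elapsed = check_tcp("172.30.0.1", 443)
# print(ok, detail, round(elapsed, 2))
```

Running the same call from a pod on a healthy node gives a baseline: a timeout on the affected node only would match the behaviour described here.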
Version-Release number of selected component (if applicable):
OpenShift 4.15.28
How reproducible:
Appears to be always reproducible on that specific node.
Steps to Reproduce:
1. Deploy nvidia-driver-daemonset
Actual results:
The only error observed on that node so far is:
$ less openvswitch/journalctl_--no-pager_--unit_ovs-vswitchd
...
Aug 08 04:11:43 node.cluster.example.com ovs-vswitchd[3116]: ovs|00002|dpif(handler416)|WARN|system@ovs-system: execute ct(commit,zone=111,mark=0/0x1,nat(src)),ct(zone=42,nat),recirc(0x11a957) failed (Invalid argument) on packet tcp,vlan_tci=0x0000,dl_src=0a:58:xx:yy:zz:e7,dl_dst=0a:58:xx:yy:zz:18,nw_src=10.xxx.17.231,nw_dst=10.xxx.16.24,nw_tos=0,nw_ecn=0,nw_ttl=64,nw_frag=no,tp_src=8140,tp_dst=40832,tcp_flags=psh|ack tcp_csum:6a30
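When scanning a longer journal for these warnings, it can help to pull out the conntrack zones and the packet endpoints from each line. The following is a rough sketch, assuming every warning follows the layout of the single line quoted above; the function name `parse_ct_failure` and the regular expression are illustrative and not part of OVS.

```python
import re

# Sample taken from the log line in this report (addresses already redacted).
SAMPLE = (
    "ovs|00002|dpif(handler416)|WARN|system@ovs-system: execute "
    "ct(commit,zone=111,mark=0/0x1,nat(src)),ct(zone=42,nat),recirc(0x11a957) "
    "failed (Invalid argument) on packet tcp,vlan_tci=0x0000,"
    "dl_src=0a:58:xx:yy:zz:e7,dl_dst=0a:58:xx:yy:zz:18,"
    "nw_src=10.xxx.17.231,nw_dst=10.xxx.16.24,nw_tos=0,nw_ecn=0,nw_ttl=64,"
    "nw_frag=no,tp_src=8140,tp_dst=40832,tcp_flags=psh|ack tcp_csum:6a30"
)

# Assumed layout: "execute <actions> failed (<error>) on packet <flow>".
_PATTERN = re.compile(
    r"execute (?P<actions>.+?) failed \((?P<error>[^)]*)\) on packet (?P<packet>\S+)"
)


def parse_ct_failure(line):
    """Return the failed actions, error, ct zones and packet fields, or None."""
    match = _PATTERN.search(line)
    if match is None:
        return None
    actions = match.group("actions")
    # The flow description starts with the protocol, followed by key=value pairs.
    protocol, *pairs = match.group("packet").split(",")
    fields = dict(pair.split("=", 1) for pair in pairs if "=" in pair)
    return {
        "actions": actions,
        "error": match.group("error"),
        "ct_zones": [int(zone) for zone in re.findall(r"zone=(\d+)", actions)],
        "protocol": protocol,
        "fields": fields,
    }


result = parse_ct_failure(SAMPLE)
print(result["ct_zones"], result["fields"]["nw_src"], result["fields"]["nw_dst"])
```

Grouping the parsed results by `ct_zones` or by source and destination address would show whether the failures are limited to particular pods on the node.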
Expected results:
No error
Additional info:
This is a baremetal node with a GPU, but it is not the only one: there are 2 others that are part of a different machine-config-pool and don't have any reported issues.
Affected Platforms:
Agnostic cluster with virtualized and baremetal nodes
- is related to: OCPBUGS-12251 Continuation of reopened BZ2100045 - OVS complains Invalid Argument on TCP packets going into conntrack (Closed)