Type: Bug
Resolution: Duplicate
Priority: Critical
Fix Version/s: None
Affects Version/s: 4.6
Component/s: RHCOS
Labels:
- Cloud_Service
- hh1

Activity Type:
Incidents & Support
Blocked:
False
Blocked Reason:

The production cluster is not accessible due to the freezing of NICs.
Story Points:
None
Severity:
Critical
Regression:
None

Target Backport Versions:
None
Target Version:
None
Release Blocker:
Rejected
Sprint:
SDN Sprint 230, SDN Sprint 231
sprint_count:
2

Customer Impact:

Customer Escalated
RH Private Keywords:

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Priority Data:
PX Impact Score:

Release Note Status:
None
Release Note Type:
None
Release Note Text:
None

Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

Description of problem:

In the last two months, the customer's PaaS team has experienced network interface (NIC) “freezes” on multiple nodes in the production environment (OCP 4.6). This happens only on worker nodes.

Version-Release number of selected component (if applicable):

OCP 4.6

Actual results:
The NIC freezes and the customer gets errors like:

" Bridge|WARN| could not open network device veth<ID> (no such device)"

- This line appears 11331 times in the log, starting right around the time of the network freeze. Its first occurrence is at 22:04:41, and the log shows that this veth device was deleted seconds before that.
- The SDN container stops working (it no longer handles CNI_DEL or adds IPs for pods) around the freeze time. SDN container log command:
  crictl logs -t --tail 100 ID_OF_SDN_CONTAINER &> sdn-log.txt
- After the freeze, the SDN container only logs "SyncVNIDRules: 1 unsued VNID".
- In journalctl we saw the log line "about to del CNI network multus-cni-network (type=multuis)".

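The count and first-occurrence time quoted above could be reproduced from a saved log with a short script. This is a minimal sketch, not the method used on the case: the function name `summarize_veth_warnings` is hypothetical, and it assumes each log line carries an HH:MM:SS timestamp (as in `crictl logs -t` output or Open vSwitch log lines).

```python
import re
from typing import Iterable, Optional, Tuple

# Matches the warning quoted in this report; the veth suffix differs per pod.
WARN_RE = re.compile(r"could not open network device (veth\S+) \(no such device\)")
# Assumption: every log line contains an HH:MM:SS time somewhere in it.
TIME_RE = re.compile(r"(\d{2}:\d{2}:\d{2})")


def summarize_veth_warnings(lines: Iterable[str]) -> Tuple[int, Optional[str]]:
    """Return (number of matching warning lines, time of the first one seen)."""
    count = 0
    first_time: Optional[str] = None
    for line in lines:
        if not WARN_RE.search(line):
            continue
        count += 1
        if first_time is None:
            # Take the first timestamp found on a matching line.
            match = TIME_RE.search(line)
            if match:
                first_time = match.group(1)
    return count, first_time
```

Any iterable of lines works, including an open file object, so a saved log can be passed directly, e.g. `summarize_veth_warnings(open(path))` where `path` points at whichever log contains the Bridge|WARN lines.
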
Expected results:
NIC freezes must not happen.

Additional info:

OCP: 4.6
CASE: 03402417
The customer environment is an air-gapped, restricted cluster, so logs such as sosreports and must-gather (MG) cannot be provided.
Some logs have been attached to the case; please refer to them.
The customer is facing an issue similar to https://bugzilla.redhat.com/show_bug.cgi?id=1893088
Issue Type: Highly escalated

Cluster specifications
Prod - OCP version 4.6.27
Dev - OCP version 4.8.37
Worker nodes are installed on bare metal servers - Cisco UCSB-B200-M5
Master nodes are running on VMware VMs (Hybrid installation)

NOTE: After the node is restarted, it works perfectly fine.
The issue is intermittent.