Type: Task
Resolution: Done
Priority: Undefined
Fix Version/s: Nov 15 release
Affects Version/s: None
Component/s: None
Labels: None

Workstream: RHOAI, Training
Ready: False
Blocked: False
Blocked Reason: None

SFDC Cases Counter:
SFDC Cases Links:
SFDC Cases Open:

Intelligence Requested:
Market:

I was asked to assist with a high-priority customer case that has been open for 3 months, in which the customer experienced CSI-driver errors and some of their pods were getting stuck.

Looking into the customer logs, I saw multiple issues on that node, including cgroup OOMs, timeouts, and I/O failures when writing to disk.

Since this issue occurs randomly on a specific host and only in production (it could not be reproduced anywhere else), I investigated the hardware aspect of the issue and found that the problem appeared when the application was accessing a specific memory range, from 0x564f832464fb to 0x7f6f9f52fd0e, as shown by the instruction pointer (RIP) values in the kernel log:

kernel: RIP: 0033:0x564f832464fb 
kernel: RIP: 0033:0x7f6f9f4429f5 
kernel: RIP: 0033:0x7f6f9f444547 
kernel: RIP: 0033:0x7f6f9f4429f5 
kernel: RIP: 0033:0x7f6f9f4429f5 
kernel: RIP: 0033:0x7f6f9f52fd0e 
kernel: RIP: 0033:0x7f6f9f444547 
kernel: RIP: 0033:0x7f6f9f52f5d7 
kernel: RIP: 0033:0x7f6f9f4429f5 
kernel: RIP: 0033:0x7f6f9f444547 
kernel: RIP: 0033:0x7f6f9f4429f5

This matches the statistical pattern in which the issue occurs. I asked for the ipmitool memory hardware logs, but because the issue is intermittent and is not currently occurring, I am leaving this information here for when it reoccurs, so as not to burden the customer with tasks beyond those they have already been asked to complete.
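
To check how the RIP values in the kernel log excerpt above are distributed, they can be counted with a short script. This is a minimal sketch and not part of the original investigation: the names summarize_rip and RIP_PATTERN are hypothetical, and it assumes every log line has the same "kernel: RIP: 0033:0x..." shape as the lines quoted in this description.

```python
import re
from collections import Counter

# Hypothetical pattern: matches user-space RIP entries shaped like
# "kernel: RIP: 0033:0x7f6f9f4429f5" (the shape quoted in this ticket).
RIP_PATTERN = re.compile(r"RIP:\s*0033:(0x[0-9a-fA-F]+)")

# The RIP lines quoted in the description, in their original order.
SAMPLE = """\
kernel: RIP: 0033:0x564f832464fb
kernel: RIP: 0033:0x7f6f9f4429f5
kernel: RIP: 0033:0x7f6f9f444547
kernel: RIP: 0033:0x7f6f9f4429f5
kernel: RIP: 0033:0x7f6f9f4429f5
kernel: RIP: 0033:0x7f6f9f52fd0e
kernel: RIP: 0033:0x7f6f9f444547
kernel: RIP: 0033:0x7f6f9f52f5d7
kernel: RIP: 0033:0x7f6f9f4429f5
kernel: RIP: 0033:0x7f6f9f444547
kernel: RIP: 0033:0x7f6f9f4429f5
"""


def summarize_rip(log_text):
    """Return (counts, lowest, highest) for the RIP values found in log_text.

    counts maps each address (as an int) to the number of times it appears.
    lowest and highest are None when no RIP entry is found.
    """
    addresses = [int(value, 16) for value in RIP_PATTERN.findall(log_text)]
    if not addresses:
        return Counter(), None, None
    return Counter(addresses), min(addresses), max(addresses)


if __name__ == "__main__":
    counts, lowest, highest = summarize_rip(SAMPLE)
    print(f"range: {lowest:#x} to {highest:#x}")
    # Most frequently repeated addresses first.
    for address, hits in counts.most_common():
        print(f"{address:#x}: {hits}")
```

The same function could be fed the full journal output from the affected node when the issue reoccurs, so that the addresses can be compared against the ones recorded here.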

Assignee: Boaz Ben Shabat

Reporter: Boaz Ben Shabat

Votes: 0

Watchers: 3

Created: 2024/02/28 12:46 PM

Updated: 2024/11/15 6:50 PM

Resolved: 2024/11/11 12:05 PM
