-
Task
-
Resolution: Done
-
RHOAI, Training
I was asked to assist with a high-priority customer case that has been open for three months, in which the customer experienced CSI driver errors and some of his pods were getting stuck.
Looking into the customer logs, I saw multiple issues on the affected node, including cgroup OOM events, timeouts, and I/O failures when writing to disk.
Since the issue occurs randomly on a specific host and only in production (it could not be reproduced anywhere else), I investigated the hardware aspect of the issue and found that it occurred when the application was accessing a specific memory segment, from 0x564f832464fb to 0x7f6f9f52fd0e, as shown by the RIP values in the kernel log:
kernel: RIP: 0033:0x564f832464fb
kernel: RIP: 0033:0x7f6f9f4429f5
kernel: RIP: 0033:0x7f6f9f444547
kernel: RIP: 0033:0x7f6f9f4429f5
kernel: RIP: 0033:0x7f6f9f4429f5
kernel: RIP: 0033:0x7f6f9f52fd0e
kernel: RIP: 0033:0x7f6f9f444547
kernel: RIP: 0033:0x7f6f9f52f5d7
kernel: RIP: 0033:0x7f6f9f4429f5
kernel: RIP: 0033:0x7f6f9f444547
kernel: RIP: 0033:0x7f6f9f4429f5
This pattern matches the statistical, intermittent way the issue occurs. I asked for the ipmitool memory hardware logs, but because the issue is intermittent and is not occurring at the moment, I left that information on record for when the issue reoccurs, so as not to burden the customer with tasks beyond those he has already been asked to do.
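
The RIP values quoted above can be tallied with a short script the next time the issue reoccurs and fresh kernel logs are collected. The following is a minimal sketch, not a tool used in this case: the names `RIP_PATTERN` and `summarize_rips` are hypothetical, and the regular expression assumes the log entries have the same `RIP: <segment>:<address>` form as the excerpt above.

```python
import re
from collections import Counter

# Assumption: entries look like "kernel: RIP: 0033:0x7f6f9f4429f5",
# i.e. a 4-hex-digit segment selector, a colon, then the address.
RIP_PATTERN = re.compile(r"RIP:\s*([0-9a-fA-F]{4}):(0x[0-9a-fA-F]+)")


def summarize_rips(log_text):
    """Count how often each RIP address appears and report the range seen."""
    addresses = [int(addr, 16) for _segment, addr in RIP_PATTERN.findall(log_text)]
    counts = Counter(addresses)
    return {
        "total": len(addresses),
        "lowest": hex(min(addresses)) if addresses else None,
        "highest": hex(max(addresses)) if addresses else None,
        # Most frequent addresses first.
        "by_address": [(hex(addr), n) for addr, n in counts.most_common()],
    }


# The eleven entries quoted in this case.
sample = (
    "kernel: RIP: 0033:0x564f832464fb kernel: RIP: 0033:0x7f6f9f4429f5 "
    "kernel: RIP: 0033:0x7f6f9f444547 kernel: RIP: 0033:0x7f6f9f4429f5 "
    "kernel: RIP: 0033:0x7f6f9f4429f5 kernel: RIP: 0033:0x7f6f9f52fd0e "
    "kernel: RIP: 0033:0x7f6f9f444547 kernel: RIP: 0033:0x7f6f9f52f5d7 "
    "kernel: RIP: 0033:0x7f6f9f4429f5 kernel: RIP: 0033:0x7f6f9f444547 "
    "kernel: RIP: 0033:0x7f6f9f4429f5"
)

summary = summarize_rips(sample)
for address, occurrences in summary["by_address"]:
    print(address, occurrences)
print("range:", summary["lowest"], "to", summary["highest"])
```

Running the same function over logs from a later occurrence would show whether the same addresses keep recurring, which is the kind of evidence worth pairing with the ipmitool hardware logs once they are available.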