• Type: Task
    • Resolution: Done
    • Undefined
    • Nov 15 release
    • RHOAI, Training

      I was asked to assist with a high-priority customer case that has been open for 3 months, in which the customer experienced CSI-driver errors and some of their pods were getting stuck.

      Looking into the customer logs, I saw multiple issues on that node, including cgroup OOMs, timeouts, and I/O failures when writing to disk.

      Since this issue occurs randomly on a specific host and only in production (it could not be reproduced anywhere else), I investigated the hardware aspect of the issue and found that the failures occurred when the application was accessing a specific segment of memory, with the kernel-reported RIP (instruction pointer) values ranging from 0x564f832464fb to 0x7f6f9f52fd0e:

      kernel: RIP: 0033:0x564f832464fb 
      kernel: RIP: 0033:0x7f6f9f4429f5 
      kernel: RIP: 0033:0x7f6f9f444547 
      kernel: RIP: 0033:0x7f6f9f4429f5 
      kernel: RIP: 0033:0x7f6f9f4429f5 
      kernel: RIP: 0033:0x7f6f9f52fd0e 
      kernel: RIP: 0033:0x7f6f9f444547 
      kernel: RIP: 0033:0x7f6f9f52f5d7 
      kernel: RIP: 0033:0x7f6f9f4429f5 
      kernel: RIP: 0033:0x7f6f9f444547 
      kernel: RIP: 0033:0x7f6f9f4429f5

      The repetition of the same few addresses matches the statistical pattern of this issue's occurrence. I asked for the ipmitool memory hardware logs, but because the issue is intermittent, it is not occurring at the moment. I left that request in the case for when the issue recurs, so as not to burden the customer with tasks beyond those they have already been asked to do.
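
      The repetition in the RIP values listed above can be tallied with a short script. The following is a minimal sketch in Python, not part of the case tooling: the function name count_rips and the regular expression are illustrative assumptions, and the input is only the excerpt quoted in this ticket.

```python
import re
from collections import Counter

# Matches kernel log lines such as "kernel: RIP: 0033:0x7f6f9f4429f5".
# The 4-hex-digit field before the colon is the code segment selector.
RIP_PATTERN = re.compile(r"RIP:\s*[0-9a-fA-F]{4}:(0x[0-9a-fA-F]+)")


def count_rips(log_text):
    """Return a Counter mapping each RIP address (hex string) to its count."""
    return Counter(m.group(1).lower() for m in RIP_PATTERN.finditer(log_text))


# The excerpt quoted in this ticket, copied verbatim.
LOG = """\
kernel: RIP: 0033:0x564f832464fb
kernel: RIP: 0033:0x7f6f9f4429f5
kernel: RIP: 0033:0x7f6f9f444547
kernel: RIP: 0033:0x7f6f9f4429f5
kernel: RIP: 0033:0x7f6f9f4429f5
kernel: RIP: 0033:0x7f6f9f52fd0e
kernel: RIP: 0033:0x7f6f9f444547
kernel: RIP: 0033:0x7f6f9f52f5d7
kernel: RIP: 0033:0x7f6f9f4429f5
kernel: RIP: 0033:0x7f6f9f444547
kernel: RIP: 0033:0x7f6f9f4429f5
"""

if __name__ == "__main__":
    # Print addresses from most to least frequent.
    for address, hits in count_rips(LOG).most_common():
        print(f"{address}  {hits}")
```

      On this excerpt, 0x7f6f9f4429f5 accounts for 5 of the 11 entries and 0x7f6f9f444547 for 3; the remaining three addresses appear once each. The same function could be pointed at a full journal dump if one becomes available.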

       

              bbenshab Boaz Ben Shabat