• Task
    • Resolution: Done
    • Undefined
    • Nov 15 release
    • None
    • None
    • None
    • RHOAI, Training
    • False
    • False

      I was asked to assist with a high-priority customer case that has been open for 3 months, in which the customer experienced CSI driver errors and some of his pods were getting stuck.

      Looking into the customer logs, I saw multiple issues on that node, including cgroup OOM events, timeouts, and I/O failures when writing to disk.

      Since this issue occurs randomly on one specific host and only in production (it could not be reproduced anywhere else), I investigated the hardware aspect of the issue and found that the errors occur when the application is executing within a specific memory range, 0x564f832464fb to 0x7f6f9f52fd0e, as shown by the RIP values in the kernel log:

      kernel: RIP: 0033:0x564f832464fb 
      kernel: RIP: 0033:0x7f6f9f4429f5 
      kernel: RIP: 0033:0x7f6f9f444547 
      kernel: RIP: 0033:0x7f6f9f4429f5 
      kernel: RIP: 0033:0x7f6f9f4429f5 
      kernel: RIP: 0033:0x7f6f9f52fd0e 
      kernel: RIP: 0033:0x7f6f9f444547 
      kernel: RIP: 0033:0x7f6f9f52f5d7 
      kernel: RIP: 0033:0x7f6f9f4429f5 
      kernel: RIP: 0033:0x7f6f9f444547 
      kernel: RIP: 0033:0x7f6f9f4429f5

      This recurrence of the same few addresses matches the issue's statistical pattern of occurrence. I asked for the ipmitool memory hardware logs, but because the issue is intermittent and not currently occurring, I left that request on the case for when the issue recurs, so as not to burden the customer with tasks beyond those he has already been asked to do.
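
      The recurrence of the RIP addresses listed above can be checked with a short script. The following is a minimal sketch, not the tooling used on the case: the function name `tally_rip_addresses` and the regular expression are assumptions made for illustration, and the sample input is the eleven RIP lines quoted in this ticket. The pattern only matches the user-space form `RIP: 0033:0x...` shown here, not symbolized kernel-mode RIP lines.

```python
import re
from collections import Counter

# Matches lines such as "kernel: RIP: 0033:0x7f6f9f4429f5" and captures the address.
# Assumption: the address is printed as a raw hex value, as in the lines quoted above.
RIP_PATTERN = re.compile(r"RIP:\s*[0-9a-fA-F]{4}:(0x[0-9a-fA-F]+)")


def tally_rip_addresses(log_lines):
    """Count how often each RIP address appears in the given log lines."""
    counts = Counter()
    for line in log_lines:
        match = RIP_PATTERN.search(line)
        if match:
            counts[match.group(1).lower()] += 1
    return counts


# The RIP lines quoted in this ticket, in their original order.
SAMPLE_LINES = """\
kernel: RIP: 0033:0x564f832464fb
kernel: RIP: 0033:0x7f6f9f4429f5
kernel: RIP: 0033:0x7f6f9f444547
kernel: RIP: 0033:0x7f6f9f4429f5
kernel: RIP: 0033:0x7f6f9f4429f5
kernel: RIP: 0033:0x7f6f9f52fd0e
kernel: RIP: 0033:0x7f6f9f444547
kernel: RIP: 0033:0x7f6f9f52f5d7
kernel: RIP: 0033:0x7f6f9f4429f5
kernel: RIP: 0033:0x7f6f9f444547
kernel: RIP: 0033:0x7f6f9f4429f5
""".splitlines()

if __name__ == "__main__":
    # Print addresses from most to least frequent.
    for address, count in tally_rip_addresses(SAMPLE_LINES).most_common():
        print(f"{address}  {count}")
```

      On the quoted lines, two addresses (0x7f6f9f4429f5 and 0x7f6f9f444547) account for 8 of the 11 entries. The same function could be fed the full journal or dmesg output when the issue recurs, to see whether the same addresses dominate.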

       

              bbenshab@redhat.com Boaz Ben Shabat