  1. OpenShift Bugs
  2. OCPBUGS-29534

Windows worker nodes get bluescreen


Details

    • Bug
    • Resolution: Unresolved
    • Normal
    • None
    • 4.13
    • Windows Containers
    • None
    • Moderate
    • No
    • 3
    • WINC - Sprint 254
    • 1
    • False

    Description

       

      Description of problem:

          Since upgrading from OCP 4.11 to 4.12, and also on 4.13, we have been experiencing random bluescreens (BSODs) on Windows worker nodes. We suspected that the BSOD happens after a reboot, so we set up hourly drains and reboots of the Windows worker nodes. During the week we have already received a second BSOD, for which we have a memory crash dump. The bluescreen reports NDIS.SYS.

          The only known way to fix a node is to reinstall it or restore it from backup. The BSOD can be reproduced on an affected node by rebooting it. It is even possible to start an affected worker node in non-network mode and trigger the bluescreen by starting network or Kubernetes-based services, e.g. kube-proxy, containerd, or the Network List Service. Crash dump and must-gather are attached.

      Define the value or impact to you or the business:

          Long downtimes, because the node cannot be fixed by a reboot.

      Microsoft Support Case: 2402200030004965
      When does this behavior occur? Frequency? Repeatedly? At certain times? 

      Version-Release number of selected component (if applicable):

          WMCO 8.1.1, Windows Server 2022 Standard

      How reproducible:

          Repeatedly and at random, after a reboot of a Windows worker node.

      Steps to Reproduce:

          1. Reboot an affected Windows worker node. In the customer environment this always produces a BSOD.

      Actual results:

          Having to reinstall the node.

      Expected results:

          Not having to reinstall the node and not receiving BSOD.

      Additional info:

          WinDbg !analyze -v output from the memory dump:

      *******************************************************************************
      *                                                                             *
      *                        Bugcheck Analysis                                    *
      *                                                                             *
      *******************************************************************************
      
      SYSTEM_SERVICE_EXCEPTION (3b)
      An exception happened while executing a system service routine.
      Arguments:
      Arg1: 00000000c0000005, Exception code that caused the BugCheck
      Arg2: fffff80088462e79, Address of the instruction which caused the BugCheck
      Arg3: ffffb88828952150, Address of the context record for the exception that caused the BugCheck
      Arg4: 0000000000000000, zero.
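
For triage, the four arguments above can be labelled mechanically. The following is a minimal Python sketch, not part of the original report: the helper name `describe_bugcheck_3b` and both lookup tables are hypothetical, and the argument meanings are taken from the WinDbg output quoted above. The only NTSTATUS value it names is 0xC0000005 (STATUS_ACCESS_VIOLATION), which is the exception code in Arg1.

```python
# Hypothetical helper for reading the bugcheck 0x3B arguments quoted above.
# The NTSTATUS table is deliberately tiny: it lists only the code seen in this dump.
NTSTATUS_NAMES = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
}

# Argument meanings as printed by WinDbg for SYSTEM_SERVICE_EXCEPTION (3b).
ARG_MEANINGS_3B = (
    "exception code",
    "address of faulting instruction",
    "address of exception context record",
    "zero",
)


def describe_bugcheck_3b(args):
    """Return one human-readable line per argument of bugcheck 0x3B."""
    if len(args) != 4:
        raise ValueError("bugcheck 0x3B takes exactly four arguments")
    values = [int(a, 16) for a in args]  # WinDbg prints the arguments as hex
    lines = []
    for i, (value, meaning) in enumerate(zip(values, ARG_MEANINGS_3B), start=1):
        line = f"Arg{i}: 0x{value:016x} ({meaning})"
        if i == 1:
            # Only Arg1 is an NTSTATUS code; name it if we know it.
            line += " = " + NTSTATUS_NAMES.get(value, "unknown NTSTATUS")
        lines.append(line)
    return lines


# Values copied from the crash dump analysis in this report.
report = describe_bugcheck_3b([
    "00000000c0000005",
    "fffff80088462e79",
    "ffffb88828952150",
    "0000000000000000",
])
for line in report:
    print(line)
```

Read this way, the dump indicates an access violation raised in kernel mode while a system service routine was executing. In WinDbg, the context record address from Arg3 can be passed to `.cxr` to inspect the register state at the time of the fault.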

       

       

       

          People

            team-winc Team WinC
            rhn-support-vmedina1 Victor Medina
            Aharon Rasouli Aharon Rasouli