RHEL / RHEL-67847

amdgpu fails to initialize when multiple AMD MI210 GPUs are assigned and the firmware is SeaBIOS [rhel-10]

    • Bug
    • Resolution: Unresolved
    • Major
    • rhel-10.0
    • rhel-10.0
    • seabios
    • No
    • Important
    • rhel-sst-virtualization
    • 16
    • 3
    • False
    • None
    • Red Hat Enterprise Linux
    • None
    • x86_64
    • None

      What were you trying to do that didn't work?
      The amdgpu driver fails to initialize when multiple AMD MI210 GPUs are assigned to a guest that uses SeaBIOS firmware.

      Please provide the package NVR for which bug is seen:
      kernel-5.14.0-528.el9.x86_64
      qemu-kvm-9.1.0-1.el9.x86_64
      libvirt-10.8.0-2.el9.x86_64
      seabios-bin-1.16.3-2.el9.noarch
      edk2-ovmf-20240524-8.el9.noarch

      How reproducible:
      100%

      Steps to reproduce
      1. Boot a RHEL 9.6 VM with 2x AMD MI210 GPUs assigned, using SeaBIOS firmware.
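      The report does not include the domain XML used for the GPU assignment. As a minimal sketch only, the snippet below builds libvirt `<hostdev>` elements for two passed-through PCI devices. The helper name `hostdev_xml` and the host PCI addresses in `GPU_ADDRESSES` are hypothetical placeholders, not values from this bug; on a real host they would come from `lspci -D`.

```python
import xml.etree.ElementTree as ET


def hostdev_xml(pci_address: str) -> str:
    """Build a libvirt <hostdev> element for one host PCI device.

    pci_address uses the 'dddd:bb:ss.f' form (domain:bus:slot.function).
    """
    domain, bus, slot_func = pci_address.split(":")
    slot, function = slot_func.split(".")

    hostdev = ET.Element("hostdev", mode="subsystem", type="pci", managed="yes")
    source = ET.SubElement(hostdev, "source")
    ET.SubElement(
        source,
        "address",
        domain=f"0x{domain}",
        bus=f"0x{bus}",
        slot=f"0x{slot}",
        function=f"0x{function}",
    )
    return ET.tostring(hostdev, encoding="unicode")


# Hypothetical host addresses of the two MI210 GPUs (assumption, not from the report).
GPU_ADDRESSES = ["0000:23:00.0", "0000:43:00.0"]

for addr in GPU_ADDRESSES:
    print(hostdev_xml(addr))
```

      Each printed element would go under `<devices>` in the guest definition. Whether the guest boots with SeaBIOS or OVMF is selected elsewhere in the domain XML and is not covered by this sketch.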

      Expected results
      The guest boots normally without a guest driver crash.

      Actual results
      The amdgpu driver crashes the guest kernel with the following log:

      
      [   54.688466] amdgpu 0000:05:00.0: BAR 6: can't assign [??? 0x00000000 flags 0x20000000] (bogus alignment)
      [   54.692068] amdgpu 0000:05:00.0: amdgpu: Fetched VBIOS from ROM
      [   54.692444] amdgpu: ATOM BIOS: 113-D67301V-073
      [   54.693535] [drm] CP firmware version too old, please update!
      [   54.693572] [drm] VCN(0) decode is enabled in VM mode
      [   54.694256] [drm] VCN(1) decode is enabled in VM mode
      [   54.694576] [drm] VCN(0) encode is enabled in VM mode
      [   54.694888] [drm] VCN(1) encode is enabled in VM mode
      [   54.695720] [drm] JPEG(0) JPEG decode is enabled in VM mode
      [   54.696073] [drm] JPEG(1) JPEG decode is enabled in VM mode
      [   54.696424] amdgpu 0000:05:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
      [   54.696970] amdgpu 0000:05:00.0: amdgpu: MEM ECC is active.
      [   54.697313] amdgpu 0000:05:00.0: amdgpu: SRAM ECC is active.
      [   54.697682] amdgpu 0000:05:00.0: amdgpu: RAS INFO: ras initialized successfully, hardware ability[7ff7f] ras_mask[7ff7f]
      [   54.698363] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
      [   54.915064] amdgpu 0000:05:00.0: BAR 2: releasing [mem 0x385800000000-0x3858001fffff 64bit pref]
      [   54.915718] amdgpu 0000:05:00.0: BAR 0: releasing [??? 0x00000000 flags 0x0]
      [   54.916231] [drm:amdgpu_device_resize_fb_bar [amdgpu]] *ERROR* Problem resizing BAR0 (-16).
      [   54.916598] amdgpu 0000:05:00.0: BAR 6: [??? 0x00000000 flags 0x20000000] has bogus alignment
      [   54.922662] amdgpu 0000:05:00.0: BAR 2: assigned [mem 0x384800000000-0x3848001fffff 64bit pref]
      [   56.679384] amdgpu 0000:05:00.0: amdgpu: VRAM: 65520M 0x0000020000000000 - 0x0000020FFEFFFFFF (65520M used)
      [   56.681873] amdgpu 0000:05:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
      [   56.682549] [drm:amdgpu_bo_init [amdgpu]] *ERROR* Unable to set WC memtype for the aperture base
      [   56.683593] [drm:amdgpu_device_ip_init [amdgpu]] *ERROR* sw_init of IP block <gmc_v9_0> failed -22
      [   56.684585] amdgpu 0000:05:00.0: amdgpu: amdgpu_device_ip_init failed
      [   56.685101] amdgpu 0000:05:00.0: amdgpu: Fatal error during GPU init
      [   56.685558] amdgpu 0000:05:00.0: amdgpu: amdgpu: finishing device.
      [   56.686280] amdgpu: probe of 0000:05:00.0 failed with error -22
      [   56.686775] BUG: kernel NULL pointer dereference, address: 0000000000000000
      [   56.687325] #PF: supervisor read access in kernel mode
      [   56.687750] #PF: error_code(0x0000) - not-present page
      [   56.688127] PGD 111e19067 P4D 0 
      [   56.688368] Oops: 0000 [#1] PREEMPT SMP NOPTI
      [   56.688727] CPU: 13 PID: 578 Comm: systemd-udevd Tainted: G           OE     -------  ---  5.14.0-427.42.1.el9_4.x86_64 #1
      [   56.689562] Hardware name: Red Hat KVM/RHEL, BIOS 1.16.3-2.el9 04/01/2014
      [   56.690062] RIP: 0010:amdgpu_mca_bank_set_release+0x18/0xb0 [amdgpu]
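
      For triage, the failure sequence in the log above (BAR resize error, WC memtype error, `gmc_v9_0` sw_init failure, probe failure, then the NULL pointer dereference) can be pulled out of a saved guest `dmesg` capture. The following is a rough sketch, not an official tool: the function name `summarize` and the regular expressions are assumptions based only on the line formats shown in this report, and `SAMPLE` reuses lines copied from that log.

```python
import re

# Kernel log lines carrying a DRM "*ERROR*" marker, with the leading timestamp.
ERROR_RE = re.compile(r"^\s*\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<msg>.*\*ERROR\*.*)$")
# Driver probe failures, e.g. "probe of 0000:05:00.0 failed with error -22".
PROBE_RE = re.compile(r"probe of (?P<dev>[0-9a-f:.]+) failed with error (?P<err>-?\d+)")
# The oops that follows the failed probe in this report.
OOPS_RE = re.compile(r"BUG: kernel NULL pointer dereference")


def summarize(log_text: str) -> dict:
    """Collect *ERROR* lines, probe failures, and whether a NULL deref oops occurred."""
    errors, probe_failures, oops = [], [], False
    for line in log_text.splitlines():
        m = ERROR_RE.match(line)
        if m:
            errors.append((float(m.group("ts")), m.group("msg")))
        m = PROBE_RE.search(line)
        if m:
            probe_failures.append((m.group("dev"), int(m.group("err"))))
        if OOPS_RE.search(line):
            oops = True
    return {"errors": errors, "probe_failures": probe_failures, "oops": oops}


# Lines copied from the log in this report.
SAMPLE = """\
[   54.916231] [drm:amdgpu_device_resize_fb_bar [amdgpu]] *ERROR* Problem resizing BAR0 (-16).
[   56.682549] [drm:amdgpu_bo_init [amdgpu]] *ERROR* Unable to set WC memtype for the aperture base
[   56.683593] [drm:amdgpu_device_ip_init [amdgpu]] *ERROR* sw_init of IP block <gmc_v9_0> failed -22
[   56.686280] amdgpu: probe of 0000:05:00.0 failed with error -22
[   56.686775] BUG: kernel NULL pointer dereference, address: 0000000000000000
"""

result = summarize(SAMPLE)
for ts, msg in result["errors"]:
    print(f"{ts:.6f}  {msg}")
print("probe failures:", result["probe_failures"])
print("oops seen:", result["oops"])
```

      The error codes in the log are negated Linux errno values: -16 is EBUSY and -22 is EINVAL.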
      
      

              rhn-engineering-ghoffman Gerd Hoffmann
              rhn-support-zhguo Zhiyi Guo
              virt-maint virt-maint
              Zhiyi Guo Zhiyi Guo