RHEL / RHEL-68997

kernel: Corruption of AArch64 SVE state

    • Yes
    • Important
    • 1
    • rhel-sst-arch-hw
    • ssg_platform_enablement
    • 1
    • 3
    • 2
    • Dev ack
    • False
    • No
    • Red Hat Enterprise Linux
    • Virt ARM 25-2
    • Unspecified Release Note Type - Unknown
    • aarch64
    • None

      When running a guest on A64FX, we hit a corruption after several guest reboots. We have at least two different reproducers: one where the corruption appears after more than a day (RHEL-22598), and another (RHEL-67106) where we generally hit it within tens of minutes. After further debugging at the QEMU level, we identified a code section in flatview_insert() that may be the cause of the corruption. If we comment out the memmove call and replace it with individual cell copies, we no longer hit the issue.

      static void flatview_insert(FlatView *view, unsigned pos, FlatRange *range)
      {
          int i = view->nr;
          if (view->nr == view->nr_allocated) {
              view->nr_allocated = MAX(2 * view->nr, 10);
              view->ranges = g_realloc(view->ranges,
                                       view->nr_allocated * sizeof(*view->ranges));
          }
      #if 0
          memmove(view->ranges + pos + 1, view->ranges + pos,
                  (view->nr - pos) * sizeof(FlatRange));
      #else
          while (i > pos) {
              view->ranges[i] = view->ranges[i - 1];
              i--;
          }
      #endif
          view->ranges[pos] = *range;
          memory_region_ref(range->mr);
          ++view->nr;
      }
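
The two code paths above should be functionally equivalent, so one way to look at the string routine in isolation is a small user-space comparison. The following is only a sketch under stated assumptions: `Cell` is a hypothetical stand-in for FlatRange (only its size and copy semantics matter), and `shift_memmove`, `shift_cells` and `compare_shifts` are illustrative names that do not exist in QEMU. It exercises memmove on overlapping ranges and compares the result with a back-to-front cell copy. It does not reproduce the guest-reboot scenario.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for FlatRange. All fields are 64-bit so the
 * struct has no padding and memcmp() is a valid comparison. */
typedef struct {
    uint64_t start, size, mr_tag, offset;
} Cell;

/* Open a one-cell gap at pos with memmove (the original code path). */
void shift_memmove(Cell *cells, size_t nr, size_t pos)
{
    assert(pos <= nr);
    memmove(cells + pos + 1, cells + pos, (nr - pos) * sizeof(Cell));
}

/* Open the same gap by copying cells one at a time, back to front. */
void shift_cells(Cell *cells, size_t nr, size_t pos)
{
    assert(pos <= nr);
    for (size_t i = nr; i > pos; i--) {
        cells[i] = cells[i - 1];
    }
}

/* Run both variants on identical random inputs.
 * Returns the number of trials where the results differ, or -1 if
 * allocation fails. */
int compare_shifts(size_t max_nr, unsigned trials)
{
    size_t bytes = (max_nr + 1) * sizeof(Cell); /* room for the new slot */
    Cell *a = malloc(bytes);
    Cell *b = malloc(bytes);
    int mismatches = 0;

    if (!a || !b) {
        free(a);
        free(b);
        return -1;
    }
    srand(1); /* fixed seed so a failing run can be repeated */
    for (unsigned t = 0; t < trials; t++) {
        size_t nr = (size_t)rand() % (max_nr + 1);
        size_t pos = (size_t)rand() % (nr + 1);

        memset(a, 0, bytes);
        for (size_t i = 0; i < nr; i++) {
            a[i].start = (uint64_t)rand();
            a[i].size = (uint64_t)rand();
            a[i].mr_tag = i;
            a[i].offset = t;
        }
        memcpy(b, a, bytes);

        shift_memmove(a, nr, pos);
        shift_cells(b, nr, pos);
        if (memcmp(a, b, bytes) != 0) {
            mismatches++;
        }
    }
    free(a);
    free(b);
    return mismatches;
}
```

Calling `compare_shifts(64, 100000)` from a small `main` and checking for a non-zero return would be one way to use it. A clean run would only show that memmove behaves correctly in this isolated setting.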
      

      So we wonder whether there could be something wrong with the memmove implementation on this A64FX hardware. After a discussion with fweimer@redhat.com, it looks like the glibc selectors for memset/memcpy/memmove in RHEL 9 check MIDR for A64FX.
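
To illustrate what a MIDR-based check of this kind looks like, here is a minimal sketch. It is not the glibc source. The field layout of MIDR_EL1 is architectural. The A64FX identity used here (implementer 0x46, part number 0x001) is my understanding of the values and should be treated as an assumption, as should the helper names and the use of the arm64 Linux sysfs file to read the register.

```c
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* MIDR_EL1 layout: implementer [31:24], variant [23:20],
 * architecture [19:16], part number [15:4], revision [3:0]. */
static inline unsigned midr_implementer(uint64_t midr)
{
    return (unsigned)((midr >> 24) & 0xff);
}

static inline unsigned midr_partnum(uint64_t midr)
{
    return (unsigned)((midr >> 4) & 0xfff);
}

/* Assumed A64FX identity: implementer 0x46 ('F', Fujitsu), part 0x001. */
#define IMPL_FUJITSU 0x46u
#define PART_A64FX   0x001u

/* True when the MIDR value matches the assumed A64FX identity. */
bool midr_is_a64fx(uint64_t midr)
{
    return midr_implementer(midr) == IMPL_FUJITSU &&
           midr_partnum(midr) == PART_A64FX;
}

/* Read MIDR_EL1 for cpu0 from sysfs on arm64 Linux.
 * Returns false if the file is missing or cannot be parsed. */
bool read_midr_cpu0(uint64_t *midr)
{
    FILE *f = fopen(
        "/sys/devices/system/cpu/cpu0/regs/identification/midr_el1", "r");
    bool ok;

    if (!f) {
        return false;
    }
    ok = fscanf(f, "%" SCNx64, midr) == 1; /* file holds a 0x... hex value */
    fclose(f);
    return ok;
}
```

On the affected host, `read_midr_cpu0()` followed by `midr_is_a64fx()` would show whether a selector keyed on these fields would pick the A64FX-specific routines.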

      So this Jira ticket is a request to produce a test build with the A64FX string routines ripped out so that glibc would use the generic implementation, just to see if it removes the issue.

              eauger Eric Auger
              Arch HW AArch64 Triage
              Liang Cong