RHEL-36148

glibc: Previously used TLS sometimes incorrectly reverted to initial state after dlopen [rhel-9]

    • glibc-2.34-122.el9_5
    • None
    • Moderate
    • ZStream
    • fe06fb313bddf7e4530056897d4a706606e49377
      5097cd344fd243fb8deb6dec96e8073753f962f9
    • 1
    • sst_pt_libraries
    • ssg_platform_tools
    • 3
    • False
    • Yes
    • SST PT Libraries Sprint 9
    • Approved Exception
    • Bug Fix
    • .TLS data is no longer overwritten by calls to `dlopen()` from an ELF constructor

      Previously, the `glibc` dynamic linker did not track the initialization status of thread-local storage (TLS) correctly in certain cases where the `dlopen()` function was invoked from an ELF constructor. Consequently, TLS data was reverted to its original value after it had been modified by the application. With this update, the dynamic linker uses a separate flag to track TLS initialization for each shared object. As a result, TLS data is no longer unexpectedly overwritten by calls to the `dlopen()` function from an ELF constructor.
    • Done
    • None

      When a library that is loaded and relocated as part of an application updates its TLS memory and is later loaded again with dlopen(), the second load can change the generation counter and leave the library's TLS area marked as unallocated. The TLS area is then reallocated the next time it is used, so any information stored in it before the dlopen() call is lost.
       
      This problem was reported by a customer using libomp and librocprofiler; in that case, libomp loses the mappings to its threads.
       
      This problem seems to have existed for quite some time. I have verified that it exists as far back as glibc-2.28 in RHEL 8, and it still exists in the latest glibc-2.39 found in Rawhide. In other words, practically all versions of glibc seem to be affected.
       
      The sequence of operations is as follows:
       
          Libraries A and B are loaded and relocated
          A's init constructor is called:
              Inside, A calls a function that resolves to B
              B accesses and alters its TLS
          B is then dlopen()'d by "C" (which may be A or B or neither)
              Inside, Glibc advances the generation counter and marks B's TLS as "unallocated"
          B accesses its TLS again, changes from before are lost
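The sequence above can be sketched as a small toy model. This is not glibc's implementation: `struct toy_module`, `toy_generation`, and the `honor_flag` parameter are hypothetical names invented for illustration, and treating 24 as the initial TLS value is an assumption based on the reproducer output. With `honor_flag` set to false, the model follows the steps as described; with it set to true, it loosely mimics the per-object initialization flag mentioned in the release note.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Values mirror the reproducer output; 24 as the initial value is an assumption. */
enum { TOY_INITIAL = 24, TOY_SET = 42 };

/* Toy stand-in for one shared object's TLS bookkeeping (not a glibc structure). */
struct toy_module {
    int init_image;        /* value copied into a freshly allocated block */
    int *block;            /* this thread's block; NULL means "unallocated" */
    bool tls_initialized;  /* hypothetical per-object flag */
};

/* Kept only to mirror the "generation counter" in the description. */
static unsigned long toy_generation = 1;

/* Lazy access: an unallocated block is allocated and filled from the initial image. */
static int *toy_tls_get(struct toy_module *m)
{
    assert(m != NULL);
    if (m->block == NULL) {
        m->block = malloc(sizeof *m->block);
        if (m->block == NULL)
            abort();
        *m->block = m->init_image;
        m->tls_initialized = true;
    }
    return m->block;
}

/* Second load of an object that is already loaded and relocated. */
static void toy_dlopen_again(struct toy_module *m, bool honor_flag)
{
    assert(m != NULL);
    toy_generation++;
    if (honor_flag && m->tls_initialized)
        return;            /* TLS already set up: leave the block alone */
    free(m->block);        /* problem path: the block becomes "unallocated" */
    m->block = NULL;
}

/* Runs the steps from the list and returns the value B reads at the end. */
int toy_run_sequence(bool honor_flag)
{
    struct toy_module b = { TOY_INITIAL, NULL, false };
    *toy_tls_get(&b) = TOY_SET;        /* B alters its TLS during A's constructor */
    toy_dlopen_again(&b, honor_flag);  /* "C" calls dlopen() on B */
    int seen = *toy_tls_get(&b);       /* B accesses its TLS again */
    free(b.block);
    return seen;
}
```

Calling `toy_run_sequence(false)` returns the initial value because the block was reallocated from the initial image, while `toy_run_sequence(true)` returns the value B stored earlier.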
       
      In the failing case (audit + rocprof), B is libhpcrun.so, A is libomp.so and "C" is librocprofiler.so. In the suspicious case (no audit + rocprof), B is libomp.so, and A and "C" are libomptarget.so.
       
      Sometimes this bug is masked because B's TLS is a static block: even though the TLS gets "reallocated" in the middle, it gets the same memory back without being reinitialized in between, so it looks like nothing happened. In other cases, the library whose TLS gets reallocated is written robustly enough that it simply reinitializes its TLS data, and the only apparent effect is a loss of allocated memory.
       
      Reproducible: Always
       
      Steps to Reproduce:
       
      Attached is the minimal reproducer I cooked up. It has libA.so and libB.so with the roles above ("C" is libB.so). main dlopen()'s libAB.so, a shim that loads libA.so and libB.so via DT_NEEDED so that their init constructors run in the right order. libA.so calls libB.so; libB.so sets a flag in its TLS, dlopen()'s itself, and then checks that the state is still set afterward. End result:
       
      $ make
      ./main
      Setting state to SET (42)
      State has reverted, expected state == SET (42) but got 24
      make: *** [Makefile:3: run] Aborted (core dumped)
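The attached sources are not reproduced here. As a rough sketch of the check that libB.so is described as performing, the function below sets a thread-local flag, calls dlopen() on a given path, and then compares the flag. The function name `check_tls_survives_dlopen`, the `self_path` parameter, and the use of 24 as the initial value are assumptions made for illustration; the real reproducer may be structured differently and aborts on failure instead of returning a status.

```c
#include <dlfcn.h>
#include <stdio.h>

/* Values taken from the reproducer output; 24 as the initial value is an assumption. */
enum { STATE_INITIAL = 24, STATE_SET = 42 };

/* Thread-local flag owned by this library. */
static _Thread_local int state = STATE_INITIAL;

/*
 * Hypothetical helper: sets the TLS flag, calls dlopen() on self_path
 * (intended to be this library's own path), and checks the flag afterward.
 * Returns 1 if the state survived, 0 if it reverted, -1 if dlopen() failed.
 */
int check_tls_survives_dlopen(const char *self_path)
{
    printf("Setting state to SET (%d)\n", STATE_SET);
    state = STATE_SET;

    void *handle = dlopen(self_path, RTLD_NOW);
    if (handle == NULL) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return -1;
    }

    int survived = (state == STATE_SET);
    if (!survived)
        fprintf(stderr,
                "State has reverted, expected state == SET (%d) but got %d\n",
                STATE_SET, state);

    dlclose(handle);
    return survived;
}
```

To exercise the problem described above, this would have to be built into a shared object and called from another library's ELF constructor, as in the libA.so and libB.so arrangement. On glibc versions older than 2.34, linking also needs `-ldl`.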
       
      It also has a Containerfile for easy testing on major distros; just choose your base image and the script will do the rest:
       
      $ podman build --from=registry.access.redhat.com/ubi9/ubi:9.4 path/to/tls-reallocation/
       
      Attached are the required programs for the reproducer steps.
       

      Affected glibc version:

      glibc-2.34-100.el9.x86_64 ==> RHEL9.4

            skolosov@redhat.com Sergey Kolosov
            rhn-support-vrajput Virendrasingh Rajput
            Patsy Griffin
            Patsy Griffin
            Sergey Kolosov
            Lenka Špačková