RHEL / RHEL-16612

Optimizations for x86-64 in zlib package: 3x decompression, 2x compression

    • Story
    • Resolution: Won't Do
    • ssg_core_services

      The CentOS zlib package (https://gitlab.com/redhat/centos-stream/rpms/zlib) appears to carry IBM Z compression optimization patches on top of canonical zlib.
       
      There is potential for major performance gains in the CentOS package on both x86-64 and Arm if SIMD optimizations are added to the current package.
       
      As an example, these are the numbers reported for the zlib package (1.2.11) shipped
      on CentOS Stream 8 when running zlib_bench (https://source.chromium.org/chromium/chromium/src/+/main:third_party/zlib/contrib/bench/zlib_bench.cc):
       
      [acavalca@spr3 ~]$ hostnamectl
         Static hostname: spr3.ra.intel.com
               Icon name: computer-server
                 Chassis: server
              Machine ID: 84c65e94b8a44c3abcb440894280dbd1
                 Boot ID: 3b0cf365c9444f238ed07c95dad9a97d
        Operating System: CentOS Stream 8
             CPE OS Name: cpe:/o:centos:centos:8
                  Kernel: Linux 4.18.0-497.el8.x86_64
            Architecture: x86-64
       
      [acavalca@spr3 ~]$ ldd ./zlib_bench_system
      linux-vdso.so.1 (0x00007ffd2a7a5000)
      libz.so.1 => /lib64/libz.so.1 (0x00007f316156e000)
      libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007f31611d9000)
      libm.so.6 => /lib64/libm.so.6 (0x00007f3160e57000)
      libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f3160c3f000)
      libc.so.6 => /lib64/libc.so.6 (0x00007f316087a000)
      /lib64/ld-linux-x86-64.so.2 (0x00007f3161786000)
       
      [acavalca@spr3 ~]$ rpm -qa | grep zlib
      zlib-1.2.11-25.el8.x86_64
      zlib-devel-1.2.11-25.el8.x86_64
       
      [acavalca@spr3 ~]$ ./zlib_bench_system gzip corpus/flex/*
      corpus/flex/baddata1.snappy              :
      GZIP: [b 1M] bytes  27512 ->  22920 83.31% comp  45.1 ( 45.3) MB/s uncomp 165.9 (166.2) MB/s
      corpus/flex/geo.protodata                :
      GZIP: [b 1M] bytes 118588 ->  15143 12.77% comp 100.4 (100.6) MB/s uncomp 568.5 (570.3) MB/s
      corpus/flex/html_x_4                     :
      GZIP: [b 1M] bytes 409600 ->  53299 13.01% comp  67.3 ( 67.4) MB/s uncomp 466.1 (466.5) MB/s
       
      The three files above come from the snappy data corpus (https://github.com/google/snappy/tree/main/testdata)
      and have varied entropy characteristics, which gives a rough overview of expected performance.
       
      The benchmark ran on a 4th Gen Xeon processor (Platinum 8480).
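       
      For readers without zlib_bench at hand, a rough throughput check can be sketched with Python's standard zlib module, which is typically linked against the system libz. This is only an illustration and not the benchmark used in this issue: the helper name measure_throughput and the synthetic input are assumptions, and the sketch uses the zlib wrapper rather than gzip, so its numbers are not directly comparable to the ones reported here.

```python
import time
import zlib


def measure_throughput(data: bytes, level: int = 6, repeats: int = 5):
    """Return (compress_MBps, decompress_MBps, size_ratio) for `data`.

    Hypothetical helper: keeps the fastest of `repeats` runs for each
    direction, which reduces noise from other activity on the machine.
    """
    best_comp = float("inf")
    best_decomp = float("inf")
    compressed = b""
    for _ in range(repeats):
        start = time.perf_counter()
        compressed = zlib.compress(data, level)
        best_comp = min(best_comp, time.perf_counter() - start)

        start = time.perf_counter()
        restored = zlib.decompress(compressed)
        best_decomp = min(best_decomp, time.perf_counter() - start)

        # Sanity check: the round trip must reproduce the input exactly.
        if restored != data:
            raise ValueError("round trip mismatch")

    megabytes = len(data) / 1e6
    # Guard against a zero interval on very small inputs.
    comp_mbps = megabytes / max(best_comp, 1e-9)
    decomp_mbps = megabytes / max(best_decomp, 1e-9)
    return comp_mbps, decomp_mbps, len(compressed) / len(data)


if __name__ == "__main__":
    # Synthetic, highly compressible input; a real check would read corpus files.
    sample = b"The quick brown fox jumps over the lazy dog. " * 20000
    comp, decomp, ratio = measure_throughput(sample)
    print(f"comp {comp:.1f} MB/s, uncomp {decomp:.1f} MB/s, ratio {ratio:.2%}")
```

      Running the same script against two different libz builds (for example by changing which shared library the interpreter loads) would give a first-order comparison, but zlib_bench remains the better tool for the figures discussed here.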
       
      Now the reported numbers for Chromium zlib:
      [acavalca@spr3 ~]$ ./chromium-zlib/tot/zlib_bench gzip corpus/flex/*
      /home/acavalca/corpus/flex/baddata1.snappy :
      GZIP: [b 1M] bytes  27512 ->  23255 84.53% comp  75.3 ( 75.8) MB/s uncomp 381.7 (383.0) MB/s
      /home/acavalca/corpus/flex/geo.protodata :
      GZIP: [b 1M] bytes 118588 ->  15178 12.80% comp 171.1 (171.6) MB/s uncomp 2339.5 (2401.7) MB/s
      /home/acavalca/corpus/flex/html_x_4      :
      GZIP: [b 1M] bytes 409600 ->  53243 13.00% comp 117.6 (117.8) MB/s uncomp 1705.3 (1708.0) MB/s
       
      And for Cloudflare zlib:
      [acavalca@spr3 ~]$ ./cloudflare-zlib/zlib_bench gzip corpus/flex/*
      corpus/flex/baddata1.snappy              :
      GZIP: [b 1M] bytes  27512 ->  23255 84.5% comp  84.7 ( 84.8) MB/s uncomp 300.7 (301.0) MB/s
      corpus/flex/geo.protodata                :
      GZIP: [b 1M] bytes 118588 ->  15178 12.8% comp 200.8 (201.6) MB/s uncomp 1934.7 (1939.3) MB/s
      corpus/flex/html_x_4                     :
      GZIP: [b 1M] bytes 409600 ->  53246 13.0% comp 139.0 (139.1) MB/s uncomp 1449.1 (1481.8) MB/s
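       
      The zlib_bench outputs above can be compared file by file with a small parser. The following is a minimal sketch, assuming the output layout shown in this issue (a file-name line ending in ":" followed by a "GZIP:" result line); the function name parse_bench is hypothetical, and only the first MB/s value of each pair is used.

```python
import re

# Matches the result line printed by zlib_bench in the transcripts above.
LINE_RE = re.compile(
    r"bytes\s+(?P<raw>\d+)\s+->\s+(?P<packed>\d+)\s+(?P<pct>[\d.]+)%\s+"
    r"comp\s+(?P<comp>[\d.]+)\s+\(\s*[\d.]+\)\s+MB/s\s+"
    r"uncomp\s+(?P<uncomp>[\d.]+)\s+\(\s*[\d.]+\)\s+MB/s"
)


def parse_bench(text: str) -> dict:
    """Map each file's base name to (comp_MBps, uncomp_MBps)."""
    results = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if line.endswith(":"):
            # File-name line, e.g. "corpus/flex/geo.protodata   :"
            name = line[:-1].strip().rsplit("/", 1)[-1]
            continue
        match = LINE_RE.search(line)
        if match and name is not None:
            results[name] = (float(match["comp"]), float(match["uncomp"]))
    return results


# Sample lines copied from the system zlib and Chromium zlib runs above.
SYSTEM = """
corpus/flex/geo.protodata                :
GZIP: [b 1M] bytes 118588 ->  15143 12.77% comp 100.4 (100.6) MB/s uncomp 568.5 (570.3) MB/s
corpus/flex/html_x_4                     :
GZIP: [b 1M] bytes 409600 ->  53299 13.01% comp  67.3 ( 67.4) MB/s uncomp 466.1 (466.5) MB/s
"""

CHROMIUM = """
/home/acavalca/corpus/flex/geo.protodata :
GZIP: [b 1M] bytes 118588 ->  15178 12.80% comp 171.1 (171.6) MB/s uncomp 2339.5 (2401.7) MB/s
/home/acavalca/corpus/flex/html_x_4      :
GZIP: [b 1M] bytes 409600 ->  53243 13.00% comp 117.6 (117.8) MB/s uncomp 1705.3 (1708.0) MB/s
"""

baseline = parse_bench(SYSTEM)
candidate = parse_bench(CHROMIUM)

# Per-file speedup as (compression, decompression) ratios.
speedups = {
    name: (candidate[name][0] / baseline[name][0],
           candidate[name][1] / baseline[name][1])
    for name in baseline
    if name in candidate
}

for name, (comp_x, uncomp_x) in speedups.items():
    print(f"{name}: comp {comp_x:.2f}x, uncomp {uncomp_x:.2f}x")
```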
       
      The potential decompression gain is over 3x (i.e. (381.5 +
      2325.4 + 1675.4) / (166.6 + 570.7 + 466.4) = 3.64) and the potential
      compression gain is 2x ((85 + 202.6 + 139.3) / (44.9 + 99.9 + 67.1) = 2.01)
      for this small data corpus sample.
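       
      The aggregate figures quoted above can be reproduced with a few lines of Python. The throughput values are the ones used in the arithmetic in this issue; the function names are hypothetical. The geometric mean of per-file ratios is included only as an alternative summary, since a ratio of sums weights the faster files more heavily.

```python
import math

# MB/s values as used in the arithmetic quoted in this issue.
DECOMP_OPTIMIZED = [381.5, 2325.4, 1675.4]
DECOMP_SYSTEM = [166.6, 570.7, 466.4]
COMP_OPTIMIZED = [85.0, 202.6, 139.3]
COMP_SYSTEM = [44.9, 99.9, 67.1]


def aggregate_speedup(optimized, baseline):
    """Ratio of summed throughputs, as computed in the issue text."""
    return sum(optimized) / sum(baseline)


def geomean_speedup(optimized, baseline):
    """Geometric mean of per-file ratios (alternative summary)."""
    ratios = [o / b for o, b in zip(optimized, baseline)]
    return math.prod(ratios) ** (1.0 / len(ratios))


decomp = aggregate_speedup(DECOMP_OPTIMIZED, DECOMP_SYSTEM)
comp = aggregate_speedup(COMP_OPTIMIZED, COMP_SYSTEM)
print(f"decompression: {decomp:.2f}x, compression: {comp:.2f}x")
print(f"geomean decompression: {geomean_speedup(DECOMP_OPTIMIZED, DECOMP_SYSTEM):.2f}x")
```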
       
      I don't have numbers for server-grade Arm processors, but I would
      expect similar gains.
       
      This small experiment should make the potential for considerable performance gains clear. Given that IBM Z specific patches are already maintained in the CentOS zlib package, it seems reasonable to follow a similar approach for other CPU architectures.
       

              Unassigned
              adenilsoncavalcanti Adenilson Cavalcanti (Inactive)