Red Hat OpenStack Services on OpenShift
OSPRH-8712

Socket IDs have the potential to be invalid when CPU power management is enabled


    • Bug
    • Resolution: Done-Errata
    • Blocker
    • rhos-18.0.3
    • rhos-18.0.0
    • openstack-nova
    • No Docs Impact
    • OSPRH-811 - Red Hat OpenStack 18.0 Greenfield Deployment
    • openstack-nova-27.5.1-18.0.20240830154702.3e75b4f.el9osttrunk
    • .NUMA resource tracking works correctly

      With this release, a bug that causes NUMA resource tracking issues has been fixed. Previously, Libvirt reported all powered down CPUs on NUMA node 0 instead of on the correct NUMA node. Now, Nova caches the correct CPU topology before powering down any CPUs, fixing the resource tracking issues.
    • Bug Fix
    • Done
    • Moderate

      When CPU power management is enabled, the socket ID for a CPU can be reported as 0 even though the CPU actually belongs to another socket. For example, here is the output of 'virsh capabilities' on a multi-socket host where the CPUs located on socket 1 are instead reporting socket 0.

              <cell id='3'>
                <memory unit='KiB'>66049368</memory>
                <pages unit='KiB' size='4'>3929430</pages>
                <pages unit='KiB' size='2048'>0</pages>
                <pages unit='KiB' size='1048576'>48</pages>
                <distances>
                  <sibling id='0' value='32'/>
                  <sibling id='1' value='32'/>
                  <sibling id='2' value='12'/>
                  <sibling id='3' value='10'/>
                </distances>
                <cpus num='24'>
                  <cpu id='36' socket_id='1' die_id='1' cluster_id='65535' core_id='16' siblings='36,84'/>
                  <cpu id='37' socket_id='1' die_id='1' cluster_id='65535' core_id='17' siblings='37,85'/>
                  <cpu id='38' socket_id='1' die_id='1' cluster_id='65535' core_id='18' siblings='38,86'/>
                  <cpu id='39' socket_id='1' die_id='1' cluster_id='65535' core_id='20' siblings='39,87'/>
                  <cpu id='40' socket_id='1' die_id='1' cluster_id='65535' core_id='21' siblings='40,88'/>
                  <cpu id='41' socket_id='0' die_id='0' cluster_id='0' core_id='0' siblings='41'/>
                  <cpu id='42' socket_id='0' die_id='0' cluster_id='0' core_id='0' siblings='42'/>
                  <cpu id='43' socket_id='0' die_id='0' cluster_id='0' core_id='0' siblings='43'/>
                  <cpu id='44' socket_id='0' die_id='0' cluster_id='0' core_id='0' siblings='44'/>
                  <cpu id='45' socket_id='0' die_id='0' cluster_id='0' core_id='0' siblings='45'/>
                  <cpu id='46' socket_id='0' die_id='0' cluster_id='0' core_id='0' siblings='46'/>
                  <cpu id='47' socket_id='0' die_id='0' cluster_id='0' core_id='0' siblings='47'/>
                  <cpu id='84' socket_id='1' die_id='1' cluster_id='65535' core_id='16' siblings='36,84'/>
                  <cpu id='85' socket_id='1' die_id='1' cluster_id='65535' core_id='17' siblings='37,85'/>
                  <cpu id='86' socket_id='1' die_id='1' cluster_id='65535' core_id='18' siblings='38,86'/>
                  <cpu id='87' socket_id='1' die_id='1' cluster_id='65535' core_id='20' siblings='39,87'/>
                  <cpu id='88' socket_id='1' die_id='1' cluster_id='65535' core_id='21' siblings='40,88'/>
                  <cpu id='89' socket_id='1' die_id='1' cluster_id='65535' core_id='22' siblings='89'/>
                  <cpu id='90' socket_id='1' die_id='1' cluster_id='65535' core_id='24' siblings='90'/>
                  <cpu id='91' socket_id='1' die_id='1' cluster_id='65535' core_id='25' siblings='91'/>
                  <cpu id='92' socket_id='1' die_id='1' cluster_id='65535' core_id='26' siblings='92'/>
                  <cpu id='93' socket_id='1' die_id='1' cluster_id='65535' core_id='28' siblings='93'/>
                  <cpu id='94' socket_id='1' die_id='1' cluster_id='65535' core_id='29' siblings='94'/>
                  <cpu id='95' socket_id='1' die_id='1' cluster_id='65535' core_id='30' siblings='95'/>
                </cpus>
              </cell>
      

      The offlined CPUs in the above example (41-47) are all reporting a socket ID of 0 instead of 1.

      [root@edpm-compute-0 cloud-admin]# cat /sys/bus/cpu/devices/cpu42/online 
      0
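The socket_id values in the cell above are what libvirt hands to Nova, and they can be pulled out programmatically. A minimal sketch, using an abbreviated virsh-style XML fragment based on the cell shown above:

```python
# Sketch: extract the socket_id reported for each CPU from the <cpus>
# section of a `virsh capabilities` NUMA cell. The fragment below is
# abbreviated from the cell shown above; CPUs 41 and 42 are offlined
# and wrongly report socket 0.
import xml.etree.ElementTree as ET

cell_xml = """
<cell id='3'>
  <cpus num='4'>
    <cpu id='40' socket_id='1' core_id='21' siblings='40,88'/>
    <cpu id='41' socket_id='0' core_id='0' siblings='41'/>
    <cpu id='42' socket_id='0' core_id='0' siblings='42'/>
    <cpu id='88' socket_id='1' core_id='21' siblings='40,88'/>
  </cpus>
</cell>
"""

cell = ET.fromstring(cell_xml)
sockets = {int(c.get('id')): int(c.get('socket_id'))
           for c in cell.iter('cpu')}
print(sockets)   # {40: 1, 41: 0, 42: 0, 88: 1}
```

On an affected host, comparing this mapping against /sys/bus/cpu/devices/cpu*/topology/physical_package_id (while the CPUs are online) makes the mismatch visible.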
      

      Checking the CPU topology before and after offlining a CPU shows that, once offline, the CPU's topology information is no longer exposed in sysfs, e.g.

      [root@edpm-compute-0 cloud-admin]# cat /sys/bus/cpu/devices/cpu42/online 
      1
      [root@edpm-compute-0 cloud-admin]# grep . /sys/bus/cpu/devices/cpu*/topology/*
      ....REMOVED FOR BREVITY.....
      /sys/bus/cpu/devices/cpu41/topology/cluster_cpus:02000000,00000200,00000000
      /sys/bus/cpu/devices/cpu41/topology/cluster_cpus_list:41,89
      /sys/bus/cpu/devices/cpu41/topology/cluster_id:65535
      /sys/bus/cpu/devices/cpu41/topology/core_cpus:02000000,00000200,00000000
      /sys/bus/cpu/devices/cpu41/topology/core_cpus_list:41,89
      /sys/bus/cpu/devices/cpu41/topology/core_id:22
      /sys/bus/cpu/devices/cpu41/topology/core_siblings:ffffff00,0000ffff,ff000000
      /sys/bus/cpu/devices/cpu41/topology/core_siblings_list:24-47,72-95
      /sys/bus/cpu/devices/cpu41/topology/die_cpus:ffffff00,0000ffff,ff000000
      /sys/bus/cpu/devices/cpu41/topology/die_cpus_list:24-47,72-95
      /sys/bus/cpu/devices/cpu41/topology/die_id:1
      /sys/bus/cpu/devices/cpu41/topology/package_cpus:ffffff00,0000ffff,ff000000
      /sys/bus/cpu/devices/cpu41/topology/package_cpus_list:24-47,72-95
      /sys/bus/cpu/devices/cpu41/topology/physical_package_id:1
      /sys/bus/cpu/devices/cpu41/topology/ppin:0x2b55f59dfa78030
      /sys/bus/cpu/devices/cpu41/topology/thread_siblings:02000000,00000200,00000000
      /sys/bus/cpu/devices/cpu41/topology/thread_siblings_list:41,89
      /sys/bus/cpu/devices/cpu42/topology/cluster_cpus:04000000,00000400,00000000
      /sys/bus/cpu/devices/cpu42/topology/cluster_cpus_list:42,90
      /sys/bus/cpu/devices/cpu42/topology/cluster_id:65535
      /sys/bus/cpu/devices/cpu42/topology/core_cpus:04000000,00000400,00000000
      /sys/bus/cpu/devices/cpu42/topology/core_cpus_list:42,90
      /sys/bus/cpu/devices/cpu42/topology/core_id:24
      /sys/bus/cpu/devices/cpu42/topology/core_siblings:ffffff00,0000ffff,ff000000
      /sys/bus/cpu/devices/cpu42/topology/core_siblings_list:24-47,72-95
      /sys/bus/cpu/devices/cpu42/topology/die_cpus:ffffff00,0000ffff,ff000000
      /sys/bus/cpu/devices/cpu42/topology/die_cpus_list:24-47,72-95
      /sys/bus/cpu/devices/cpu42/topology/die_id:1
      /sys/bus/cpu/devices/cpu42/topology/package_cpus:ffffff00,0000ffff,ff000000
      /sys/bus/cpu/devices/cpu42/topology/package_cpus_list:24-47,72-95
      /sys/bus/cpu/devices/cpu42/topology/physical_package_id:1
      /sys/bus/cpu/devices/cpu42/topology/ppin:0x2b55f59dfa78030
      /sys/bus/cpu/devices/cpu42/topology/thread_siblings:04000000,00000400,00000000
      /sys/bus/cpu/devices/cpu42/topology/thread_siblings_list:42,90
      /sys/bus/cpu/devices/cpu43/topology/cluster_cpus:08000000,00000800,00000000
      /sys/bus/cpu/devices/cpu43/topology/cluster_cpus_list:43,91
      /sys/bus/cpu/devices/cpu43/topology/cluster_id:65535
      /sys/bus/cpu/devices/cpu43/topology/core_cpus:08000000,00000800,00000000
      /sys/bus/cpu/devices/cpu43/topology/core_cpus_list:43,91
      /sys/bus/cpu/devices/cpu43/topology/core_id:25
      /sys/bus/cpu/devices/cpu43/topology/core_siblings:ffffff00,0000ffff,ff000000
      /sys/bus/cpu/devices/cpu43/topology/core_siblings_list:24-47,72-95
      /sys/bus/cpu/devices/cpu43/topology/die_cpus:ffffff00,0000ffff,ff000000
      /sys/bus/cpu/devices/cpu43/topology/die_cpus_list:24-47,72-95
      /sys/bus/cpu/devices/cpu43/topology/die_id:1
      /sys/bus/cpu/devices/cpu43/topology/package_cpus:ffffff00,0000ffff,ff000000
      /sys/bus/cpu/devices/cpu43/topology/package_cpus_list:24-47,72-95
      /sys/bus/cpu/devices/cpu43/topology/physical_package_id:1
      /sys/bus/cpu/devices/cpu43/topology/ppin:0x2b55f59dfa78030
      /sys/bus/cpu/devices/cpu43/topology/thread_siblings:08000000,00000800,00000000
      /sys/bus/cpu/devices/cpu43/topology/thread_siblings_list:43,91
      
      echo 0 > /sys/bus/cpu/devices/cpu42/online
      grep . /sys/bus/cpu/devices/cpu*/topology/*
      .....REMOVED FOR BREVITY.....
      /sys/bus/cpu/devices/cpu41/topology/cluster_cpus:02000000,00000200,00000000
      /sys/bus/cpu/devices/cpu41/topology/cluster_cpus_list:41,89
      /sys/bus/cpu/devices/cpu41/topology/cluster_id:65535
      /sys/bus/cpu/devices/cpu41/topology/core_cpus:02000000,00000200,00000000
      /sys/bus/cpu/devices/cpu41/topology/core_cpus_list:41,89
      /sys/bus/cpu/devices/cpu41/topology/core_id:22
      /sys/bus/cpu/devices/cpu41/topology/core_siblings:ffffff00,0000fbff,ff000000
      /sys/bus/cpu/devices/cpu41/topology/core_siblings_list:24-41,43-47,72-95
      /sys/bus/cpu/devices/cpu41/topology/die_cpus:ffffff00,0000fbff,ff000000
      /sys/bus/cpu/devices/cpu41/topology/die_cpus_list:24-41,43-47,72-95
      /sys/bus/cpu/devices/cpu41/topology/die_id:1
      /sys/bus/cpu/devices/cpu41/topology/package_cpus:ffffff00,0000fbff,ff000000
      /sys/bus/cpu/devices/cpu41/topology/package_cpus_list:24-41,43-47,72-95
      /sys/bus/cpu/devices/cpu41/topology/physical_package_id:1
      /sys/bus/cpu/devices/cpu41/topology/ppin:0x2b55f59dfa78030
      /sys/bus/cpu/devices/cpu41/topology/thread_siblings:02000000,00000200,00000000
      /sys/bus/cpu/devices/cpu41/topology/thread_siblings_list:41,89
      /sys/bus/cpu/devices/cpu43/topology/cluster_cpus:08000000,00000800,00000000
      /sys/bus/cpu/devices/cpu43/topology/cluster_cpus_list:43,91
      /sys/bus/cpu/devices/cpu43/topology/cluster_id:65535
      /sys/bus/cpu/devices/cpu43/topology/core_cpus:08000000,00000800,00000000
      /sys/bus/cpu/devices/cpu43/topology/core_cpus_list:43,91
      /sys/bus/cpu/devices/cpu43/topology/core_id:25
      /sys/bus/cpu/devices/cpu43/topology/core_siblings:ffffff00,0000fbff,ff000000
      /sys/bus/cpu/devices/cpu43/topology/core_siblings_list:24-41,43-47,72-95
      /sys/bus/cpu/devices/cpu43/topology/die_cpus:ffffff00,0000fbff,ff000000
      /sys/bus/cpu/devices/cpu43/topology/die_cpus_list:24-41,43-47,72-95
      /sys/bus/cpu/devices/cpu43/topology/die_id:1
      /sys/bus/cpu/devices/cpu43/topology/package_cpus:ffffff00,0000fbff,ff000000
      /sys/bus/cpu/devices/cpu43/topology/package_cpus_list:24-41,43-47,72-95
      /sys/bus/cpu/devices/cpu43/topology/physical_package_id:1
      /sys/bus/cpu/devices/cpu43/topology/ppin:0x2b55f59dfa78030
      /sys/bus/cpu/devices/cpu43/topology/thread_siblings:08000000,00000800,00000000
      /sys/bus/cpu/devices/cpu43/topology/thread_siblings_list:43,91
      

      In this state Nova is unable to accurately read socket information:

      2024-07-16 17:01:44.544 1 DEBUG nova.pci.stats [None req-e8e48005-14a0-4d63-8837-1fe477b817e8 f8fffcb972e24b40b588e9d14b76f1b6 7c9ac4ef27f14c9bab273e577ec9d47a - - default default] No socket information in host NUMA cell(s). _filter_pools_for_socket_affinity /usr/lib/python3.9/site-packages/nova/pci/stats.py:474
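The release note above says the fix makes Nova cache the correct CPU topology before powering down any CPUs. A minimal sketch of that caching idea, against a mock sysfs (all class and helper names here are illustrative, not Nova's actual code):

```python
# Sketch: record each CPU's socket while it is still online, and fall back
# to the cached value once the CPU has been offlined and sysfs no longer
# reports its topology.

class CpuSocketCache:
    def __init__(self, read_socket_id):
        # read_socket_id(cpu) returns the socket ID, or None when the CPU
        # is offline and its sysfs topology directory is missing
        self._read = read_socket_id
        self._cache = {}

    def socket_id(self, cpu):
        live = self._read(cpu)
        if live is not None:
            self._cache[cpu] = live  # refresh while the CPU is online
            return live
        # Offline: use the topology cached before the CPU was powered down
        return self._cache.get(cpu)

# Mock sysfs: CPUs 41-43 all belong to socket 1
sysfs = {41: 1, 42: 1, 43: 1}
cache = CpuSocketCache(sysfs.get)

assert cache.socket_id(42) == 1  # read and cached while online
del sysfs[42]                    # simulate offlining cpu42
assert cache.socket_id(42) == 1  # cache still reports the correct socket
```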
      

      A confirmed workaround is to disable power management and enable all CPUs, which bypasses the issue.
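The workaround amounts to a loop over the sysfs online files. Sketched here against a mock directory tree so it is safe to run unprivileged; on a real host SYSFS would be /sys/bus/cpu/devices and root privileges are required:

```shell
# Sketch: bring every CPU back online by writing 1 to each 'online' file.
SYSFS=$(mktemp -d)
mkdir -p "$SYSFS/cpu41" "$SYSFS/cpu42"
echo 1 > "$SYSFS/cpu41/online"
echo 0 > "$SYSFS/cpu42/online"   # cpu42 starts offlined

for f in "$SYSFS"/cpu*/online; do
  echo 1 > "$f"                  # on real hardware this requires root
done

cat "$SYSFS/cpu42/online"        # now 1: the CPU is (mock-)online again
```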

              alifshit@redhat.com Artom Lifshitz
              rhn-gps-jparker James Parker
              rhos-dfg-compute