OpenShift Bugs / OCPBUGS-41033

Failed to get valid hugepages count - Pods not able to start


    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Undefined
    • Affects Version/s: 4.16.z

      Description of problem:

      Pods requesting hugepages resources are not able to start; they fail with the message: Failed to get valid hugepages count
          

      Version-Release number of selected component (if applicable):

      4.16
          

      How reproducible:

      Always
          

      Steps to Reproduce:

          1. Deploy OCP 4.16 in Baremetal multinode cluster
          2. Apply a PerformanceProfile to the worker nodes to set up 2M hugepages
          3. Deploy an application that requests 2M hugepages (illustrative pod spec below)
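
      A minimal pod spec for step 3 could look like the sketch below; the name and image are placeholders, not the actual f5-tmm workload, and only the hugepages-2Mi request/limit pattern matters:

      apiVersion: v1
      kind: Pod
      metadata:
        name: hugepages-example        # placeholder name
      spec:
        containers:
        - name: app
          image: registry.example.com/app:latest   # placeholder image
          volumeMounts:
          - mountPath: /dev/hugepages
            name: hugepage
          resources:
            requests:
              hugepages-2Mi: 512Mi     # 256 x 2M pages; hugepages requests must equal limits
              memory: 256Mi            # a memory (or cpu) request must accompany hugepages
            limits:
              hugepages-2Mi: 512Mi
              memory: 256Mi
        volumes:
        - name: hugepage
          emptyDir:
            medium: HugePages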
          

      Actual results:

      Pods remain in CreateContainerError status and the container logs show the error: Failed to get valid hugepages count
          

      Expected results:

      Pods should be in Running status, with no hugepages-related errors.
          

      Additional info:

      This works in OCP 4.14 and earlier.
          

      This is the log output of the failed pod:

      > Namespace spk-dns46
      NAME                                       READY   STATUS                 RESTARTS   AGE   IP            NODE       NOMINATED NODE   READINESS GATES
      pod/f5-tmm-6d47ccbd75-j7ld4                2/3     CreateContainerError   0          11m   10.128.2.16   worker-2   <none>           <none>
      
      > Retrieving log from container f5-tmm-6d47ccbd75-j7ld4/f5-tmm in namespace spk-dns46
      + exec /opt/bin/mapres --tmm-tcl-file --tmm-args-name --allow-pipefail --info
      <do_delay> WARNING: TMM_MAPRES_DELAY_MS remaining: 5000 ms
      <is_k8s_in_cluster> INFO: Detected Kubernetes Environment
      <process_openshift_resources> WARNING: Environment contains PCIDEVICE_OPENSHIFT_IO_RES0SPKDNS46_INFO, not referenced by any OPENSHIFT_VFIO_RESOURCE_#. 
      <process_openshift_resources> WARNING: Environment contains PCIDEVICE_OPENSHIFT_IO_RES1SPKDNS46_INFO, not referenced by any OPENSHIFT_VFIO_RESOURCE_#. 
      NAME            BUS          SURVEY   U s# ksq  IPADDR                                      MAC              
      eth0                         KERNEL   D 05 000  10.128.2.16/23                              0a:58:0a:80:02:10
      eth0                         KERNEL   D 07 000  fd02:0000:0000:0005:0000:0000:0000:0010/64  0a:58:0a:80:02:10
      eth0                         KERNEL   D 08 000  fe80:0000:0000:0000:0858:0aff:fe80:0210/64  0a:58:0a:80:02:10
      net1            0000:12:01.0 USER     D 00 256  fe80:0000:0000:0000:9088:23ff:fe60:ad48/64  92:88:23:60:ad:48
      net2            0000:12:05.3 USER     D 01 257  fe80:0000:0000:0000:8055:65ff:fe57:92ea/64  82:55:65:57:92:ea
      ### END DUMP ###
      <enforce_single_numa_node> INFO: PCI device 0000:12:01.0 is in NUMA node 0
      <enforce_single_numa_node> INFO: PCI device 0000:12:05.3 is in NUMA node 0
      No TMM_MAPRES_IFC_CLAIM_LIST or empty.
      <get_prov_yaml_data> INFO: Loading /opt/lib/mapres/prov.yaml
      <read_prov_yaml_mem> INFO: Found spk specific mem_2mb_pages_per_cpu value: '768'
      <read_hugepages_cgroup_limit> ERROR: File I/O error (-5) while reading /sys/fs/cgroup/hugetlb/hugetlb.2MB.limit_in_bytes
      <read_hugepages_cgroup_limit> INFO: Returning cgroup_limit = '0'
      <get_hugepages> ERROR: USE_PHYS_MEM is enabled but no hugepages specified either through TMM_MAPRES_HUGEPAGES env var or /sys/fs/cgroup/hugetlb/hugetlb.2MB.limit_in_bytes file
      <get_tmm_args> ERROR: Failed to get valid hugepages count
      <main> ERROR: Failed to create cmdline args
      core config file does not exist
      > Retrieving log from container f5-tmm-6d47ccbd75-j7ld4/debug in namespace spk-dns46
      2024/08/29 22:28:17 WARN init: PAL_SYSINFO is missing or empty
      qkview-collect-daemon:start
      debug-sidecar: start
      qkview-collect Details: Client config details {Base:/etc/qkview-collect Overlay:/etc/qkview-collect/qkview-collect.config.yml GlobalTimeout:-1s LocalTimeout:-1s Outfile:/tmp/qkview.tar.gz PkgType:container MaxFileSize:25 RemovePrivateKeyFromFiles:true} base config file...
      qkview-collect Details: Environment details &{IsDevVersion:false HostMode:false TLSCABundle:/etc/ssl/certs/ca-root-cert.pem TLSCertificateFile:/etc/ssl/certs/server-cert.pem TLSKeyFile:/etc/ssl/certs/server-key.pem TLSCertRetryWait:5s SecureOnly:true UsingCertOrchestrator:true ContainerName:f5-debug-sidecar GrpcPort:19892 MaxFileSize:25 BaseCfgPath:/etc/qkview-collect ContainerOverlayPath:/etc/qkview-collect/qkview-collect.config.yml TotalCollectionTimeout:-1s IndividualCmdTimeout:-1s Outfile:/tmp/qkview.tar.gz PkgType:container RemovePrivateKeyFromFiles:true} base config file...
      qkview-collect Info: Starting GRPC server in secured mode
      qkview-collect Info: starting secure server
      2024-08-29T22:28:17.452Z [DEBU] logging/logger.go:65                          Logging Level: debug
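
      The failure starts at read_hugepages_cgroup_limit: the container reads the cgroup v1 hugetlb path /sys/fs/cgroup/hugetlb/hugetlb.2MB.limit_in_bytes. On a node running cgroup v2 (the default on recent OCP releases) that file does not exist; the v2 equivalent is hugetlb.2MB.max inside the pod's cgroup directory. Assuming that is the root cause here, which would also explain why the same image works on OCP 4.14 clusters still on cgroup v1, a quick check on the node is:

      $ stat -fc %T /sys/fs/cgroup/
      # cgroup2fs -> cgroup v2, tmpfs -> cgroup v1
      $ find /sys/fs/cgroup -name 'hugetlb.2MB.max' 2>/dev/null | head -3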
      

      This is the performance profile:

      ---
      kind: PerformanceProfile
      apiVersion: "performance.openshift.io/v2"
      metadata:
        name: blueprint-profile
      spec:
        cpu:
          isolated: "1-19,21-39,41-59,61-79"
          reserved: "0,40,20,60"
        additionalKernelArgs:
          - nohz_full=1-19,21-39,41-59,61-79
        hugepages:
          pages:
            - size: "1G"
              count: 32
              node: 0
            - size: "1G"
              count: 32
              node: 1
            - size: "2M"
              count: 12000
              node: 0
            - size: "2M"
              count: 12000
              node: 1
        realTimeKernel:
          enabled: false
        workloadHints:
          realTime: false
          highPowerConsumption: false
          perPodPowerManagement: true
        net:
          userLevelNetworking: false
        numa:
          topologyPolicy: "single-numa-node"
        nodeSelector:
          node-role.kubernetes.io/worker: ""
      ...
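
      To confirm the profile took effect, the kubelet's advertised hugepages on the affected node can be checked (worker-2 is the node from the pod listing above):

      $ oc get node worker-2 -o jsonpath='{.status.allocatable.hugepages-2Mi}{"\n"}'
      # with this profile the 2Mi pool should total 24000 pages = 48000Mi across both NUMA nodes
      $ oc describe node worker-2 | grep -i hugepages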
      

      In the sosreport from the node we can see the node has hugepages.

      $ grep -i hugepages proc/meminfo 
      AnonHugePages:    198656 kB
      ShmemHugePages:        0 kB
      FileHugePages:         0 kB
      HugePages_Total:   24000
      HugePages_Free:    24000
      HugePages_Rsvd:        0
      HugePages_Surp:        0
      Hugepagesize:       2048 kB
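
      Note that the HugePages_* counters in /proc/meminfo only cover the default hugepage size (2048 kB here), so 24000 matches the two 12000-page 2M pools from the profile; the per-NUMA split and the 1G pool can be read from sysfs on the node:

      $ cat /sys/devices/system/node/node*/hugepages/hugepages-2048kB/nr_hugepages
      # should print 12000 for node0 and node1, matching the profile
      $ cat /sys/devices/system/node/node*/hugepages/hugepages-1048576kB/nr_hugepages
      # should print 32 for each node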
      
      

              msivak@redhat.com Martin Sivak
              rhn-gps-manrodri Manuel Rodriguez
              Mallapadi Niranjan