- Bug
- Resolution: Not a Bug
- 4.16.z
Description of problem:
Pods requesting hugepages resources are unable to start; they fail with the message: Failed to get valid hugepages count
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Always
Steps to Reproduce:
1. Deploy OCP 4.16 on a baremetal multinode cluster
2. Apply a PerformanceProfile to the worker nodes to configure 2M hugepages
3. Deploy an application that requests 2M hugepages
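For step 3, a pod consumes 2M hugepages through the hugepages-2Mi resource. A minimal sketch of such a pod spec (the name, image, and sizes are placeholders, not the actual SPK deployment):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hugepages-consumer            # hypothetical example pod
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest   # placeholder image
    resources:
      requests:
        hugepages-2Mi: 512Mi          # hugepages requests must be paired
        memory: 1Gi                   # with cpu/memory requests
        cpu: "2"
      limits:
        hugepages-2Mi: 512Mi
        memory: 1Gi
        cpu: "2"
    volumeMounts:
    - name: hugepage
      mountPath: /dev/hugepages
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages
```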
Actual results:
Pods remain in CreateContainerError status, and their logs show the error: Failed to get valid hugepages count
Expected results:
Pods should reach Running status, with no hugepages errors displayed.
Additional info:
This works in OCP 4.14 and below.
This is the log output of the failed pod:
> Namespace spk-dns46
NAME                        READY   STATUS                 RESTARTS   AGE   IP            NODE       NOMINATED NODE   READINESS GATES
pod/f5-tmm-6d47ccbd75-j7ld4 2/3     CreateContainerError   0          11m   10.128.2.16   worker-2   <none>           <none>

> Retrieving log from container f5-tmm-6d47ccbd75-j7ld4/f5-tmm in namespace spk-dns46
+ exec /opt/bin/mapres --tmm-tcl-file --tmm-args-name --allow-pipefail --info
<do_delay> WARNING: TMM_MAPRES_DELAY_MS remaining: 5000 ms
<is_k8s_in_cluster> INFO: Detected Kubernetes Environment
<process_openshift_resources> WARNING: Environment contains PCIDEVICE_OPENSHIFT_IO_RES0SPKDNS46_INFO, not referenced by any OPENSHIFT_VFIO_RESOURCE_#.
<process_openshift_resources> WARNING: Environment contains PCIDEVICE_OPENSHIFT_IO_RES1SPKDNS46_INFO, not referenced by any OPENSHIFT_VFIO_RESOURCE_#.
NAME   BUS            SURVEY   U    s#    ksq   IPADDR                                       MAC
eth0   KERNEL         D        05   000         10.128.2.16/23                               0a:58:0a:80:02:10
eth0   KERNEL         D        07   000         fd02:0000:0000:0005:0000:0000:0000:0010/64   0a:58:0a:80:02:10
eth0   KERNEL         D        08   000         fe80:0000:0000:0000:0858:0aff:fe80:0210/64   0a:58:0a:80:02:10
net1   0000:12:01.0   USER     D    00    256   fe80:0000:0000:0000:9088:23ff:fe60:ad48/64   92:88:23:60:ad:48
net2   0000:12:05.3   USER     D    01    257   fe80:0000:0000:0000:8055:65ff:fe57:92ea/64   82:55:65:57:92:ea
### END DUMP ###
<enforce_single_numa_node> INFO: PCI device 0000:12:01.0 is in NUMA node 0
<enforce_single_numa_node> INFO: PCI device 0000:12:05.3 is in NUMA node 0
No TMM_MAPRES_IFC_CLAIM_LIST or empty.
<get_prov_yaml_data> INFO: Loading /opt/lib/mapres/prov.yaml
<read_prov_yaml_mem> INFO: Found spk specific mem_2mb_pages_per_cpu value: '768'
<read_hugepages_cgroup_limit> ERROR: File I/O error (-5) while reading /sys/fs/cgroup/hugetlb/hugetlb.2MB.limit_in_bytes
<read_hugepages_cgroup_limit> INFO: Returning cgroup_limit = '0'
<get_hugepages> ERROR: USE_PHYS_MEM is enabled but no hugepages specified either through TMM_MAPRES_HUGEPAGES env var or /sys/fs/cgroup/hugetlb/hugetlb.2MB.limit_in_bytes file
<get_tmm_args> ERROR: Failed to get valid hugepages count
<main> ERROR: Failed to create cmdline args
core config file does not exist

> Retrieving log from container f5-tmm-6d47ccbd75-j7ld4/debug in namespace spk-dns46
2024/08/29 22:28:17 WARN init: PAL_SYSINFO is missing or empty
qkview-collect-daemon:start
debug-sidecar: start
qkview-collect Details: Client config details {Base:/etc/qkview-collect Overlay:/etc/qkview-collect/qkview-collect.config.yml GlobalTimeout:-1s LocalTimeout:-1s Outfile:/tmp/qkview.tar.gz PkgType:container MaxFileSize:25 RemovePrivateKeyFromFiles:true}
base config file...
qkview-collect Details: Environment details &{IsDevVersion:false HostMode:false TLSCABundle:/etc/ssl/certs/ca-root-cert.pem TLSCertificateFile:/etc/ssl/certs/server-cert.pem TLSKeyFile:/etc/ssl/certs/server-key.pem TLSCertRetryWait:5s SecureOnly:true UsingCertOrchestrator:true ContainerName:f5-debug-sidecar GrpcPort:19892 MaxFileSize:25 BaseCfgPath:/etc/qkview-collect ContainerOverlayPath:/etc/qkview-collect/qkview-collect.config.yml TotalCollectionTimeout:-1s IndividualCmdTimeout:-1s Outfile:/tmp/qkview.tar.gz PkgType:container RemovePrivateKeyFromFiles:true}
base config file...
qkview-collect Info: Starting GRPC server in secured mode
qkview-collect Info: starting secure server
2024-08-29T22:28:17.452Z [DEBU] logging/logger.go:65 Logging Level: debug
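The failing read in the log above targets the cgroup v1 hugetlb path, /sys/fs/cgroup/hugetlb/hugetlb.2MB.limit_in_bytes; on nodes running the unified cgroup v2 hierarchy the equivalent file is hugetlb.2MB.max, so a reader that only knows the v1 location comes back empty-handed. A minimal sketch of a version-tolerant lookup (a hypothetical helper for illustration, not the actual mapres code):

```python
import os

# Candidate locations for the container's 2M hugetlb limit. The first path is
# the cgroup v1 file that mapres reads; the second is the cgroup v2 (unified
# hierarchy) equivalent. Both names are stated here as assumptions for
# illustration.
HUGETLB_2M_FILES = [
    "hugetlb/hugetlb.2MB.limit_in_bytes",  # cgroup v1
    "hugetlb.2MB.max",                     # cgroup v2
]

def read_hugetlb_2m_limit(cgroup_root="/sys/fs/cgroup"):
    """Return the 2M hugetlb limit in bytes, or None if unset/unreadable."""
    for rel in HUGETLB_2M_FILES:
        try:
            with open(os.path.join(cgroup_root, rel)) as f:
                raw = f.read().strip()
        except OSError:
            continue  # file absent on this cgroup version; try the next one
        if raw == "max":  # cgroup v2 spells "no limit" as the literal "max"
            return None
        return int(raw)
    return None
```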
This is the performance profile:
---
kind: PerformanceProfile
apiVersion: "performance.openshift.io/v2"
metadata:
  name: blueprint-profile
spec:
  cpu:
    isolated: "1-19,21-39,41-59,61-79"
    reserved: "0,40,20,60"
  additionalKernelArgs:
    - nohz_full=1-19,21-39,41-59,61-79
  hugepages:
    pages:
      - size: "1G"
        count: 32
        node: 0
      - size: "1G"
        count: 32
        node: 1
      - size: "2M"
        count: 12000
        node: 0
      - size: "2M"
        count: 12000
        node: 1
  realTimeKernel:
    enabled: false
  workloadHints:
    realTime: false
    highPowerConsumption: false
    perPodPowerManagement: true
  net:
    userLevelNetworking: false
  numa:
    topologyPolicy: "single-numa-node"
  nodeSelector:
    node-role.kubernetes.io/worker: ""
...
In the sosreport from the node we can see that the hugepages are allocated:
$ grep -i hugepages proc/meminfo
AnonHugePages:    198656 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:   24000
HugePages_Free:    24000
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
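Those numbers line up with the profile: 12000 2M pages on each of the two NUMA nodes gives HugePages_Total of 24000, and all of them are still free. A throwaway parser to sanity-check the arithmetic (hypothetical, not part of any tooling):

```python
# meminfo excerpt copied from the sosreport above
MEMINFO = """\
HugePages_Total:   24000
HugePages_Free:    24000
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
"""

def parse_meminfo(text):
    """Map 'Key: value [unit]' lines to integer values (kB where a unit is given)."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values

info = parse_meminfo(MEMINFO)
# 2 NUMA nodes x 12000 pages = 24000 pages of 2048 kB each
total_kb = info["HugePages_Total"] * info["Hugepagesize"]
# total_kb == 49152000 kB, i.e. ~46.9 GiB of 2M hugepages on the node
```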