- Bug
- Resolution: Not a Bug
- 4.16.z
Description of problem:
Pods requesting hugepages resources are unable to start; they fail with the message: Failed to get valid hugepages count
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Always
Steps to Reproduce:
1. Deploy OCP 4.16 on a baremetal multinode cluster
2. Apply a PerformanceProfile to the worker nodes to configure 2M hugepages
3. Deploy an application that requests 2M hugepages
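For step 3, a pod consumes 2M hugepages through the hugepages-2Mi resource. A minimal sketch of such a pod spec (the name, image, and sizes are placeholders, not the actual SPK deployment):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hugepages-consumer            # hypothetical example pod
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest   # placeholder image
    resources:
      requests:
        hugepages-2Mi: 512Mi          # hugepages requests must be paired
        memory: 1Gi                   # with cpu/memory requests
        cpu: "2"
      limits:
        hugepages-2Mi: 512Mi
        memory: 1Gi
        cpu: "2"
    volumeMounts:
    - name: hugepage
      mountPath: /dev/hugepages
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages
```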
Actual results:
Pods remain in CreateContainerError status, and their logs show the error: Failed to get valid hugepages count
Expected results:
Pods should reach Running status, with no hugepages errors displayed.
Additional info:
This works in OCP 4.14 and below.
This is the log output of the failed pod:
> Namespace spk-dns46
NAME                        READY   STATUS                 RESTARTS   AGE   IP            NODE       NOMINATED NODE   READINESS GATES
pod/f5-tmm-6d47ccbd75-j7ld4 2/3     CreateContainerError   0          11m   10.128.2.16   worker-2   <none>           <none>

> Retrieving log from container f5-tmm-6d47ccbd75-j7ld4/f5-tmm in namespace spk-dns46
+ exec /opt/bin/mapres --tmm-tcl-file --tmm-args-name --allow-pipefail --info
<do_delay> WARNING: TMM_MAPRES_DELAY_MS remaining: 5000 ms
<is_k8s_in_cluster> INFO: Detected Kubernetes Environment
<process_openshift_resources> WARNING: Environment contains PCIDEVICE_OPENSHIFT_IO_RES0SPKDNS46_INFO, not referenced by any OPENSHIFT_VFIO_RESOURCE_#.
<process_openshift_resources> WARNING: Environment contains PCIDEVICE_OPENSHIFT_IO_RES1SPKDNS46_INFO, not referenced by any OPENSHIFT_VFIO_RESOURCE_#.
NAME   BUS            SURVEY   U    s#    ksq   IPADDR                                       MAC
eth0   KERNEL         D        05   000         10.128.2.16/23                               0a:58:0a:80:02:10
eth0   KERNEL         D        07   000         fd02:0000:0000:0005:0000:0000:0000:0010/64   0a:58:0a:80:02:10
eth0   KERNEL         D        08   000         fe80:0000:0000:0000:0858:0aff:fe80:0210/64   0a:58:0a:80:02:10
net1   0000:12:01.0   USER     D    00    256   fe80:0000:0000:0000:9088:23ff:fe60:ad48/64   92:88:23:60:ad:48
net2   0000:12:05.3   USER     D    01    257   fe80:0000:0000:0000:8055:65ff:fe57:92ea/64   82:55:65:57:92:ea
### END DUMP ###
<enforce_single_numa_node> INFO: PCI device 0000:12:01.0 is in NUMA node 0
<enforce_single_numa_node> INFO: PCI device 0000:12:05.3 is in NUMA node 0
No TMM_MAPRES_IFC_CLAIM_LIST or empty.
<get_prov_yaml_data> INFO: Loading /opt/lib/mapres/prov.yaml
<read_prov_yaml_mem> INFO: Found spk specific mem_2mb_pages_per_cpu value: '768'
<read_hugepages_cgroup_limit> ERROR: File I/O error (-5) while reading /sys/fs/cgroup/hugetlb/hugetlb.2MB.limit_in_bytes
<read_hugepages_cgroup_limit> INFO: Returning cgroup_limit = '0'
<get_hugepages> ERROR: USE_PHYS_MEM is enabled but no hugepages specified either through TMM_MAPRES_HUGEPAGES env var or /sys/fs/cgroup/hugetlb/hugetlb.2MB.limit_in_bytes file
<get_tmm_args> ERROR: Failed to get valid hugepages count
<main> ERROR: Failed to create cmdline args
core config file does not exist

> Retrieving log from container f5-tmm-6d47ccbd75-j7ld4/debug in namespace spk-dns46
2024/08/29 22:28:17 WARN init: PAL_SYSINFO is missing or empty
qkview-collect-daemon:start
debug-sidecar: start
qkview-collect Details: Client config details {Base:/etc/qkview-collect Overlay:/etc/qkview-collect/qkview-collect.config.yml GlobalTimeout:-1s LocalTimeout:-1s Outfile:/tmp/qkview.tar.gz PkgType:container MaxFileSize:25 RemovePrivateKeyFromFiles:true}
base config file...
qkview-collect Details: Environment details &{IsDevVersion:false HostMode:false TLSCABundle:/etc/ssl/certs/ca-root-cert.pem TLSCertificateFile:/etc/ssl/certs/server-cert.pem TLSKeyFile:/etc/ssl/certs/server-key.pem TLSCertRetryWait:5s SecureOnly:true UsingCertOrchestrator:true ContainerName:f5-debug-sidecar GrpcPort:19892 MaxFileSize:25 BaseCfgPath:/etc/qkview-collect ContainerOverlayPath:/etc/qkview-collect/qkview-collect.config.yml TotalCollectionTimeout:-1s IndividualCmdTimeout:-1s Outfile:/tmp/qkview.tar.gz PkgType:container RemovePrivateKeyFromFiles:true}
base config file...
qkview-collect Info: Starting GRPC server in secured mode
qkview-collect Info: starting secure server
2024-08-29T22:28:17.452Z [DEBU] logging/logger.go:65 Logging Level: debug
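The failing read in the log above targets the cgroup v1 hugetlb path, /sys/fs/cgroup/hugetlb/hugetlb.2MB.limit_in_bytes; on nodes running the unified cgroup v2 hierarchy the equivalent file is hugetlb.2MB.max, so a reader that only knows the v1 location comes back empty-handed. A minimal sketch of a version-tolerant lookup (a hypothetical helper for illustration, not the actual mapres code):

```python
import os

# Candidate locations for the container's 2M hugetlb limit. The first path is
# the cgroup v1 file that mapres reads; the second is the cgroup v2 (unified
# hierarchy) equivalent. Both names are stated here as assumptions for
# illustration.
HUGETLB_2M_FILES = [
    "hugetlb/hugetlb.2MB.limit_in_bytes",  # cgroup v1
    "hugetlb.2MB.max",                     # cgroup v2
]

def read_hugetlb_2m_limit(cgroup_root="/sys/fs/cgroup"):
    """Return the 2M hugetlb limit in bytes, or None if unset/unreadable."""
    for rel in HUGETLB_2M_FILES:
        try:
            with open(os.path.join(cgroup_root, rel)) as f:
                raw = f.read().strip()
        except OSError:
            continue  # file absent on this cgroup version; try the next one
        if raw == "max":  # cgroup v2 spells "no limit" as the literal "max"
            return None
        return int(raw)
    return None
```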
This is the performance profile:
---
kind: PerformanceProfile
apiVersion: "performance.openshift.io/v2"
metadata:
  name: blueprint-profile
spec:
  cpu:
    isolated: "1-19,21-39,41-59,61-79"
    reserved: "0,40,20,60"
  additionalKernelArgs:
    - nohz_full=1-19,21-39,41-59,61-79
  hugepages:
    pages:
      - size: "1G"
        count: 32
        node: 0
      - size: "1G"
        count: 32
        node: 1
      - size: "2M"
        count: 12000
        node: 0
      - size: "2M"
        count: 12000
        node: 1
  realTimeKernel:
    enabled: false
  workloadHints:
    realTime: false
    highPowerConsumption: false
    perPodPowerManagement: true
  net:
    userLevelNetworking: false
  numa:
    topologyPolicy: "single-numa-node"
  nodeSelector:
    node-role.kubernetes.io/worker: ""
...
In the sosreport from the node we can see that the hugepages are allocated:
$ grep -i hugepages proc/meminfo
AnonHugePages:    198656 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:   24000
HugePages_Free:    24000
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
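Those numbers line up with the profile: 12000 2M pages on each of the two NUMA nodes gives HugePages_Total of 24000, and all of them are still free. A throwaway parser to sanity-check the arithmetic (hypothetical, not part of any tooling):

```python
# meminfo excerpt copied from the sosreport above
MEMINFO = """\
HugePages_Total:   24000
HugePages_Free:    24000
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
"""

def parse_meminfo(text):
    """Map 'Key: value [unit]' lines to integer values (kB where a unit is given)."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values

info = parse_meminfo(MEMINFO)
# 2 NUMA nodes x 12000 pages = 24000 pages of 2048 kB each
total_kb = info["HugePages_Total"] * info["Hugepagesize"]
# total_kb == 49152000 kB, i.e. ~46.9 GiB of 2M hugepages on the node
```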