-
Bug
-
Resolution: Done-Errata
-
Undefined
-
None
-
4.18.0, 4.19
-
None
Description of problem:
Applying a performance profile on an ARM cluster causes the TuneD profile on the targeted node to become degraded.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Label a worker node with a worker-cnf label
2. Create an mcp referring to that label
3. Apply the below performance profile

apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: performance
spec:
  cpu:
    isolated: "1-3,4-6"
    reserved: "0,7"
  hugepages:
    defaultHugepagesSize: 512M
    pages:
    - count: 1
      node: 0
      size: 512M
    - count: 128
      node: 1
      size: 2M
  machineConfigPoolSelector:
    machineconfiguration.openshift.io/role: worker-cnf
  net:
    userLevelNetworking: true
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ''
  kernelPageSize: 64k
  numa:
    topologyPolicy: single-numa-node
  realTimeKernel:
    enabled: false
  workloadHints:
    highPowerConsumption: true
    perPodPowerManagement: false
    realTime: true
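For reference, the isolated set "1-3,4-6" in this profile is the same CPU set that appears as isolated_cores=1-6 in the rendered tuned.conf under Additional info. A minimal Python sketch of that normalization, plus a check that the isolated and reserved sets do not overlap. The helper names parse_cpu_list and format_cpu_list are illustrative and not part of any OpenShift tooling:

```python
def parse_cpu_list(spec: str) -> set[int]:
    """Expand a CPU list such as "1-3,4-6" into a set of CPU ids."""
    cpus: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = (int(x) for x in part.split("-", 1))
            cpus.update(range(lo, hi + 1))
        else:
            cpus.add(int(part))
    return cpus


def format_cpu_list(cpus: set[int]) -> str:
    """Collapse CPU ids into the shortest range form, e.g. {1..6} -> "1-6"."""
    ordered = sorted(cpus)
    ranges = []
    i = 0
    while i < len(ordered):
        start = end = ordered[i]
        # Extend the current range while the next id is consecutive.
        while i + 1 < len(ordered) and ordered[i + 1] == end + 1:
            i += 1
            end = ordered[i]
        ranges.append(str(start) if start == end else f"{start}-{end}")
        i += 1
    return ",".join(ranges)


isolated = parse_cpu_list("1-3,4-6")
reserved = parse_cpu_list("0,7")
print(format_cpu_list(isolated))      # 1-6
print(isolated.isdisjoint(reserved))  # True
```

The two adjacent ranges collapse into one, so the CPU partitioning in the profile itself looks consistent with what the operator rendered.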
Actual results:
Expected results:
Additional info:
[root@ampere-one-x-04 ~]# oc get profiles -A
NAMESPACE                                NAME                                             TUNED                                    APPLIED   DEGRADED   MESSAGE                                                            AGE
openshift-cluster-node-tuning-operator   ocp-ctlplane-0.libvirt.lab.eng.tlv2.redhat.com   openshift-control-plane                  True      False      TuneD profile applied.                                             22h
openshift-cluster-node-tuning-operator   ocp-ctlplane-1.libvirt.lab.eng.tlv2.redhat.com   openshift-control-plane                  True      False      TuneD profile applied.                                             22h
openshift-cluster-node-tuning-operator   ocp-ctlplane-2.libvirt.lab.eng.tlv2.redhat.com   openshift-control-plane                  True      False      TuneD profile applied.                                             22h
openshift-cluster-node-tuning-operator   ocp-worker-0.libvirt.lab.eng.tlv2.redhat.com     openshift-node-performance-performance   False     True       The TuneD daemon profile not yet applied, or application failed.   22h
openshift-cluster-node-tuning-operator   ocp-worker-1.libvirt.lab.eng.tlv2.redhat.com     openshift-node                           True      False      TuneD profile applied.                                             22h
openshift-cluster-node-tuning-operator   ocp-worker-2.libvirt.lab.eng.tlv2.redhat.com     openshift-node                           True      False      TuneD profile applied.                                             22h
[root@ampere-one-x-04 ~]# oc describe performanceprofile
Name: performance
Namespace:
Labels: <none>
Annotations: <none>
API Version: performance.openshift.io/v2
Kind: PerformanceProfile
Metadata:
  Creation Timestamp: 2025-03-04T15:28:44Z
  Finalizers:
    foreground-deletion
  Generation: 1
  Resource Version: 74234
  UID: 0d9c1817-c12f-4ea8-9c4b-b37badc232e9
Spec:
  Cpu:
    Isolated: 1-3,4-6
    Reserved: 0,7
  Hugepages:
    Default Hugepages Size: 512M
    Pages:
      Count: 1
      Node: 0
      Size: 512M
      Count: 128
      Node: 1
      Size: 2M
  Kernel Page Size: 64k
  Machine Config Pool Selector:
    machineconfiguration.openshift.io/role: worker-cnf
  Net:
    User Level Networking: true
  Node Selector:
    node-role.kubernetes.io/worker-cnf:
  Numa:
    Topology Policy: single-numa-node
  Real Time Kernel:
    Enabled: false
  Workload Hints:
    High Power Consumption: true
    Per Pod Power Management: false
    Real Time: true
Status:
  Conditions:
    Last Heartbeat Time: 2025-03-04T15:28:45Z
    Last Transition Time: 2025-03-04T15:28:45Z
    Status: False
    Type: Available
    Last Heartbeat Time: 2025-03-04T15:28:45Z
    Last Transition Time: 2025-03-04T15:28:45Z
    Status: False
    Type: Upgradeable
    Last Heartbeat Time: 2025-03-04T15:28:45Z
    Last Transition Time: 2025-03-04T15:28:45Z
    Status: False
    Type: Progressing
    Last Heartbeat Time: 2025-03-04T15:28:45Z
    Last Transition Time: 2025-03-04T15:28:45Z
    Message: Tuned ocp-worker-0.libvirt.lab.eng.tlv2.redhat.com Degraded Reason: TunedError.
      Tuned ocp-worker-0.libvirt.lab.eng.tlv2.redhat.com Degraded Message: TuneD daemon issued one or more error message(s) during profile application. TuneD stderr: .
    Reason: TunedProfileDegraded
    Status: True
    Type: Degraded
  Runtime Class: performance-performance
  Tuned: openshift-cluster-node-tuning-operator/openshift-node-performance-performance
Events:
  Type    Reason              Age                 From                            Message
  ----    ------              ----                ----                            -------
  Normal  Creation succeeded  112m (x9 over 17h)  performance-profile-controller  Succeeded to create all components
[root@ampere-one-x-04 ~]# oc logs pod/tuned-kjc8j
I0304 15:35:50.346412 3259 controller.go:1666] starting in-cluster ocp-tuned v4.19.0-202502262344.p0.gf166846.assembly.stream.el9-0-g0d9dd16-dirty
I0304 15:35:50.401840 3259 controller.go:671] writing /var/lib/ocp-tuned/image.env
I0304 15:35:50.418669 3259 controller.go:702] tunedRecommendFileRead(): read "openshift-node-performance-performance" from "/etc/tuned/recommend.d/50-openshift.conf"
I0304 15:35:50.419585 3259 controller.go:1728] starting: profile unpacked is "openshift-node-performance-performance" fingerprint "ab0d99d8009d6539b91ed1aeff3e4fa1c629c1cd4e9a32bdc132dcc9737e4fc9"
I0304 15:35:50.419646 3259 controller.go:1424] recover: no pending deferred change
I0304 15:35:50.419666 3259 controller.go:1734] starting: no pending deferred update
I0304 15:36:06.074575 3259 controller.go:382] disabling system tuned...
I0304 15:36:06.121045 3259 controller.go:1546] started events processors
I0304 15:36:06.121492 3259 controller.go:359] set log level 0
I0304 15:36:06.121850 3259 controller.go:1567] monitoring filesystem events on "/etc/tuned/bootcmdline"
I0304 15:36:06.121886 3259 controller.go:1570] started controller
I0304 15:36:06.122603 3259 controller.go:692] tunedRecommendFileWrite(): written "/etc/tuned/recommend.d/50-openshift.conf" to set TuneD profile openshift-node-performance-performance
I0304 15:36:06.122634 3259 controller.go:417] profilesExtract(): extracting 6 TuneD profiles (recommended=openshift-node-performance-performance)
I0304 15:36:06.210862 3259 controller.go:462] profilesExtract(): recommended TuneD profile openshift-node-performance-performance content unchanged [openshift]
I0304 15:36:06.211950 3259 controller.go:462] profilesExtract(): recommended TuneD profile openshift-node-performance-performance content unchanged [openshift-node-performance-performance]
I0304 15:36:06.212311 3259 controller.go:478] profilesExtract(): fingerprint of extracted profiles: "ab0d99d8009d6539b91ed1aeff3e4fa1c629c1cd4e9a32bdc132dcc9737e4fc9"
I0304 15:36:06.212389 3259 controller.go:818] tunedReload()
I0304 15:36:06.212493 3259 controller.go:745] starting tuned...
I0304 15:36:06.212547 3259 run.go:121] running cmd...
2025-03-04 15:36:06,335 INFO tuned.daemon.application: TuneD: 2.25.1, kernel: 5.14.0-570.el9.aarch64+64k
2025-03-04 15:36:06,335 INFO tuned.daemon.application: dynamic tuning is globally disabled
2025-03-04 15:36:06,340 INFO tuned.daemon.daemon: using sleep interval of 1 second(s)
2025-03-04 15:36:06,340 INFO tuned.daemon.daemon: Running in automatic mode, checking what profile is recommended for your configuration.
2025-03-04 15:36:06,341 INFO tuned.daemon.daemon: Using 'openshift-node-performance-performance' profile
2025-03-04 15:36:06,342 INFO tuned.profiles.loader: loading profile: openshift-node-performance-performance
2025-03-04 15:36:06,460 ERROR tuned.daemon.daemon: Cannot set initial profile. No tunings will be enabled: Cannot load profile(s) 'openshift-node-performance-performance': Cannot find profile 'openshift-node-performance--aarch64-performance' in '['/var/lib/ocp-tuned/profiles', '/usr/lib/tuned', '/usr/lib/tuned/profiles']'.
2025-03-04 15:36:06,461 INFO tuned.daemon.controller: starting controller
sh-5.1# systemctl --no-pager | grep hugepages
dev-hugepages.mount loaded active mounted Huge Pages File System
● hugepages-allocation-2048kB-NUMA1.service loaded failed failed Hugepages-2048kB allocation on the node 1
hugepages-allocation-524288kB-NUMA0.service loaded active exited Hugepages-524288kB allocation on the node 0
sh-5.1# systemctl status hugepages-allocation-2048kB-NUMA1.service
× hugepages-allocation-2048kB-NUMA1.service - Hugepages-2048kB allocation on the node 1
Loaded: loaded (/etc/systemd/system/hugepages-allocation-2048kB-NUMA1.service; enabled; preset: disabled)
Active: failed (Result: exit-code) since Tue 2025-03-04 15:32:33 UTC; 17h ago
Main PID: 1002 (code=exited, status=1/FAILURE)
CPU: 6ms
Mar 04 15:32:33 ocp-worker-0.libvirt.lab.eng.tlv2.redhat.com systemd[1]: Starting Hugepages-2048kB allocation on the node 1...
Mar 04 15:32:33 ocp-worker-0.libvirt.lab.eng.tlv2.redhat.com hugepages-allocation.sh[1002]: ERROR: /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages does not exist
Mar 04 15:32:33 ocp-worker-0.libvirt.lab.eng.tlv2.redhat.com systemd[1]: hugepages-allocation-2048kB-NUMA1.service: Main process exited, code=exited, status=1/FAILURE
Mar 04 15:32:33 ocp-worker-0.libvirt.lab.eng.tlv2.redhat.com systemd[1]: hugepages-allocation-2048kB-NUMA1.service: Failed with result 'exit-code'.
Mar 04 15:32:33 ocp-worker-0.libvirt.lab.eng.tlv2.redhat.com systemd[1]: Failed to start Hugepages-2048kB allocation on the node 1.
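To see which hugepage sizes each NUMA node actually exposes, the sysfs tree named in the error can be walked directly. A minimal Python sketch, assuming the /sys/devices/system/node layout shown in the log above. The function names and the sysfs_root parameter are illustrative, and the size conversion assumes PerformanceProfile sizes end in a single K, M, or G suffix:

```python
from pathlib import Path


def size_to_kb(size: str) -> int:
    """Convert a hugepage size such as "2M", "512M" or "1G" to kB."""
    units = {"K": 1, "M": 1024, "G": 1024 * 1024}
    return int(size[:-1]) * units[size[-1].upper()]


def nr_hugepages_path(node: int, size: str, sysfs_root: str = "/sys") -> Path:
    """Build the per-node nr_hugepages path in the form the failing unit reports."""
    return (Path(sysfs_root) / "devices/system/node" / f"node{node}"
            / "hugepages" / f"hugepages-{size_to_kb(size)}kB" / "nr_hugepages")


def available_sizes(sysfs_root: str = "/sys") -> dict[str, list[str]]:
    """Map each NUMA node directory to the hugepage size directories it exposes."""
    base = Path(sysfs_root) / "devices/system/node"
    result = {}
    for node_dir in sorted(base.glob("node[0-9]*")):
        sizes = sorted(p.name for p in (node_dir / "hugepages").glob("hugepages-*kB"))
        result[node_dir.name] = sizes
    return result


if __name__ == "__main__":
    # The two (node, size) pairs requested by the performance profile in this report.
    for node, size in [(0, "512M"), (1, "2M")]:
        path = nr_hugepages_path(node, size)
        print(path, "exists" if path.exists() else "MISSING")
    print(available_sizes())
```

Run on the affected worker (for example from a debug shell with Python available), this would show whether node1 is absent entirely or only lacks the hugepages-2048kB directory; the report does not say which.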
sh-5.1# cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt3)/boot/ostree/rhcos-e032e3de5cffeccaf88bc5dc1945da35b4273c5f5b758a6ca1d0d78344b55e7f/vmlinuz-5.14.0-570.el9.aarch64+64k rw ostree=/ostree/boot.0/rhcos/e032e3de5cffeccaf88bc5dc1945da35b4273c5f5b758a6ca1d0d78344b55e7f/0 ignition.platform.id=openstack console=ttyAMA0,115200n8 console=tty0 root=UUID=96763b3b-e217-4879-a03e-56568ca84bf9 rw rootflags=prjquota boot=UUID=d98055a6-2355-40d3-8e87-98eedd0e8c91 systemd.unified_cgroup_hierarchy=1 cgroup_no_v1=all psi=0
bash-5.1# ls /var/lib/ocp-tuned/profiles/
openshift                                            openshift-node-performance-intel-x86-performance
openshift-node-performance-amd-x86-performance       openshift-node-performance-performance
openshift-node-performance-arm-aarch64-performance   openshift-node-performance-rt-performance
bash-5.1# cat /var/lib/ocp-tuned/profiles/openshift-node-performance-performance/tuned.conf
[main]
summary=Openshift node optimized for deterministic performance at the cost of increased power consumption, focused on low latency network performance. Based on Tuned 2.11 and Cluster node tuning (oc 4.5)

# The final result of the include depends on cpu vendor, cpu architecture, and whether the real time kernel is enabled
# The first line will be evaluated based on the CPU vendor and architecture
# This has three possible results:
# include=openshift-node-performance-amd-x86;
# include=openshift-node-performance-arm-aarch64;
# include=openshift-node-performance-intel-x86;
# The second line will be evaluated based on whether the real time kernel is enabled
# This has two possible results:
# openshift-node,cpu-partitioning
# openshift-node,cpu-partitioning,openshift-node-performance-rt-<PerformanceProfile name>
include=openshift-node,cpu-partitioning${f:regex_search_ternary:${f:exec:uname:-r}:rt:,openshift-node-performance-rt-performance:}; openshift-node-performance-${f:lscpu_check:Vendor ID\:\s*GenuineIntel:intel:Vendor ID\:\s*AuthenticAMD:amd:Vendor ID\:\s*ARM:arm}-${f:lscpu_check:Architecture\:\s*x86_64:x86:Architecture\:\s*aarch64:aarch64}-performance

# Inheritance of base profiles legend:
# cpu-partitioning -> network-latency -> latency-performance
# https://github.com/redhat-performance/tuned/blob/master/profiles/latency-performance/tuned.conf
# https://github.com/redhat-performance/tuned/blob/master/profiles/network-latency/tuned.conf
# https://github.com/redhat-performance/tuned/blob/master/profiles/cpu-partitioning/tuned.conf
# All values are mapped with a comment where a parent profile contains them.
# Different values will override the original values in parent profiles.

[variables]
#> isolated_cores take a list of ranges; e.g. isolated_cores=2,4-7
isolated_cores=1-6
not_isolated_cores_expanded=${f:cpulist_invert:${isolated_cores_expanded}}

[cpu]
#> latency-performance
#> (override)
force_latency=cstate.id:1|3
governor=performance
energy_perf_bias=performance
min_perf_pct=100

[service]
service.stalld=start,enable

[vm]
#> network-latency
transparent_hugepages=never

[irqbalance]
# Disable the plugin entirely, which was enabled by the parent profile `cpu-partitioning`.
# It can be racy if TuneD restarts for whatever reason.
#> cpu-partitioning
enabled=false

[scheduler]
runtime=0
group.ksoftirqd=0:f:11:*:ksoftirqd.*
group.rcuc=0:f:11:*:rcuc.*
group.ktimers=0:f:11:*:ktimers.*
default_irq_smp_affinity = ignore
irq_process=false

[sysctl]
#> cpu-partitioning #RealTimeHint
kernel.hung_task_timeout_secs=600
#> cpu-partitioning #RealTimeHint
kernel.nmi_watchdog=0
#> RealTimeHint
kernel.sched_rt_runtime_us=-1
#> cpu-partitioning #RealTimeHint
vm.stat_interval=10
# cpu-partitioning and RealTimeHint for RHEL disable it (= 0)
# OCP is too dynamic when partitioning and needs to evacuate
#> scheduled timers when starting a guaranteed workload (= 1)
kernel.timer_migration=1
#> network-latency
net.ipv4.tcp_fastopen=3
# If a workload mostly uses anonymous memory and it hits this limit, the entire
# working set is buffered for I/O, and any more write buffering would require
# swapping, so it's time to throttle writes until I/O can catch up. Workloads
# that mostly use file mappings may be able to use even higher values.
#
# The generator of dirty data starts writeback at this percentage (system default
# is 20%)
#> latency-performance
vm.dirty_ratio=10
# Start background writeback (via writeback threads) at this percentage (system
# default is 10%)
#> latency-performance
vm.dirty_background_ratio=3
# The swappiness parameter controls the tendency of the kernel to move
# processes out of physical memory and onto the swap disk.
# 0 tells the kernel to avoid swapping processes out of physical memory
# for as long as possible
# 100 tells the kernel to aggressively swap processes out of physical memory
# and move them to swap cache
#> latency-performance
vm.swappiness=10
# also configured via a sysctl.d file
# placed here for documentation purposes and commented out due
# to a tuned logging bug complaining about duplicate sysctl:
# https://issues.redhat.com/browse/RHEL-18972
#> rps configuration
# net.core.rps_default_mask=${not_isolated_cpumask}

[selinux]
#> Custom (atomic host)
avc_cache_threshold=8192

[net]
channels=combined 2
nf_conntrack_hashsize=131072

[bootloader]
# !! The names are important for Intel and are referenced in openshift-node-performance-intel-x86
# set empty values to disable RHEL initrd setting in cpu-partitioning
initrd_remove_dir=
initrd_dst_img=
initrd_add_dir=
# overrides cpu-partitioning cmdline
cmdline_cpu_part=+nohz=on rcu_nocbs=${isolated_cores} tuned.non_isolcpus=${not_isolated_cpumask} systemd.cpu_affinity=${not_isolated_cores_expanded}
# No default value but will be composed conditionally based on platform
cmdline_iommu=
cmdline_isolation=+isolcpus=managed_irq,${isolated_cores}
cmdline_realtime_nohzfull=+nohz_full=${isolated_cores}
cmdline_realtime_nosoftlookup=+nosoftlockup
cmdline_realtime_common=+skew_tick=1 rcutree.kthread_prio=11
# No default value but will be composed conditionally based on platform
cmdline_power_performance=
# No default value but will be composed conditionally based on platform
cmdline_idle_poll=

[rtentsk]
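The double hyphen in the missing profile name from the TuneD log ('openshift-node-performance--aarch64-performance') is consistent with the vendor lookup in the include= line expanding to an empty string. A rough Python sketch of that expansion, assuming lscpu_check returns the string paired with the first matching regex and an empty string when nothing matches and no fallback is given. This approximates the TuneD built-in and is not its actual code; resolve_include and both lscpu excerpts are hypothetical, since the real lscpu output of the node is not included in this report:

```python
import re


def lscpu_check(lscpu_output: str, *args: str) -> str:
    """Return the string paired with the first regex that matches.

    Arguments alternate REGEX, STR, REGEX, STR, ... with an optional trailing
    fallback string; with no match and no fallback the result is empty.
    """
    for pattern, value in zip(args[0::2], args[1::2]):
        if re.search(pattern, lscpu_output, re.MULTILINE):
            return value
    return args[-1] if len(args) % 2 == 1 else ""


def resolve_include(lscpu_output: str, profile_name: str = "performance") -> str:
    """Expand the vendor/architecture part of the include= line from tuned.conf."""
    vendor = lscpu_check(
        lscpu_output,
        r"Vendor ID:\s*GenuineIntel", "intel",
        r"Vendor ID:\s*AuthenticAMD", "amd",
        r"Vendor ID:\s*ARM", "arm",
    )
    arch = lscpu_check(
        lscpu_output,
        r"Architecture:\s*x86_64", "x86",
        r"Architecture:\s*aarch64", "aarch64",
    )
    return f"openshift-node-performance-{vendor}-{arch}-{profile_name}"


# Hypothetical lscpu excerpts (assumptions, not taken from the affected node).
ARM_VENDOR = "Architecture:   aarch64\nVendor ID:      ARM\n"
OTHER_VENDOR = "Architecture:   aarch64\nVendor ID:      APM\n"

print(resolve_include(ARM_VENDOR))    # openshift-node-performance-arm-aarch64-performance
print(resolve_include(OTHER_VENDOR))  # openshift-node-performance--aarch64-performance
```

With a Vendor ID other than GenuineIntel, AuthenticAMD, or ARM, the vendor slot stays empty and the resulting name does not match any directory under /var/lib/ocp-tuned/profiles, which matches the "Cannot find profile" error above.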
clones: OCPBUGS-52352 Tuned profile degraded in ARM on Vendor Id not matching Ampere (APM) (Closed)
depends on: OCPBUGS-52352 Tuned profile degraded in ARM on Vendor Id not matching Ampere (APM) (Closed)
links to: RHBA-2025:2705 OpenShift Container Platform 4.18.z bug fix update