- Bug
- Resolution: Duplicate
- Major
- None
- 4.12
- None
- Important
- No
- 5
- CNF Compute Sprint 248, CNF Compute Sprint 249, CNF Compute Sprint 250
- 3
- False
- Customer Escalated
Description of problem:
It was identified that during the cluster upgrade from 4.12.8 to 4.12.30, the tuned daemon pods keep changing the RSS queue length of the i40e ethernet adapter. The following kernel event was triggered at the time of the RSS queue length change:

# dmesg | tail
[5200411.339063] i40e 0000:63:00.0: User requested queue count/HW max RSS count: 48/64

Because of the ethernet queue length change, the DPDK application stops responding completely. Important to note that such an issue is not seen while upgrading the cluster from 4.12.30 to 4.12.31.

We need to know why tuned attempts to change the RSS queue length during the cluster upgrade from 4.12.8 to 4.12.30. We observed that when the tuned cluster operator is upgraded, it immediately recreates the tuned daemon pod from scratch.
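For context, the change TuneD applies here is effectively an ethtool channel reconfiguration. A minimal sketch of reproducing the same kernel event by hand, assuming an i40e interface named "ens1f0" (the interface name and the count of 48 are examples matching the event above, not taken from this report):

# Setting the combined channel (RSS queue) count by hand triggers the same i40e message
$ ethtool -L ens1f0 combined 48
# The corresponding kernel event should appear immediately
$ dmesg | tail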
Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.8    True        True          36m
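For completeness, the upgrade in question can be started with the standard command, using the target version described in this report:

$ oc adm upgrade --to=4.12.30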
How reproducible:
- Make sure an i40e VF enabled ethernet adapter is present on the system.
- Make sure the cluster version is 4.12.8; this will be upgraded to 4.12.30.
- Apply the performance profile as per the doc [1]:

apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: manual
spec:
  cpu:
    isolated: 3-51,54-103
    reserved: 0-2,52-54
  net:
    userLevelNetworking: true
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""

- Upgrade the cluster from 4.12.8 to 4.12.30. As soon as the tuned cluster operator is upgraded, the following events appear on all the SR-IOV VF enabled worker nodes (a before/after check is sketched after this list):

# dmesg | tail
[5200411.339063] i40e 0000:63:00.0: User requested queue count/HW max RSS count: 48/64

- As per doc [2], such behavior occurs as soon as the ethernet queue length is changed through the ethtool command. userLevelNetworking is a required field specified as a boolean flag. If userLevelNetworking is true, the queue count is set to the reserved CPU count for all supported devices. The default is false.

[1] https://docs.openshift.com/container-platform/4.12/scalability_and_performance/cnf-low-latency-tuning.html
[2] https://access.redhat.com/solutions/7050357
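As referenced in the steps above, a minimal before/after check on an affected worker node could look like the following sketch; "ens1f0" is an example interface name, and the namespace is the standard one for the tuned daemon pods:

# Record the combined queue count before the upgrade ("ens1f0" is an example interface)
$ ethtool -l ens1f0 | grep -A4 "Current hardware settings"
# Watch the tuned daemon pods being recreated while the operator upgrades
$ oc get pods -n openshift-cluster-node-tuning-operator -w
# Watch for the i40e event on the node during the upgrade
$ dmesg -w | grep "queue count/HW max RSS count"
# Compare the combined queue count once the tuned operator has been upgraded
$ ethtool -l ens1f0 | grep -A4 "Current hardware settings"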
Additional info:
- relates to: OCPBUGS-15803 TuneD reverts node level profiles on termination (Closed)