- Bug
- Resolution: Duplicate
- Major
- None
- 4.12
- None
- Important
- No
- 5
- CNF Compute Sprint 248, CNF Compute Sprint 249, CNF Compute Sprint 250
- 3
- False
- Customer Escalated
Description of problem:
It was identified that during the cluster upgrade from 4.12.8 to 4.12.30, the tuned daemon pods keep changing the RSS queue length of the i40e ethernet adapter. The following kernel event was triggered at the time of the RSS queue length change:

# dmesg | tail
[5200411.339063] i40e 0000:63:00.0: User requested queue count/HW max RSS count: 48/64

Because of the ethernet queue length change, the DPDK application stops responding completely. Important to note that such an issue is not seen while upgrading the cluster from 4.12.30 to 4.12.31.

We need to know why tuned attempts to change the RSS queue length during the cluster upgrade from 4.12.8 to 4.12.30. We observed that when the tuned cluster operator is upgraded, it immediately recreates the tuned daemon pod from scratch.
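For context, the change TuneD applies here is effectively an ethtool channel reconfiguration. A minimal sketch of reproducing the same kernel event by hand, assuming an i40e interface named "ens1f0" (the interface name and the count of 48 are examples matching the event above, not taken from this report):

# Setting the combined channel (RSS queue) count by hand triggers the same i40e message
$ ethtool -L ens1f0 combined 48
# The corresponding kernel event should appear immediately
$ dmesg | tail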
Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.8    True        True          36m
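For completeness, the upgrade in question can be started with the standard command, using the target version described in this report:

$ oc adm upgrade --to=4.12.30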
How reproducible:
- Make sure an i40e VF enabled ethernet adapter is present on the system.
- Make sure the cluster version is 4.12.8; this will be upgraded to 4.12.30.
- Apply the performance profile as per the doc [1]:

apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: manual
spec:
  cpu:
    isolated: 3-51,54-103
    reserved: 0-2,52-54
  net:
    userLevelNetworking: true
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""

- Upgrade the cluster from 4.12.8 to 4.12.30. As soon as the tuned cluster operator is upgraded, the following events appear on all the SR-IOV VF enabled worker nodes (a before/after check is sketched after this list):

# dmesg | tail
[5200411.339063] i40e 0000:63:00.0: User requested queue count/HW max RSS count: 48/64

- As per doc [2], such behavior occurs as soon as the ethernet queue length is changed through the ethtool command. userLevelNetworking is a required field specified as a boolean flag. If userLevelNetworking is true, the queue count is set to the reserved CPU count for all supported devices. The default is false.

[1] https://docs.openshift.com/container-platform/4.12/scalability_and_performance/cnf-low-latency-tuning.html
[2] https://access.redhat.com/solutions/7050357
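As referenced in the steps above, a minimal before/after check on an affected worker node could look like the following sketch; "ens1f0" is an example interface name, and the namespace is the standard one for the tuned daemon pods:

# Record the combined queue count before the upgrade ("ens1f0" is an example interface)
$ ethtool -l ens1f0 | grep -A4 "Current hardware settings"
# Watch the tuned daemon pods being recreated while the operator upgrades
$ oc get pods -n openshift-cluster-node-tuning-operator -w
# Watch for the i40e event on the node during the upgrade
$ dmesg -w | grep "queue count/HW max RSS count"
# Compare the combined queue count once the tuned operator has been upgraded
$ ethtool -l ens1f0 | grep -A4 "Current hardware settings"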
Additional info:
- relates to: OCPBUGS-15803 TuneD reverts node level profiles on termination (Closed)