OpenShift Bugs / OCPBUGS-25993

tuned operator keeps changing RSS queue length of the i40e ethernet adapter during cluster upgrade


    • Bug
    • Resolution: Duplicate
    • Major
    • None
    • 4.12
    • None
    • Important
    • No
    • 5
    • CNF Compute Sprint 248, CNF Compute Sprint 249, CNF Compute Sprint 250
    • 3
    • False
    • Customer Escalated

      Description of problem:

      During the cluster upgrade from 4.12.8 to 4.12.30, the tuned daemon pods keep changing the RSS queue length (queue count) of the i40e Ethernet adapter.
      
      The following kernel message was logged each time the RSS queue length changed:
      
      # dmesg | tail
      [5200411.339063] i40e 0000:63:00.0: User requested queue count/HW max RSS count:  48/64 
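
To correlate these events across nodes, the kernel message can be parsed for the requested and maximum queue counts. The following is a minimal Python sketch that assumes the message format shown above; the name `parse_i40e_rss_event` is hypothetical and is not part of tuned or any OpenShift tooling.

```python
import re

# Matches the i40e kernel message format shown in the dmesg output above.
I40E_RSS_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+i40e\s+(?P<pci>[0-9a-fA-F:.]+):\s+"
    r"User requested queue count/HW max RSS count:\s*(?P<req>\d+)/(?P<hw>\d+)"
)


def parse_i40e_rss_event(line):
    """Return the fields of an i40e RSS queue-count message, or None if no match."""
    m = I40E_RSS_RE.search(line)
    if m is None:
        return None
    return {
        "timestamp": float(m.group("ts")),   # seconds since boot
        "pci_address": m.group("pci"),       # PCI address of the adapter
        "requested": int(m.group("req")),    # queue count requested by the user
        "hw_max": int(m.group("hw")),        # maximum RSS count of the hardware
    }


sample = (
    "[5200411.339063] i40e 0000:63:00.0: "
    "User requested queue count/HW max RSS count:  48/64"
)
event = parse_i40e_rss_event(sample)
print(event)
```

Feeding each line of `dmesg` output through this function gives one record per queue-count change, which makes it easier to line the events up with the time the tuned pods were recreated.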
      
      
      Because of the Ethernet queue length change, the DPDK application stops responding completely. 
      
      Note that this issue is not seen when upgrading the cluster from 4.12.30 to 4.12.31. 
      
      We need to know why tuned attempts to change the RSS queue length during the cluster upgrade from 4.12.8 to 4.12.30. 
      
      We observed that when the tuned cluster operator is upgraded, it immediately recreates the tuned daemon pods from scratch.  

      Version-Release number of selected component (if applicable):

      $ oc get clusterversion
      NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.12.8    True        True          36m   
       
      
      

      How reproducible:

      - Make sure an i40e Ethernet adapter with VFs enabled is present on the system. 
      
      - Make sure the cluster is at version 4.12.8; it will be upgraded to 4.12.30. 
      
      - Apply the following performance profile as described in doc [1]: 
      
      apiVersion: performance.openshift.io/v2
      kind: PerformanceProfile
      metadata:
        name: manual
      spec:
        cpu:
          isolated: 3-51,54-103
          reserved: 0-2,52-54
        net:
          userLevelNetworking: true
        nodeSelector:
          node-role.kubernetes.io/worker-cnf: ""
      
      - Upgrade the cluster from 4.12.8 to 4.12.30. As soon as the tuned cluster operator is upgraded, the following kernel message appears on all worker nodes with SR-IOV VFs enabled: 
      
      # dmesg | tail 
      [5200411.339063] i40e 0000:63:00.0: User requested queue count/HW max RSS count: 48/64 
      
      - As per doc [2], this behavior occurs as soon as the Ethernet queue length is changed through the ethtool command. 
      
      
      userLevelNetworking is a required field specified as a boolean flag. If userLevelNetworking is true, the queue count is set to the reserved CPU count for all supported devices. The default is false.
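
Because the quoted documentation ties the queue count to the number of reserved CPUs, it can help to count the CPUs in the profile's `reserved` string. The following is a minimal Python sketch, assuming the usual Linux CPU list syntax of comma-separated IDs and inclusive ranges; `expand_cpu_list` is a hypothetical helper and not an OpenShift API.

```python
def expand_cpu_list(cpu_list):
    """Expand a CPU list string such as "0-2,52-54" into a set of CPU IDs."""
    cpus = set()
    for part in cpu_list.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas or whitespace
        if "-" in part:
            start, end = part.split("-", 1)
            cpus.update(range(int(start), int(end) + 1))  # ranges are inclusive
        else:
            cpus.add(int(part))
    return cpus


# The "reserved" value from the performance profile in this report.
reserved = expand_cpu_list("0-2,52-54")
print(len(reserved))  # 6
```

For the profile in this report the reserved set contains 6 CPUs, which can be compared against the requested queue count reported in the kernel message.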
      
      
      [1] https://docs.openshift.com/container-platform/4.12/scalability_and_performance/cnf-low-latency-tuning.html 
      
      [2] https://access.redhat.com/solutions/7050357

      Additional info:

              Martin Sivak (msivak@redhat.com)
              Ramesh Sahoo (rhn-support-rsahoo)
              Shereen Haj
              William Zhao