OpenShift Bugs / OCPBUGS-41934

tuned profile got degraded after node reboot

    • Quality / Stability / Reliability
    • False
    • None
    • Important
    • None
    • 2025-09-07: The latest RHEL (tuned) fix has significantly reduced the occurrence of the issue, but it is not fully solved. Because of that, and with the latest reproducer from a 4.18.19 testing environment, RHEL-60906 will be reopened to try to fully solve the issue.
      Moving this bug back to POST as dependent on RHEL-60906. Once the fix is done and verified on RHEL-60906, we will move this OCP bug back to ON_QA for robust verification on 4.18 and 4.16.

      2025-06-03: A fix for the amended tuned is delivered via FDP 25.C. Once it is built into NTO and included in OpenShift, we will move the bug back to ON_QA.

      2025-04-26: https://issues.redhat.com/browse/RHEL-88238 opened for TuneD-OCP handling, which is not covered by https://github.com/redhat-performance/tuned/pull/762 but uses the same concepts.

      2025-04-02: The previous attempt to solve this did not fully work, and the bug is still reproducible. A second effort has been made in tuned: https://github.com/redhat-performance/tuned/pull/762

      2024-12-02: This bug is kept to track and test RHEL-60906 once it lands in FDP and its matching OCP version.

    • None
    • None
    • Rejected
    • CNF Compute Sprint 260, CNF Compute Sprint 261, CNF Compute Sprint 262, CNF Compute Sprint 263, CNF Compute Sprint 264, CNF Compute Sprint 265, CNF Compute Sprint 266, CNF Compute Sprint 267, CNF Compute Sprint 268, CNF Compute Sprint 269, CNF Compute Sprint 270, CNF Compute Sprint 271, CNF Compute Sprint 272, CNF Compute Sprint 273, CNF Compute Sprint 274, CNF Compute Sprint 275, CNF Compute Sprint 276, CNF Compute Sprint 277
    • 18
    • Done
    • Known Issue
      * Currently, on clusters with SR-IOV network virtual functions configured, a race condition might occur between system services responsible for network device renaming and the TuneD service managed by the Node Tuning Operator. As a consequence, the TuneD profile might become degraded after the node restarts, leading to performance degradation. As a workaround, restart the TuneD pod to restore the profile state. (link:https://issues.redhat.com/browse/OCPBUGS-41934[*OCPBUGS-41934*])
    • None
    • None
    • None
    • None

      Description of problem:

      On an SNO node with the RAN profile enabled (many SriovNetworks), the tuned profile became degraded after the node rebooted, due to the error below:
      
      Message:               TuneD daemon issued one or more error message(s) during profile application. TuneD stderr:  ERROR    tuned.utils.commands: Executing 'ethtool -l ens2f0v1' error: netlink error: no device matches name (offset 24)

      Version-Release number of selected component (if applicable):

      4.16.10; possibly other 4.16 versions as well

      How reproducible:

      Looks like a race condition: the more PFs/VFs there are in the SriovNetwork, the more often the issue occurs.
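To illustrate the kind of race described here, the following is a minimal sketch of a guard that waits for a network device to appear in sysfs before a tool such as `ethtool` is run against it. It is not the fix made in tuned (RHEL-60906 or pull request 762); the function name `wait_for_netdev`, its parameters, and the polling approach are assumptions chosen purely for illustration.

```python
import os
import time


def wait_for_netdev(name, timeout=10.0, interval=0.2, sysfs_root="/sys/class/net"):
    """Return True once <sysfs_root>/<name> exists, False if the timeout expires first.

    Hypothetical helper: polls sysfs so that a caller does not query a VF
    (for example ens2f0v1) while it is still being created or renamed.
    """
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(os.path.join(sysfs_root, name)):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A caller would check `wait_for_netdev("ens2f0v1")` before querying the device and skip or retry when it returns False. The `sysfs_root` parameter exists only so the helper can be exercised against a temporary directory.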

      Steps to Reproduce:

          1. Install an SNO cluster with the RAN profile applied and userLevelNetworking enabled in the PerformanceProfile.
          2. Make sure many SriovNetworkNodePolicy objects are created on the cluster.
          3. Reboot the cluster and check the profile: oc get profile -A
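The check in step 3 can be scripted when the reboot is repeated many times. The sketch below assumes that each Profile object exposes `status.conditions` entries with a `type` of `Degraded` and a string `status` of `"True"` or `"False"`, which matches the DEGRADED column printed by `oc get profile -A`; the helper names are hypothetical.

```python
import json
import subprocess


def degraded_profiles(profile_list):
    """Return (namespace, name, message) for every Profile whose Degraded condition is True."""
    found = []
    for item in profile_list.get("items", []):
        meta = item.get("metadata", {})
        for cond in item.get("status", {}).get("conditions", []):
            # Assumption: condition status is the string "True", as in other Kubernetes conditions.
            if cond.get("type") == "Degraded" and cond.get("status") == "True":
                found.append((meta.get("namespace"), meta.get("name"), cond.get("message", "")))
    return found


def fetch_profiles():
    """Run `oc get profile -A -o json` and return the parsed list (needs a logged-in oc)."""
    result = subprocess.run(
        ["oc", "get", "profile", "-A", "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)
```

After each reboot, `degraded_profiles(fetch_profiles())` returns an empty list when no profile is degraded.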
          

      Actual results:

      Sometimes the profile becomes degraded. Running 'oc describe profile -A' shows an error like:
      
      Message:               TuneD daemon issued one or more error message(s) during profile application. TuneD stderr:  ERROR    tuned.utils.commands: Executing 'ethtool -l ens2f0v1' error: netlink error: no device matches name (offset 24)
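When many VFs are configured, it can help to pull the affected interface names out of the condition message. The following is a small sketch that assumes the message keeps the exact wording shown above; the function name `missing_devices` is hypothetical.

```python
import re

# Matches the ethtool failure as it appears in the TuneD stderr excerpt above.
# The interface name is taken to be the last argument of the ethtool command.
_NO_DEVICE = re.compile(
    r"Executing 'ethtool (?:[^']*\s)?([^'\s]+)' error: "
    r"netlink error: no device matches name"
)


def missing_devices(message):
    """Return the sorted, de-duplicated interface names that ethtool could not find."""
    return sorted(set(_NO_DEVICE.findall(message)))
```

Applied to the message above, the function reports `ens2f0v1` as the only missing device.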

       

      Expected results:

      Profile should not be degraded

      Additional info:

      Restarting the tuned pod cleared the issue.
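A minimal sketch of that workaround follows. It assumes the tuned pods run in the `openshift-cluster-node-tuning-operator` namespace and carry the label `openshift-app=tuned`; verify both on the target cluster before relying on them. The helper names are hypothetical. Deleting the pod lets its DaemonSet recreate it, which re-applies the profile.

```python
import subprocess

NTO_NAMESPACE = "openshift-cluster-node-tuning-operator"  # assumed namespace
TUNED_POD_SELECTOR = "openshift-app=tuned"  # assumed pod label


def restart_tuned_cmd(node=None):
    """Build the `oc delete pod` command; restrict it to one node if a name is given."""
    cmd = ["oc", "delete", "pod", "-n", NTO_NAMESPACE, "-l", TUNED_POD_SELECTOR]
    if node:
        cmd += ["--field-selector", f"spec.nodeName={node}"]
    return cmd


def restart_tuned(node=None):
    """Delete the tuned pod(s) so they are recreated (needs a logged-in oc)."""
    subprocess.run(restart_tuned_cmd(node), check=True)
```

On an SNO cluster there is a single tuned pod, so `restart_tuned()` without a node name is sufficient.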
      
      The SR-IOV CRs needed to reproduce the issue are attached: https://drive.google.com/file/d/1qIjF-fXJeBcu_esp8P-kT9RQ6OJ6Zp_i/view?usp=drive_link

              yquinn@redhat.com Yanir Quinn
              bzhai@redhat.com XIAOBO ZHAI
              None
              None
              Mallapadi Niranjan Mallapadi Niranjan
              None
              Votes: 0
              Watchers: 24