OpenShift Bugs
OCPBUGS-56767

Node Tuning operator fails to start on one or two nodes


    • Type: Bug
    • Resolution: Duplicate
    • Priority: Normal
    • Affects Version: 4.19.0
    • Component: Node Tuning Operator
    • Category: Quality / Stability / Reliability
    • Severity: Moderate

      Description of problem:

          When I reinstall 4.19.0-rc.3 on my 3-node bare-metal cluster, one or two tuned pods always fail to start, as shown below:
      # oc get clusteroperator
      NAME                                       VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
      authentication                             4.19.0-rc.3   True        False         False      12m     
      baremetal                                  4.19.0-rc.3   True        False         False      31m     
      cloud-controller-manager                   4.19.0-rc.3   True        False         False      33m     
      cloud-credential                           4.19.0-rc.3   True        False         False      40m     
      cluster-autoscaler                         4.19.0-rc.3   True        False         False      31m     
      config-operator                            4.19.0-rc.3   True        False         False      32m     
      console                                    4.19.0-rc.3   True        False         False      17m     
      control-plane-machine-set                  4.19.0-rc.3   True        False         False      31m     
      csi-snapshot-controller                    4.19.0-rc.3   True        False         False      31m     
      dns                                        4.19.0-rc.3   True        False         False      31m     
      etcd                                       4.19.0-rc.3   True        False         False      30m     
      image-registry                             4.19.0-rc.3   True        False         False      18m     
      ingress                                    4.19.0-rc.3   True        False         False      21m     
      insights                                   4.19.0-rc.3   True        False         False      31m     
      kube-apiserver                             4.19.0-rc.3   True        False         False      27m     
      kube-controller-manager                    4.19.0-rc.3   True        False         False      27m     
      kube-scheduler                             4.19.0-rc.3   True        False         False      29m     
      kube-storage-version-migrator              4.19.0-rc.3   True        False         False      32m     
      machine-api                                4.19.0-rc.3   True        False         False      28m     
      machine-approver                           4.19.0-rc.3   True        False         False      32m     
      machine-config                             4.19.0-rc.3   True        False         False      30m     
      marketplace                                4.19.0-rc.3   True        False         False      31m     
      monitoring                                 4.19.0-rc.3   True        False         False      13m     
      network                                    4.19.0-rc.3   True        False         False      32m     
      node-tuning                                4.19.0-rc.3   True        True          False      14m     Waiting for 1/3 Profiles to be applied
      olm                                        4.19.0-rc.3   True        False         False      31m     
      openshift-apiserver                        4.19.0-rc.3   True        False         False      22m     
      openshift-controller-manager               4.19.0-rc.3   True        False         False      27m     
      openshift-samples                          4.19.0-rc.3   True        False         False      21m     
      operator-lifecycle-manager                 4.19.0-rc.3   True        False         False      31m     
      operator-lifecycle-manager-catalog         4.19.0-rc.3   True        False         False      31m     
      operator-lifecycle-manager-packageserver   4.19.0-rc.3   True        False         False      22m     
      service-ca                                 4.19.0-rc.3   True        False         False      32m     
      storage                                    4.19.0-rc.3   True        False         False      32m  
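
      For reference, the per-node Profile objects should show which node is stuck (a sketch using the standard NTO namespace and Profile CRD; output will differ per cluster):
      # oc get profiles.tuned.openshift.io -n openshift-cluster-node-tuning-operator
      # oc get pods -n openshift-cluster-node-tuning-operator -o wide
      The Profile that is not yet applied names the failing node, and the wide pod listing maps that node to its tuned pod.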
      Here is the pod log from the failed pod:
      I0528 00:27:28.613490   10977 controller.go:1667] starting in-cluster ocp-tuned v4.19.0-202505140813.p0.g7172669.assembly.stream.el9-0-g80d720b-dirty
      I0528 00:27:28.658393   10977 controller.go:671] writing /var/lib/ocp-tuned/image.env
      E0528 00:27:28.661083   10977 controller.go:1712] error repacking the profile: open /etc/tuned/recommend.d/50-openshift.conf: no such file or directory
      I0528 00:27:28.661099   10977 controller.go:1713] deferred updates likely broken
      I0528 00:27:28.661106   10977 controller.go:1729] starting: profile unpacked is "" fingerprint "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
      I0528 00:27:28.661117   10977 controller.go:1425] recover: no pending deferred change
      I0528 00:27:28.661123   10977 controller.go:1735] starting: no pending deferred update
      I0528 00:27:28.669889   10977 controller.go:382] disabling system tuned...
      I0528 00:27:28.762013   10977 controller.go:1547] started events processors
      I0528 00:27:28.762063   10977 controller.go:1568] monitoring filesystem events on "/etc/tuned/bootcmdline"
      I0528 00:27:28.762070   10977 controller.go:1571] started controller
      I0528 00:27:28.762128   10977 controller.go:359] set log level 0
      I0528 00:27:28.762197   10977 controller.go:614] providerExtract(): extracting cloud provider name to /var/lib/ocp-tuned/provider
      I0528 00:27:28.762359   10977 controller.go:692] tunedRecommendFileWrite(): written "/etc/tuned/recommend.d/50-openshift.conf" to set TuneD profile openshift-control-plane
      I0528 00:27:28.762368   10977 controller.go:417] profilesExtract(): extracting 1 TuneD profiles (recommended=openshift-control-plane)
      I0528 00:27:28.809735   10977 controller.go:462] profilesExtract(): recommended TuneD profile openshift-control-plane content changed [openshift]
      I0528 00:27:28.809870   10977 controller.go:478] profilesExtract(): fingerprint of extracted profiles: "3d0c4b179e02d27e7c0c64d8a4cfe6b3e85ae111f1d1202b6362f4bb92dbc627"
      I0528 00:27:28.809914   10977 controller.go:818] tunedReload()
      I0528 00:27:28.809939   10977 controller.go:745] starting tuned...
      I0528 00:27:28.809948   10977 run.go:121] running cmd...
      2025-05-28 00:27:28,889 INFO     tuned.daemon.application: TuneD: 2.25.1, kernel: 5.14.0-570.16.1.el9_6.x86_64
      2025-05-28 00:27:28,889 INFO     tuned.daemon.application: dynamic tuning is globally disabled
      2025-05-28 00:27:28,891 INFO     tuned.daemon.daemon: using sleep interval of 1 second(s)
      2025-05-28 00:27:28,892 INFO     tuned.daemon.daemon: Running in automatic mode, checking what profile is recommended for your configuration.
      2025-05-28 00:27:28,892 INFO     tuned.daemon.daemon: Using 'openshift-control-plane' profile
      2025-05-28 00:27:28,893 INFO     tuned.profiles.loader: loading profile: openshift-control-plane
      2025-05-28 00:27:28,948 INFO     tuned.daemon.controller: starting controller
      2025-05-28 00:27:28,948 INFO     tuned.daemon.controller: waiting for udev to settle
      Traceback (most recent call last):
        File "/usr/sbin/tuned", line 98, in <module>
          app.run(args.daemon)
        File "/usr/lib/python3.9/site-packages/tuned/daemon/application.py", line 215, in run
          result = self._controller.run()
        File "/usr/lib/python3.9/site-packages/tuned/daemon/controller.py", line 68, in run
          p = monitor.poll(timeout = 1)
        File "/usr/lib/python3.9/site-packages/pyudev/monitor.py", line 354, in poll
          if eintr_retry_call(poll.Poll.for_events((self, 'r')).poll, timeout):
        File "/usr/lib/python3.9/site-packages/pyudev/_util.py", line 159, in eintr_retry_call
          return func(*args, **kwargs)
        File "/usr/lib/python3.9/site-packages/pyudev/_os/poll.py", line 94, in poll
          return list(
        File "/usr/lib/python3.9/site-packages/pyudev/_os/poll.py", line 110, in _parse_events
          raise IOError('Error while polling fd: {0!r}'.format(fd))
      OSError: Error while polling fd: 4
      E0528 00:27:29.390415   10977 controller.go:763] Error while running tuned error waiting for tuned: exit status 1
      I0528 00:37:24.006916   10977 controller.go:359] set log level 0
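
      From the sequence above, it looks like the controller first fails to repack the profile because /etc/tuned/recommend.d/50-openshift.conf does not exist yet, then writes that file and starts TuneD, which immediately dies in pyudev's poll while waiting for udev to settle. To pull the same log from a stuck pod (the pod name is a placeholder):
      # oc logs -n openshift-cluster-node-tuning-operator <tuned-pod>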
       
      
      

      The workaround I found is to delete the failing pod; the replacement pod then starts without issue. I'm also attaching [^tuned-jv89v-tuned.log], which is from a good pod for comparison.
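
      In command form, the workaround is just deleting the stuck pod so the DaemonSet recreates it (a sketch: openshift-app=tuned is the default label on the tuned DaemonSet, and the node name is a placeholder):
      # oc delete pod -n openshift-cluster-node-tuning-operator -l openshift-app=tuned --field-selector spec.nodeName=<failing-node>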

       

      Version-Release number of selected component (if applicable):

          4.19.0-rc.3

      How reproducible:

          Always so far; every reinstall of 4.19.0-rc.3 on this cluster has left one or two tuned pods failing to start.

      Steps to Reproduce:

          1. Install (or reinstall) 4.19.0-rc.3 on a 3-node bare-metal cluster.
          2. Run oc get clusteroperator and watch the node-tuning operator.
          3. Check the tuned pods in the openshift-cluster-node-tuning-operator namespace.

      Actual results:

          One or two tuned pods fail to start: TuneD exits with "OSError: Error while polling fd: 4" and the node-tuning ClusterOperator stays Progressing with "Waiting for 1/3 Profiles to be applied".

      Expected results:

          Node tuning pods start successfully on all nodes.

      Additional info:

       
      

              Assignee: Team NTO
              Reporter: Ting Xue
              QA Contact: Liquan Cui
              Votes: 0
              Watchers: 3