OCPBUGS-63321: PerformanceProfile stuck degraded "BadMachineConfigLabels"


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Undefined
    • Affects Version/s: 4.18.z
    • Component/s: Node Tuning Operator
    • Quality / Stability / Reliability
    • Severity: Important

      Description of problem:

      During an SNO cluster upgrade via CGU (ClusterGroupUpgrade), the master PerformanceProfile ("master-profile") was stuck in a Degraded state with the following error:

      $ omc get performanceprofile master-profile -o yaml
      ...
        machineConfigPoolSelector:
          pools.operator.machineconfiguration.openshift.io/master: ""
        nodeSelector:
          node-role.kubernetes.io/master: ""
      ...
        - lastHeartbeatTime: "2025-10-15T09:38:12Z"
          lastTransitionTime: "2025-10-15T09:38:12Z"
          message: the MachineConfigPool "master" does not have any labels that can be used to bind it together with KubeletConfig
          reason: BadMachineConfigLabels
          status: "True"
          type: Degraded
      ...
      
      $ omc get mcp
      NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
      master   rendered-master-29ea7f8a941960dfc728072d933ae532   True      False      False      1              1                   1                     0                      5h
      worker   rendered-worker-11c0c71fc1056d589ef44207f9515356   True      False      False      0              0                   0                     0                      5h
      

      Based on the code at https://github.com/openshift/cluster-node-tuning-operator/blob/dbb384039d22b64a080cb114df5cde7be1effb42/pkg/performanceprofile/controller/performanceprofile_controller.go#L581,
      the error occurs when len(profileMCP.Labels) == 0, i.e. when the MachineConfigPool has no labels at all in metadata.labels. However, this MCP does have labels, so the configuration should reconcile:

      $ omc get mcp --show-labels
      NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE   LABELS
      master   rendered-master-29ea7f8a941960dfc728072d933ae532   True      False      False      1              1                   1                     0                      5h    machineconfiguration.openshift.io/mco-built-in=,operator.machineconfiguration.openshift.io/required-for-upgrade=,pools.operator.machineconfiguration.openshift.io/master=
      worker   rendered-worker-11c0c71fc1056d589ef44207f9515356   True      False      False      0              0                   0                     0                      5h    machineconfiguration.openshift.io/mco-built-in=,pools.operator.machineconfiguration.openshift.io/worker=
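
      For reference, a minimal sketch of the check behind this condition (paraphrased from the linked controller code; the function name is illustrative, and the mcov1 and fmt imports are assumed, matching the controller snippet in Additional info):

      // Paraphrased sketch: the profile is marked degraded when its
      // MachineConfigPool carries no metadata labels at all.
      func validateProfileMCPLabels(profileMCP *mcov1.MachineConfigPool) error {
          if len(profileMCP.Labels) == 0 {
              return fmt.Errorf("the MachineConfigPool %q does not have any labels that can be used to bind it together with KubeletConfig", profileMCP.Name)
          }
          return nil
      }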
      

      Operator logs:

      $ omc logs cluster-node-tuning-operator-759544dd89-zq975 -n openshift-cluster-node-tuning-operator
      ...
      2025-10-15T09:32:58.647515086Z I1015 09:32:58.647486 1 leaderelection.go:254] attempting to acquire leader lease openshift-cluster-node-tuning-operator/node-tuning-operator-lock...
      2025-10-15T09:38:12.288248812Z I1015 09:38:12.288212 1 leaderelection.go:268] successfully acquired lease openshift-cluster-node-tuning-operator/node-tuning-operator-lock
      2025-10-15T09:38:12.288429065Z I1015 09:38:12.288380 1 controller.go:1322] starting Tuned controller
      2025-10-15T09:38:12.288558723Z {"level":"info","ts":"2025-10-15T09:38:12Z","msg":"Starting EventSource","controller":"performanceprofile","controllerGroup":"performance.openshift.io","controllerKind":"PerformanceProfile","source":"kind source: *v2.PerformanceProfile"}
      2025-10-15T09:38:12.288577121Z {"level":"info","ts":"2025-10-15T09:38:12Z","msg":"Starting EventSource","controller":"performanceprofile","controllerGroup":"performance.openshift.io","controllerKind":"PerformanceProfile","source":"kind source: *v1.MachineConfig"}
      2025-10-15T09:38:12.288583643Z {"level":"info","ts":"2025-10-15T09:38:12Z","msg":"Starting EventSource","controller":"performanceprofile","controllerGroup":"performance.openshift.io","controllerKind":"PerformanceProfile","source":"kind source: *v1.KubeletConfig"}
      2025-10-15T09:38:12.288590245Z {"level":"info","ts":"2025-10-15T09:38:12Z","msg":"Starting EventSource","controller":"performanceprofile","controllerGroup":"performance.openshift.io","controllerKind":"PerformanceProfile","source":"kind source: *v1.Tuned"}
      2025-10-15T09:38:12.288590245Z {"level":"info","ts":"2025-10-15T09:38:12Z","msg":"Starting EventSource","controller":"performanceprofile","controllerGroup":"performance.openshift.io","controllerKind":"PerformanceProfile","source":"kind source: *v1.RuntimeClass"}
      2025-10-15T09:38:12.288597254Z {"level":"info","ts":"2025-10-15T09:38:12Z","msg":"Starting EventSource","controller":"performanceprofile","controllerGroup":"performance.openshift.io","controllerKind":"PerformanceProfile","source":"kind source: *v1.MachineConfigPool"}
      2025-10-15T09:38:12.288603532Z {"level":"info","ts":"2025-10-15T09:38:12Z","msg":"Starting EventSource","controller":"performanceprofile","controllerGroup":"performance.openshift.io","controllerKind":"PerformanceProfile","source":"kind source: *v1.Profile"}
      2025-10-15T09:38:12.288603532Z {"level":"info","ts":"2025-10-15T09:38:12Z","msg":"Starting Controller","controller":"performanceprofile","controllerGroup":"performance.openshift.io","controllerKind":"PerformanceProfile"}
      2025-10-15T09:38:12.488898347Z I1015 09:38:12.488869 1 controller.go:1443] started events processor/controller
      2025-10-15T09:38:12.496100626Z I1015 09:38:12.496038 1 server.go:104] starting metrics server
      2025-10-15T09:38:12.613915258Z {"level":"info","ts":"2025-10-15T09:38:12Z","msg":"Starting workers","controller":"performanceprofile","controllerGroup":"performance.openshift.io","controllerKind":"PerformanceProfile","worker count":1}
      ...
      

      Workaround applied: restart the NTO operator pod

      $ oc delete pod -n openshift-cluster-node-tuning-operator cluster-node-tuning-operator-56678557f-lmvzl
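
      After the restart the Degraded condition should clear; one way to verify (assuming the profile name from above):

      $ oc get performanceprofile master-profile -o jsonpath='{.status.conditions[?(@.type=="Degraded")].status}'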
      

      Version-Release number of selected component:

      $ omc get clusterversion
      NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.18.22   True        False         2h50m   Cluster version is 4.18.22
      
      $ omc get nodes
      NAME      STATUS   ROLES                         AGE   VERSION
      master0   Ready    control-plane,master,worker   5h    v1.31.11
      

      How reproducible:

      The partner is trying to reproduce the issue but cannot right now, and is checking whether the labels were actually removed. In my lab I can reproduce the same behavior by deleting the MCP labels: the NTO reports "BadMachineConfigLabels", and after the labels are re-added the NTO does not update its status.

      Steps to Reproduce:

      1. Scale down the machine-config controller/operator deployments.
      2. Remove MCP labels.
      3. Delete the NTO operator pod.
      4. Check the PerformanceProfile status (a shell sketch of these steps follows).
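
      A minimal shell sketch of the reproduction, assuming the default MCO deployment names and the NTO pod label name=cluster-node-tuning-operator (in a lab the CVO may also need to be paused so it does not scale the MCO back up); the label keys match the --show-labels output above:

      $ oc scale deployment/machine-config-operator deployment/machine-config-controller --replicas=0 -n openshift-machine-config-operator
      $ oc label mcp master pools.operator.machineconfiguration.openshift.io/master- machineconfiguration.openshift.io/mco-built-in- operator.machineconfiguration.openshift.io/required-for-upgrade-
      $ oc delete pod -n openshift-cluster-node-tuning-operator -l name=cluster-node-tuning-operator
      $ oc get performanceprofile master-profile -o yaml | grep -B5 'type: Degraded'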

      Actual results:
      The PerformanceProfile keeps reporting the error "BadMachineConfigLabels", even after the labels are restored.

      Expected results:
      The PerformanceProfile should reconcile the configuration without needing to restart the NTO operator pod.

      Additional info:
      Reviewing the code, the controller only requeues the profile when the MCP's Status.Conditions change; a label-only update leaves the conditions untouched, so the NTO never reconciles the configuration:
      https://github.com/openshift/cluster-node-tuning-operator/blob/dbb384039d22b64a080cb114df5cde7be1effb42/pkg/performanceprofile/controller/performanceprofile_controller.go#L129

      mcpPredicates := predicate.Funcs{
          UpdateFunc: func(e event.UpdateEvent) bool {
              if !validateUpdateEvent(e.ObjectOld, e.ObjectNew) {
                  return false
              }

              mcpOld := e.ObjectOld.(*mcov1.MachineConfigPool)
              mcpNew := e.ObjectNew.(*mcov1.MachineConfigPool)

              // Only status-condition changes trigger reconciliation;
              // metadata updates such as label changes are filtered out.
              return !reflect.DeepEqual(mcpOld.Status.Conditions, mcpNew.Status.Conditions)
          },
      }
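
      One possible direction (a sketch only, not a merged fix) would be to extend the predicate so that label changes on the MCP also trigger reconciliation, letting the controller clear the degraded condition once the labels are restored:

      mcpPredicates := predicate.Funcs{
          UpdateFunc: func(e event.UpdateEvent) bool {
              if !validateUpdateEvent(e.ObjectOld, e.ObjectNew) {
                  return false
              }

              mcpOld := e.ObjectOld.(*mcov1.MachineConfigPool)
              mcpNew := e.ObjectNew.(*mcov1.MachineConfigPool)

              // Also requeue on label changes, so re-adding the MCP labels
              // clears the BadMachineConfigLabels degraded condition.
              return !reflect.DeepEqual(mcpOld.Status.Conditions, mcpNew.Status.Conditions) ||
                  !reflect.DeepEqual(mcpOld.Labels, mcpNew.Labels)
          },
      }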
      

      Logs attached here: https://drive.google.com/drive/folders/187rqVwVBwFCowrPlpbyRD6Z0yiXNHRJZ?usp=drive_link

      Assignee: Team NTO (team-nto)
      Reporter: Jorge Claret Membrado (rhn-support-jclaretm)
      QA Contact: Mallapadi Niranjan
      Votes: 0
      Watchers: 7
