OCPBUGS-63321: PerformanceProfile stuck degraded "BadMachineConfigLabels"


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Undefined
    • Affects Version/s: 4.18.z
    • Component/s: Node Tuning Operator
    • Quality / Stability / Reliability
    • Severity: Important

      Description of problem:

      During an SNO cluster upgrade via CGU (ClusterGroupUpgrade), the master PerformanceProfile ("master-profile") was stuck in a Degraded state with the following error:

      $ omc get performanceprofile master-profile -o yaml
      ...
        machineConfigPoolSelector:
          pools.operator.machineconfiguration.openshift.io/master: ""
        nodeSelector:
          node-role.kubernetes.io/master: ""
      ...
        - lastHeartbeatTime: "2025-10-15T09:38:12Z"
          lastTransitionTime: "2025-10-15T09:38:12Z"
          message: the MachineConfigPool "master" does not have any labels that can be used to bind it together with KubeletConfig
          reason: BadMachineConfigLabels
          status: "True"
          type: Degraded
      ...
      
      $ omc get mcp
      NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
      master   rendered-master-29ea7f8a941960dfc728072d933ae532   True      False      False      1              1                   1                     0                      5h
      worker   rendered-worker-11c0c71fc1056d589ef44207f9515356   True      False      False      0              0                   0                     0                      5h
      

      Based on the code at https://github.com/openshift/cluster-node-tuning-operator/blob/dbb384039d22b64a080cb114df5cde7be1effb42/pkg/performanceprofile/controller/performanceprofile_controller.go#L581,
      the error occurs when len(profileMCP.Labels) == 0, i.e. when the MachineConfigPool has no labels at all in metadata.labels. However, this MCP does have labels, so the configuration should reconcile:

      $ omc get mcp --show-labels
      NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE   LABELS
      master   rendered-master-29ea7f8a941960dfc728072d933ae532   True      False      False      1              1                   1                     0                      5h    machineconfiguration.openshift.io/mco-built-in=,operator.machineconfiguration.openshift.io/required-for-upgrade=,pools.operator.machineconfiguration.openshift.io/master=
      worker   rendered-worker-11c0c71fc1056d589ef44207f9515356   True      False      False      0              0                   0                     0                      5h    machineconfiguration.openshift.io/mco-built-in=,pools.operator.machineconfiguration.openshift.io/worker=
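
      For reference, a minimal sketch of the check behind this condition (paraphrased from the linked controller code; the function name is illustrative, and the mcov1 and fmt imports are assumed, matching the controller snippet in Additional info):

      // Paraphrased sketch: the profile is marked degraded when its
      // MachineConfigPool carries no metadata labels at all.
      func validateProfileMCPLabels(profileMCP *mcov1.MachineConfigPool) error {
          if len(profileMCP.Labels) == 0 {
              return fmt.Errorf("the MachineConfigPool %q does not have any labels that can be used to bind it together with KubeletConfig", profileMCP.Name)
          }
          return nil
      }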
      

      Operator logs:

      $ omc logs cluster-node-tuning-operator-759544dd89-zq975 -n openshift-cluster-node-tuning-operator
      ...
      2025-10-15T09:32:58.647515086Z I1015 09:32:58.647486 1 leaderelection.go:254] attempting to acquire leader lease openshift-cluster-node-tuning-operator/node-tuning-operator-lock...
      2025-10-15T09:38:12.288248812Z I1015 09:38:12.288212 1 leaderelection.go:268] successfully acquired lease openshift-cluster-node-tuning-operator/node-tuning-operator-lock
      2025-10-15T09:38:12.288429065Z I1015 09:38:12.288380 1 controller.go:1322] starting Tuned controller
      2025-10-15T09:38:12.288558723Z {"level":"info","ts":"2025-10-15T09:38:12Z","msg":"Starting EventSource","controller":"performanceprofile","controllerGroup":"performance.openshift.io","controllerKind":"PerformanceProfile","source":"kind source: *v2.PerformanceProfile"}
      2025-10-15T09:38:12.288577121Z {"level":"info","ts":"2025-10-15T09:38:12Z","msg":"Starting EventSource","controller":"performanceprofile","controllerGroup":"performance.openshift.io","controllerKind":"PerformanceProfile","source":"kind source: *v1.MachineConfig"}
      2025-10-15T09:38:12.288583643Z {"level":"info","ts":"2025-10-15T09:38:12Z","msg":"Starting EventSource","controller":"performanceprofile","controllerGroup":"performance.openshift.io","controllerKind":"PerformanceProfile","source":"kind source: *v1.KubeletConfig"}
      2025-10-15T09:38:12.288590245Z {"level":"info","ts":"2025-10-15T09:38:12Z","msg":"Starting EventSource","controller":"performanceprofile","controllerGroup":"performance.openshift.io","controllerKind":"PerformanceProfile","source":"kind source: *v1.Tuned"}
      2025-10-15T09:38:12.288590245Z {"level":"info","ts":"2025-10-15T09:38:12Z","msg":"Starting EventSource","controller":"performanceprofile","controllerGroup":"performance.openshift.io","controllerKind":"PerformanceProfile","source":"kind source: *v1.RuntimeClass"}
      2025-10-15T09:38:12.288597254Z {"level":"info","ts":"2025-10-15T09:38:12Z","msg":"Starting EventSource","controller":"performanceprofile","controllerGroup":"performance.openshift.io","controllerKind":"PerformanceProfile","source":"kind source: *v1.MachineConfigPool"}
      2025-10-15T09:38:12.288603532Z {"level":"info","ts":"2025-10-15T09:38:12Z","msg":"Starting EventSource","controller":"performanceprofile","controllerGroup":"performance.openshift.io","controllerKind":"PerformanceProfile","source":"kind source: *v1.Profile"}
      2025-10-15T09:38:12.288603532Z {"level":"info","ts":"2025-10-15T09:38:12Z","msg":"Starting Controller","controller":"performanceprofile","controllerGroup":"performance.openshift.io","controllerKind":"PerformanceProfile"}
      2025-10-15T09:38:12.488898347Z I1015 09:38:12.488869 1 controller.go:1443] started events processor/controller
      2025-10-15T09:38:12.496100626Z I1015 09:38:12.496038 1 server.go:104] starting metrics server
      2025-10-15T09:38:12.613915258Z {"level":"info","ts":"2025-10-15T09:38:12Z","msg":"Starting workers","controller":"performanceprofile","controllerGroup":"performance.openshift.io","controllerKind":"PerformanceProfile","worker count":1}
      ...
      

      Workaround applied: restart the NTO operator pod

      $ oc delete pod -n openshift-cluster-node-tuning-operator cluster-node-tuning-operator-56678557f-lmvzl
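
      After the restart the Degraded condition should clear; one way to verify (assuming the profile name from above):

      $ oc get performanceprofile master-profile -o jsonpath='{.status.conditions[?(@.type=="Degraded")].status}'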
      

      Version-Release number of selected component:

      $ omc get clusterversion
      NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.18.22   True        False         2h50m   Cluster version is 4.18.22
      
      $ omc get nodes
      NAME      STATUS   ROLES                         AGE   VERSION
      master0   Ready    control-plane,master,worker   5h    v1.31.11
      

      How reproducible:

      The partner is trying to reproduce the issue but cannot right now, and is checking whether the labels were actually removed. In my lab I can reproduce the same behavior by deleting the MCP labels: the NTO reports "BadMachineConfigLabels", and after the labels are re-added the NTO does not update its status.

      Steps to Reproduce:

      1. Scale down the machine-config controller/operator deployments.
      2. Remove MCP labels.
      3. Delete the NTO operator pod.
      4. Check the PerformanceProfile status (a shell sketch of these steps follows).
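
      A minimal shell sketch of the reproduction, assuming the default MCO deployment names and the NTO pod label name=cluster-node-tuning-operator (in a lab the CVO may also need to be paused so it does not scale the MCO back up); the label keys match the --show-labels output above:

      $ oc scale deployment/machine-config-operator deployment/machine-config-controller --replicas=0 -n openshift-machine-config-operator
      $ oc label mcp master pools.operator.machineconfiguration.openshift.io/master- machineconfiguration.openshift.io/mco-built-in- operator.machineconfiguration.openshift.io/required-for-upgrade-
      $ oc delete pod -n openshift-cluster-node-tuning-operator -l name=cluster-node-tuning-operator
      $ oc get performanceprofile master-profile -o yaml | grep -B5 'type: Degraded'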

      Actual results:
      The PerformanceProfile keeps reporting the error "BadMachineConfigLabels", even after the labels are restored.

      Expected results:
      The PerformanceProfile should reconcile the configuration without needing to restart the NTO operator pod.

      Additional info:
      Reviewing the code, the controller only requeues the profile when the MCP's Status.Conditions change; a label-only update leaves the conditions untouched, so the NTO never reconciles the configuration:
      https://github.com/openshift/cluster-node-tuning-operator/blob/dbb384039d22b64a080cb114df5cde7be1effb42/pkg/performanceprofile/controller/performanceprofile_controller.go#L129

      mcpPredicates := predicate.Funcs{
          UpdateFunc: func(e event.UpdateEvent) bool {
              if !validateUpdateEvent(e.ObjectOld, e.ObjectNew) {
                  return false
              }

              mcpOld := e.ObjectOld.(*mcov1.MachineConfigPool)
              mcpNew := e.ObjectNew.(*mcov1.MachineConfigPool)

              // Only status-condition changes trigger reconciliation;
              // metadata updates such as label changes are filtered out.
              return !reflect.DeepEqual(mcpOld.Status.Conditions, mcpNew.Status.Conditions)
          },
      }
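
      One possible direction (a sketch only, not a merged fix) would be to extend the predicate so that label changes on the MCP also trigger reconciliation, letting the controller clear the degraded condition once the labels are restored:

      mcpPredicates := predicate.Funcs{
          UpdateFunc: func(e event.UpdateEvent) bool {
              if !validateUpdateEvent(e.ObjectOld, e.ObjectNew) {
                  return false
              }

              mcpOld := e.ObjectOld.(*mcov1.MachineConfigPool)
              mcpNew := e.ObjectNew.(*mcov1.MachineConfigPool)

              // Also requeue on label changes, so re-adding the MCP labels
              // clears the BadMachineConfigLabels degraded condition.
              return !reflect.DeepEqual(mcpOld.Status.Conditions, mcpNew.Status.Conditions) ||
                  !reflect.DeepEqual(mcpOld.Labels, mcpNew.Labels)
          },
      }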
      

      Logs attached here: https://drive.google.com/drive/folders/187rqVwVBwFCowrPlpbyRD6Z0yiXNHRJZ?usp=drive_link

      Assignee: Team NTO (team-nto)
      Reporter: Jorge Claret Membrado (rhn-support-jclaretm)
      QA Contact: Mallapadi Niranjan
      Votes: 0
      Watchers: 7
