-
Bug
-
Resolution: Unresolved
-
Normal
-
None
-
4.18.z
-
Quality / Stability / Reliability
-
False
-
-
None
-
Moderate
CNF Compute Sprint 268, CNF Compute Sprint 269, CNF Compute Sprint 270, CNF Compute Sprint 271, CNF Compute Sprint 272, CNF Compute Sprint 273, CNF Compute Sprint 274, CNF Compute Sprint 275, CNF Compute Sprint 276, CNF Compute Sprint 277
-
10
Description of problem:
When the NROP CR is configured with an MCP whose node selector also matches nodes from other MCPs, RTE pods also start on the non-targeted nodes. This happens because the NROP controller configures RTE by creating a daemonset whose node selector is copied from the MCP specified in the node group. The daemonset then creates RTE pods on every node that matches this selector, regardless of whether the node belongs to the specified MCP.
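The selector behavior described above can be illustrated with a minimal sketch. It is a simplified model and not the operator's code: node names, the `matches` helper, and the dict-based labels are hypothetical, and only exact-match (matchLabels-style) selection is modeled.

```python
# Simplified model (assumption): labels and selectors are plain dicts,
# and a selector matches when all of its key/value pairs are present.
WORKER = "node-role.kubernetes.io/worker"
WORKER_CNF = "node-role.kubernetes.io/worker-cnf"


def matches(selector, labels):
    """Return True if every key/value pair of selector is found in labels."""
    return all(labels.get(key) == value for key, value in selector.items())


# Hypothetical nodes: node-a is meant for the worker-cnf pool only.
nodes = {
    "node-a": {WORKER: "", WORKER_CNF: ""},
    "node-b": {WORKER: ""},
}

# Node selector copied from the "worker" MCP into the daemonset.
worker_selector = {WORKER: ""}

selected = sorted(name for name, labels in nodes.items()
                  if matches(worker_selector, labels))
print(selected)  # ['node-a', 'node-b']
```

Both nodes are selected, although node-a is intended to be managed through the worker-cnf pool.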
Version-Release number of selected component (if applicable):
all
How reproducible:
always
Steps to Reproduce:
setup:
- workers > 1
- one of the workers has the label node-role.kubernetes.io/worker-cnf: ""
- 2 MCPs for worker nodes: the default "worker" and another called "worker-cnf" that targets nodes with the label node-role.kubernetes.io/worker-cnf: ""
1. configure NROP with a node group that uses the "worker" MCP selector
Actual results:
RTE pods will be created on all nodes that have a label matching the "worker" MCP node selector:

shajmakh@shajmakh-thinkpadp16vgen1 ~ $ oc get node,po
NAME                                                  STATUS   ROLES                          AGE   VERSION
node/cnfdr11.telco5g.eng.rdu2.redhat.com              Ready    worker,worker-cnf              23h   v1.31.5
node/cnfdr9.telco5g.eng.rdu2.redhat.com               Ready    worker                         23h   v1.31.5
node/dhcp-10-1-105-178.telco5g.eng.rdu2.redhat.com    Ready    control-plane,master,virtual   24h   v1.31.5
node/dhcp-10-1-105-221.telco5g.eng.rdu2.redhat.com    Ready    control-plane,master,virtual   24h   v1.31.5
node/dhcp-10-1-105-44.telco5g.eng.rdu2.redhat.com     Ready    control-plane,master,virtual   24h   v1.31.5

NAME                                                   READY   STATUS    RESTARTS   AGE
pod/numaresources-controller-manager-cdf99bff5-cc4js   1/1     Running   0          22h
pod/numaresourcesoperator-worker-5wfr6                 2/2     Running   0          4h29m
pod/numaresourcesoperator-worker-x8vv9                 2/2     Running   0          4h29m
Expected results:
NROP is expected to target only the machines that belong to the MCP specified under the node group. We need a better way to identify the target nodes.
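One possible direction, sketched under stated assumptions, is to resolve each node to a single pool before deciding where RTE should run. The precedence used here (master first, then a custom pool, then worker) is an assumption modeled on how pool membership is commonly described for MCPs; the `primary_pool` helper, the node names, and the pool definitions are hypothetical and not taken from the operator.

```python
WORKER = "node-role.kubernetes.io/worker"
WORKER_CNF = "node-role.kubernetes.io/worker-cnf"
MASTER = "node-role.kubernetes.io/master"


def matches(selector, labels):
    """Return True if every key/value pair of selector is found in labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def primary_pool(labels, pools):
    """Pick one pool per node. Assumed precedence: master, custom, worker."""
    matching = [name for name, sel in pools.items() if matches(sel, labels)]
    if "master" in matching:
        return "master"
    custom = [name for name in matching if name != "worker"]
    if len(custom) > 1:
        raise ValueError(f"node matches several custom pools: {custom}")
    if custom:
        return custom[0]
    return "worker" if "worker" in matching else None


# Hypothetical pools and nodes mirroring the reproduction setup.
pools = {
    "worker": {WORKER: ""},
    "worker-cnf": {WORKER_CNF: ""},
    "master": {MASTER: ""},
}
nodes = {
    "node-a": {WORKER: "", WORKER_CNF: ""},
    "node-b": {WORKER: ""},
}

worker_targets = sorted(name for name, labels in nodes.items()
                        if primary_pool(labels, pools) == "worker")
print(worker_targets)  # ['node-b']
```

With this resolution, a node group that names the "worker" pool would target only node-b, because node-a resolves to worker-cnf.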
Additional info:
If the NRO CR is configured with the node pool "worker" and also another pool "XXX" (the other pool must have the worker role, otherwise it is against MCP rules), RTE pods will be doubled for nodes from pool "XXX". Below is an example of how this looks on a 5-node cluster that has 3 schedulable control-plane nodes and 2 workers:

shajmakh@shajmakh-thinkpadp16vgen1 ~/temp-el8 $ oc get pod -o wide
NAME                                                READY   STATUS    RESTARTS   AGE     IP             NODE                                         NOMINATED NODE   READINESS GATES
numaresources-controller-manager-554c9dffcd-nx8sk   1/1     Running   0          20d     10.134.0.103   dhcp-1-105-44.telco5g.eng.rdu2.redhat.com    <none>           <none>
numaresourcesoperator-master-76qft                  2/2     Running   0          7s      10.133.0.48    dhcp-1-105-178.telco5g.eng.rdu2.redhat.com   <none>           2/2
numaresourcesoperator-master-7wgv6                  2/2     Running   0          7s      10.134.1.125   dhcp-1-105-44.telco5g.eng.rdu2.redhat.com    <none>           2/2
numaresourcesoperator-master-x6f97                  2/2     Running   0          7s      10.132.0.120   dhcp-1-105-134.telco5g.eng.rdu2.redhat.com   <none>           2/2
numaresourcesoperator-worker-9bdsh                  2/2     Running   4          20d     10.132.3.167   cnfdr11.telco5g.eng.rdu2.redhat.com          <none>           2/2
numaresourcesoperator-worker-9fr9x                  2/2     Running   0          2m12s   10.133.0.46    dhcp-1-105-178.telco5g.eng.rdu2.redhat.com   <none>           2/2
numaresourcesoperator-worker-bschw                  2/2     Running   0          2m12s   10.134.1.123   dhcp-1-105-44.telco5g.eng.rdu2.redhat.com    <none>           2/2
numaresourcesoperator-worker-flkm6                  2/2     Running   0          2m12s   10.132.0.118   dhcp-1-105-134.telco5g.eng.rdu2.redhat.com   <none>           2/2
numaresourcesoperator-worker-jfw69                  2/2     Running   4          20d     10.135.1.90    cnfdr9.telco5g.eng.rdu2.redhat.com           <none>           2/2
secondary-scheduler-7759995447-f6rsw                1/1     Running   0          3d2h    10.134.1.121   dhcp-1-105-44.telco5g.eng.rdu2.redhat.com    <none>           <none>

shajmakh@shajmakh-thinkpadp16vgen1 ~/temp-el8 $ oc get node
NAME                                         STATUS   ROLES                                 AGE   VERSION
cnfdr11.telco5g.eng.rdu2.redhat.com          Ready    worker                                28d   v1.31.9
cnfdr9.telco5g.eng.rdu2.redhat.com           Ready    worker                                28d   v1.31.9
dhcp-1-105-134.telco5g.eng.rdu2.redhat.com   Ready    control-plane,master,virtual,worker   28d   v1.31.8
dhcp-1-105-178.telco5g.eng.rdu2.redhat.com   Ready    control-plane,master,virtual,worker   28d   v1.31.8
dhcp-1-105-44.telco5g.eng.rdu2.redhat.com    Ready    control-plane,master,virtual,worker   28d   v1.31.8

shajmakh@shajmakh-thinkpadp16vgen1 ~/temp-el8 $ oc get numaresourcesoperator -o yaml
apiVersion: v1
items:
- apiVersion: nodetopology.openshift.io/v1
  kind: NUMAResourcesOperator
  ...
  spec:
    logLevel: Trace
    nodeGroups:
    - poolName: worker
    - poolName: master
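The doubling can be reproduced with a small model of the cluster. This is a sketch under assumptions: node names are shortened placeholders, one daemonset per node group is modeled as one selector per pool, and the "master" pool selector is assumed to be the node-role.kubernetes.io/master label.

```python
from collections import Counter

WORKER = "node-role.kubernetes.io/worker"
MASTER = "node-role.kubernetes.io/master"  # assumed "master" pool selector


def matches(selector, labels):
    """Return True if every key/value pair of selector is found in labels."""
    return all(labels.get(key) == value for key, value in selector.items())


# Placeholder names: 2 workers and 3 schedulable control-plane nodes.
nodes = {
    "worker-0": {WORKER: ""},
    "worker-1": {WORKER: ""},
    "master-0": {MASTER: "", WORKER: ""},
    "master-1": {MASTER: "", WORKER: ""},
    "master-2": {MASTER: "", WORKER: ""},
}

# One selector per node group (poolName: worker, poolName: master).
node_group_selectors = {
    "worker": {WORKER: ""},
    "master": {MASTER: ""},
}

# Count how many daemonsets would place an RTE pod on each node.
pods_per_node = Counter()
for selector in node_group_selectors.values():
    for name, labels in nodes.items():
        if matches(selector, labels):
            pods_per_node[name] += 1

total_rte_pods = sum(pods_per_node.values())
doubled = sorted(name for name, count in pods_per_node.items() if count > 1)
print(total_rte_pods)  # 8
print(doubled)         # ['master-0', 'master-1', 'master-2']
```

The count of 8 agrees with the listing above: 5 numaresourcesoperator-worker pods plus 3 numaresourcesoperator-master pods, with each control-plane node running one of each.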