OpenShift Bugs / OCPBUGS-77659

Unavailable node after applying a performance profile with strict cpu reservation and isolated/reserved cpus

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Affects Version: 4.20
    • Component: Node / Kubelet
    • Severity: Important

      Description of problem:

      This issue was first seen while porting OVS-DPDK to OCP Virt, but it can be reproduced on a plain OCP 4.20 installation.
      Applying a performance profile that requests isolated and reserved CPUs together with strict CPU reservation leaves the worker node locked up (NotReady). The only known workaround is to log on to the node and remove the /var/lib/kubelet/cpu_manager_state file.
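
      The crash loop documented below comes down to a checkpoint-validation step in the kubelet's static CPU-manager policy: on startup it compares the CPUs recorded in the state file against the CPUs it currently considers available, and refuses to start on a mismatch; with no state file it simply re-initializes. A minimal Python sketch of that behavior (`parse_cpuset` and `static_policy_can_start` are illustrative names, not kubelet code; the values are the ones from this report):

```python
import json
import os
import tempfile

def parse_cpuset(spec: str) -> set:
    """Parse a Linux cpuset list such as '2-11,14-23' into a set of CPU ids."""
    cpus = set()
    for part in filter(None, spec.split(",")):
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus

def static_policy_can_start(state_path: str, available: set) -> bool:
    # No checkpoint: the kubelet initializes fresh state -- this is why
    # removing the file unblocks the node.
    if not os.path.exists(state_path):
        return True
    with open(state_path) as f:
        state = json.load(f)
    # Mimics the "current set of available CPUs doesn't match with CPUs in
    # state" check seen in the kubelet logs (CPU assignments are empty here).
    return parse_cpuset(state["defaultCpuSet"]) == available

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "cpu_manager_state")
    with open(path, "w") as f:
        json.dump({"policyName": "static", "defaultCpuSet": "2-11,14-23"}, f)
    print(static_policy_can_start(path, set(range(24))))  # False -> crash loop
    os.remove(path)  # the manual workaround
    print(static_policy_can_start(path, set(range(24))))  # True -> kubelet starts
```

      The open question in this bug is why the kubelet computes "0-23" as available while the checkpoint it itself wrote holds "2-11,14-23".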

      Version-Release number of selected component (if applicable):

      $ oc version
      Client Version: 4.13.7
      Kustomize Version: v4.5.7
      Server Version: 4.20.0-0.nightly-2026-03-02-014334
      Kubernetes Version: v1.33.8
      $ oc get deployment cluster-node-tuning-operator -n openshift-cluster-node-tuning-operator -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'
      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e8ff6bfeed2963b7955452fe752dd15f903b7415fdf017f5c1e6d2a290b39f26
      

      How reproducible: 100%

      Steps to Reproduce:

      1. Apply the following performance profile:

      $ cat pp.yaml                                                                                       
      apiVersion: performance.openshift.io/v2                                                             
      kind: PerformanceProfile                                                                            
      metadata:                                                                                           
        labels:                                                                                           
          machineconfiguration.openshift.io/role: worker                                                  
        name: ovs-dpdk-worker                                                                             
        annotations:                                                                                      
          kubeletconfig.experimental: |                                                                   
            {"cpuManagerPolicyOptions": {"full-pcpus-only": "true", "strict-cpu-reservation": "true"}}    
      spec:                                                                                               
        additionalKernelArgs:                                                                             
          - "enforcing=0"                                                                                 
          - "br-phys-bind=enp2s0"                                                                         
        cpu:                                                                                              
          isolated: "2-11,14-23"                                                                          
          reserved: "0-1,12-13"                                                                           
        hugepages:                                                                                        
          defaultHugepagesSize: "2M"                                                                      
          pages:                                                                                          
            - size: "2M"                                                                                  
              count: 8192                                                                                 
        nodeSelector:                                                                                     
          node-role.kubernetes.io/worker: ''                                                              
        numa:                                                                                             
          # Ref.: https://kubernetes.io/docs/tasks/administer-cluster/topology-manager/                   
          topologyPolicy: "restricted"                                                                    
      $ oc apply -f pp.yaml                                                                               
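      As a sanity check, the isolated and reserved sets in this profile are disjoint and together cover all 24 CPUs the node's topology reports later in the logs, so the failure is not caused by an inconsistent cpuset in the profile itself. A quick illustrative sketch (not OCP tooling; values copied from pp.yaml above):

```python
def parse_cpuset(spec: str) -> set:
    """Parse a cpuset list like '0-1,12-13' into a set of CPU ids."""
    cpus = set()
    for part in spec.split(","):
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus

isolated = parse_cpuset("2-11,14-23")  # spec.cpu.isolated
reserved = parse_cpuset("0-1,12-13")   # spec.cpu.reserved

assert isolated.isdisjoint(reserved)
assert isolated | reserved == set(range(24))  # 24 CPUs, per the topology log
print(f"isolated={len(isolated)} reserved={len(reserved)}")
```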
      

      Actual results:

      $ oc get nodes -w                                                                                   
      ...                                                                                                 
      w0     NotReady,SchedulingDisabled   worker                 106m   v1.33.8                          
      

      Expected results:

      $ oc get nodes -w                                                                                   
      ...                                                                                                 
      w0     Ready                          worker                 106m   v1.33.8                          
      

      Additional info:

      Before applying the performance profile, the following state was gathered on the node:

      [root@w0 ~]# cat /proc/cmdline                                                                      
      BOOT_IMAGE=(hd0,gpt3)/boot/ostree/rhcos-8ffde1af9e76874598344c2754660ead6f2f451919c716a81fadf450d2011ef7/vmlinuz-5.14.0-570.95.1.el9_6.x86_64 rw ostree=/ostree/boot.0/rhcos/8ffde1af9e76874598344c2754660ead6f2f451919c716a81fadf450d2011ef7/0 ignition.platform.id=metal ip=dhcp root=UUID=02d81ee3-daa3-43f7-ae88-a22c50a045e3 rw rootflags=prjquota boot=UUID=98e1c771-1117-4433-976a-61c1e58572d7 systemd.unified_cgroup_hierarchy=1 cgroup_no_v1=all psi=0 irqpoll console=tty0 console=ttyS0,115200 earlyprintk=ttyS0,115200
                                                                                                          
      [root@w0 ~]# taskset -pc 1                                                                          
      pid 1's current affinity list: 0-23                                                                 
                                                                                                          
      [root@w0 ~]# cat /var/lib/kubelet/cpu_manager_state                                                 
      {"policyName":"none","defaultCpuSet":"","checksum":1353318690}      
      

      After applying the performance profile, the node rebooted, and then:

      
      $ oc get nodes -w
      ...
      w0     NotReady,SchedulingDisabled   worker                 106m   v1.33.8
      
      $ oc get pods -A -o wide | grep machine-config.*w0
      openshift-machine-config-operator                  kube-rbac-proxy-crio-w0                                      1/1     Running     3             107m   192.168.158.32   w0       <none>           <none>
      openshift-machine-config-operator                  machine-config-daemon-hq6ml                                  2/2     Running     2             107m   192.168.158.32   w0       <none>           <none>
      
      $ oc logs -n openshift-machine-config-operator -f machine-config-daemon-hq6ml
      Defaulted container "machine-config-daemon" out of: machine-config-daemon, kube-rbac-proxy
      Error from server: Get "https://192.168.158.32:10250/containerLogs/openshift-machine-config-operator/machine-config-daemon-hq6ml/machine-config-daemon?follow=true": dial tcp 192.168.158.32:10250: connect: connection refused
      
      [root@w0 ~]# cat /proc/cmdline 
      BOOT_IMAGE=(hd0,gpt3)/boot/ostree/rhcos-8ffde1af9e76874598344c2754660ead6f2f451919c716a81fadf450d2011ef7/vmlinuz-5.14.0-570.95.1.el9_6.x86_64 rw ostree=/ostree/boot.0/rhcos/8ffde1af9e76874598344c2754660ead6f2f451919c716a81fadf450d2011ef7/0 ignition.platform.id=metal ip=dhcp root=UUID=02d81ee3-daa3-43f7-ae88-a22c50a045e3 rw rootflags=prjquota boot=UUID=98e1c771-1117-4433-976a-61c1e58572d7 skew_tick=1 tsc=reliable rcupdate.rcu_normal_after_boot=1 rcutree.nohz_full_patience_delay=1000 nohz=on rcu_nocbs=2-11,14-23 tuned.non_isolcpus=00003003 systemd.cpu_affinity=0,1,12,13 intel_iommu=on iommu=pt isolcpus=managed_irq,2-11,14-23 nohz_full=2-11,14-23 tsc=reliable nosoftlockup nmi_watchdog=0 mce=off skew_tick=1 rcutree.kthread_prio=11 default_hugepagesz=2M hugepagesz=2M hugepages=8192 intel_pstate=disable enforcing=0 br-phys-bind=enp2s0 systemd.unified_cgroup_hierarchy=1 cgroup_no_v1=all psi=0 irqpoll console=tty0 console=ttyS0,115200 earlyprintk=ttyS0,115200
      
      [root@w0 ~]# taskset -pc 1
      pid 1's current affinity list: 0,1,12,13
      
      [root@w0 ~]# cat /var/lib/kubelet/cpu_manager_state 
      {"policyName":"static","defaultCpuSet":"2-11,14-23","checksum":221304858}[root@w0 ~]# 
      

      The first kubelet start after the reboot ended with an error:

      [root@w0 ~]# journalctl -b 0
      Mar 03 13:29:51 localhost kernel: Linux version 5.14.0-570.95.1.el9_6.x86_64 (mockbuild@x86-64-04.build.eng.rdu2.redhat.com) (gcc (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5), GNU ld version 2.35.2-63.el9_6.1) #1 SMP PREEMPT_DYNAMIC Thu Feb >
      ...
      Mar 03 13:31:04 w0 crio[3260]: time="2026-03-03T13:31:04.755534694Z" level=info msg="Successfully cleaned up network for pod 781ae9671900921a2bfbe2172b1351712753cab94f1b3a8451ecd4de0ab37a5c"
      Mar 03 13:31:04 w0 kubenswrapper[3323]: Flag --container-runtime-endpoint has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-clus
      ter/kubelet-config-file/ for more information.
      ...
      Mar 03 13:31:04 w0 kubenswrapper[3323]: I0303 13:31:04.828497    3323 flags.go:64] FLAG: --cpu-manager-policy="none"
      Mar 03 13:31:04 w0 kubenswrapper[3323]: I0303 13:31:04.828503    3323 flags.go:64] FLAG: --cpu-manager-policy-options=""
      Mar 03 13:31:04 w0 kubenswrapper[3323]: I0303 13:31:04.828512    3323 flags.go:64] FLAG: --cpu-manager-reconcile-period="10s"
      ...
      Mar 03 13:31:04 w0 kubenswrapper[3323]: I0303 13:31:04.910261    3323 container_manager_linux.go:306] "Creating device plugin manager"
      Mar 03 13:31:04 w0 kubenswrapper[3323]: I0303 13:31:04.910272    3323 manager.go:141] "Creating Device Plugin manager" path="/var/lib/kubelet/device-plugins/kubelet.sock"
      Mar 03 13:31:04 w0 kubenswrapper[3323]: I0303 13:31:04.910302    3323 server.go:72] "Creating device plugin registration server" version="v1beta1" socket="/var/lib/kubelet/device-plugins/kubelet.sock"
      Mar 03 13:31:04 w0 kubenswrapper[3323]: I0303 13:31:04.910660    3323 cpu_manager.go:179] "Detected CPU topology" topology={"NumCPUs":24,"NumCores":24,"NumUncoreCache":24,"NumSockets":24,"NumNUMANodes":1,"CPUDetails":{"0":{"NUMANodeID":0,"SocketID":0,"CoreID":0,"UncoreCacheID":0},"1":{"NUMANodeID":0,"SocketID":1,"CoreID":1,"UncoreCacheID":1},"10":{"NUMANodeID":0,"SocketID":10,"CoreID":10,"UncoreCacheID":10},"11":{"NUMANodeID":0,"SocketID":11,"CoreID":11,"UncoreCacheID":11},"12":{"NUMANodeID":0,"SocketID":12,"CoreID":12,"UncoreCacheID":12},"13":{"NUMANodeID":0,"SocketID":13,"CoreID":13,"UncoreCacheID":13},"14":{"NUMANodeID":0,"SocketID":14,"CoreID":14,"UncoreCacheID":14},"15":{"NUMANodeID":0,"SocketID":15,"CoreID":15,"UncoreCacheID":15},"16":{"NUMANodeID":0,"SocketID":16,"CoreID":16,"UncoreCacheID":16},"17":{"NUMANodeID":0,"SocketID":17,"CoreID":17,"UncoreCacheID":17},"18":{"NUMANodeID":0,"SocketID":18,"CoreID":18,"UncoreCacheID":18},"19":{"NUMANodeID":0,"SocketID":19,"CoreID":19,"UncoreCacheID":19},"2":{"NUMANodeID":0,"SocketID":2,"CoreID":2,"UncoreCacheID":2},"20":{"NUMANodeID":0,"SocketID":20,"CoreID":20,"UncoreCacheID":20},"21":{"NUMANodeID":0,"SocketID":21,"CoreID":21,"UncoreCacheID":21},"22":{"NUMANodeID":0,"SocketID":22,"CoreID":22,"UncoreCacheID":22},"23":{"NUMANodeID":0,"SocketID":23,"CoreID":23,"UncoreCacheID":23},"3":{"NUMANodeID":0,"SocketID":3,"CoreID":3,"UncoreCacheID":3},"4":{"NUMANodeID":0,"SocketID":4,"CoreID":4,"UncoreCacheID":4},"5":{"NUMANodeID":0,"SocketID":5,"CoreID":5,"UncoreCacheID":5},"6":{"NUMANodeID":0,"SocketID":6,"CoreID":6,"UncoreCacheID":6},"7":{"NUMANodeID":0,"SocketID":7,"CoreID":7,"UncoreCacheID":7},"8":{"NUMANodeID":0,"SocketID":8,"CoreID":8,"UncoreCacheID":8},"9":{"NUMANodeID":0,"SocketID":9,"CoreID":9,"UncoreCacheID":9}}}
      Mar 03 13:31:04 w0 kubenswrapper[3323]: I0303 13:31:04.910740    3323 policy_static.go:145] "Static policy created with configuration" options={"FullPhysicalCPUsOnly":true,"DistributeCPUsAcrossNUMA":false,"AlignBySocket":false,"DistributeCPUsAcrossCores":false,"StrictCPUReservation":true,"PreferAlignByUncoreCacheOption":false} cpuGroupSize=1
      Mar 03 13:31:04 w0 kubenswrapper[3323]: I0303 13:31:04.910801    3323 policy_static.go:182] "Reserved CPUs not available for exclusive assignment" reservedSize=4 reserved="0-1,12-13" reservedPhysicalCPUs="0-1,12-13"
      Mar 03 13:31:04 w0 kubenswrapper[3323]: I0303 13:31:04.910820    3323 state_mem.go:36] "Initialized new in-memory state store"
      Mar 03 13:31:04 w0 kubenswrapper[3323]: I0303 13:31:04.910996    3323 server.go:1267] "Using root directory" path="/var/lib/kubelet"
      ...
      Mar 03 13:31:04 w0 systemd[1]: Startup finished in 1.752s (kernel) + 3.093s (initrd) + 1min 10.718s (userspace) = 1min 15.564s.
      Mar 03 13:31:04 w0 kubenswrapper[3323]: I0303 13:31:04.978041    3323 manager.go:324] Recovery completed
      Mar 03 13:31:04 w0 kubenswrapper[3323]: I0303 13:31:04.986759    3323 kubelet_node_status.go:413] "Setting node annotation to enable volume controller attach/detach"
      Mar 03 13:31:04 w0 kubenswrapper[3323]: I0303 13:31:04.987607    3323 kubelet_node_status.go:736] "Recording event message for node" node="w0" event="NodeHasSufficientMemory"
      Mar 03 13:31:04 w0 kubenswrapper[3323]: I0303 13:31:04.987632    3323 kubelet_node_status.go:736] "Recording event message for node" node="w0" event="NodeHasNoDiskPressure"
      Mar 03 13:31:04 w0 kubenswrapper[3323]: I0303 13:31:04.987650    3323 kubelet_node_status.go:736] "Recording event message for node" node="w0" event="NodeHasSufficientPID"
      Mar 03 13:31:04 w0 kubenswrapper[3323]: I0303 13:31:04.995779    3323 cpu_manager.go:222] "Starting CPU manager" policy="static"
      Mar 03 13:31:04 w0 kubenswrapper[3323]: I0303 13:31:04.995800    3323 cpu_manager.go:223] "Reconciling" reconcilePeriod="5s"
      Mar 03 13:31:04 w0 kubenswrapper[3323]: I0303 13:31:04.995835    3323 state_mem.go:36] "Initialized new in-memory state store"
      Mar 03 13:31:04 w0 kubenswrapper[3323]: I0303 13:31:04.998085    3323 state_mem.go:88] "Updated default CPUSet" cpuSet="2-11,14-23"
      Mar 03 13:31:04 w0 kubenswrapper[3323]: I0303 13:31:04.999675    3323 policy_static.go:218] "Static policy initialized" defaultCPUSet="2-11,14-23"
      Mar 03 13:31:04 w0 kubenswrapper[3323]: I0303 13:31:04.999789    3323 memory_manager.go:186] "Starting memorymanager" policy="Static"
      Mar 03 13:31:04 w0 kubenswrapper[3323]: I0303 13:31:04.999812    3323 state_mem.go:35] "Initializing new in-memory state store"
      Mar 03 13:31:05 w0 kubenswrapper[3323]: I0303 13:31:05.001452    3323 state_mem.go:75] "Updated machine memory state"
      Mar 03 13:31:05 w0 systemd[1]: Created slice libcontainer container kubepods.slice.
      Mar 03 13:31:05 w0 kernel: Warning: Unmaintained driver is detected: nft_compat
      Mar 03 13:31:05 w0 kubenswrapper[3323]: E0303 13:31:05.036524    3323 kubelet_node_status.go:515] "Error getting the current node from lister" err="node \"w0\" not found"
      Mar 03 13:31:05 w0 kubenswrapper[3323]: E0303 13:31:05.036698    3323 kubelet.go:1706] "Failed to start ContainerManager" err="failed to initialize top level QOS containers: error validating root container [kubepods] : cgroup [\"kubepods\"] has some missing controllers: cpuset"
      Mar 03 13:31:05 w0 systemd[1]: kubelet.service: Main process exited, code=exited, status=1/FAILURE
      Mar 03 13:31:05 w0 systemd[1]: kubelet.service: Failed with result 'exit-code'.
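
      This first failure is a cgroup v2 issue: the freshly created kubepods slice is missing the cpuset controller. A minimal sketch of that check, simulated with a scratch file (on the node the real file would presumably be /sys/fs/cgroup/kubepods.slice/cgroup.controllers; the path and the controller list below are assumptions for illustration, inferred from the error message):

```python
import os
import tempfile

def has_cpuset_controller(controllers_file: str) -> bool:
    """cgroup v2 lists a group's enabled controllers as one space-separated
    line in its cgroup.controllers file."""
    with open(controllers_file) as f:
        return "cpuset" in f.read().split()

# Simulate the state the error implies: common controllers present, cpuset not.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "cgroup.controllers")
    with open(path, "w") as f:
        f.write("cpu io memory pids")
    print(has_cpuset_controller(path))  # False -> "missing controllers: cpuset"
```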
      

      A restart followed, failing with a different error:

      ...
      Mar 03 13:31:15 w0 systemd[1]: kubelet.service: Scheduled restart job, restart counter is at 1.
      Mar 03 13:31:15 w0 systemd[1]: Stopped Kubernetes Kubelet.
      Mar 03 13:31:15 w0 systemd[1]: Starting Kubernetes Kubelet...
      Mar 03 13:31:15 w0 kubenswrapper[3387]: Flag --container-runtime-endpoint has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-clus
      ter/kubelet-config-file/ for more information.
      ...
      Mar 03 13:31:15 w0 kubenswrapper[3387]: I0303 13:31:15.318200    3387 flags.go:64] FLAG: --cpu-manager-policy="none"
      Mar 03 13:31:15 w0 kubenswrapper[3387]: I0303 13:31:15.318204    3387 flags.go:64] FLAG: --cpu-manager-policy-options=""
      Mar 03 13:31:15 w0 kubenswrapper[3387]: I0303 13:31:15.318213    3387 flags.go:64] FLAG: --cpu-manager-reconcile-period="10s"
      ...
      Mar 03 13:31:15 w0 kubenswrapper[3387]: I0303 13:31:15.349623    3387 container_manager_linux.go:306] "Creating device plugin manager"
      Mar 03 13:31:15 w0 kubenswrapper[3387]: I0303 13:31:15.349632    3387 manager.go:141] "Creating Device Plugin manager" path="/var/lib/kubelet/device-plugins/kubelet.sock"
      Mar 03 13:31:15 w0 kubenswrapper[3387]: I0303 13:31:15.349654    3387 server.go:72] "Creating device plugin registration server" version="v1beta1" socket="/var/lib/kubelet/device-plugins/kubelet.sock"
      Mar 03 13:31:15 w0 kubenswrapper[3387]: I0303 13:31:15.349693    3387 cpu_manager.go:179] "Detected CPU topology" topology={"NumCPUs":24,"NumCores":24,"NumUncoreCache":24,"NumSockets":24,"NumNUMANodes":1,"CPUDetails":{"0":{"NUMANodeID":0,"SocketID":0,"CoreID":0,"UncoreCacheID":0},"1":{"NUMANodeID":0,"SocketID":1,"CoreID":1,"UncoreCacheID":1},"10":{"NUMANodeID":0,"SocketID":10,"CoreID":10,"UncoreCacheID":10},"11":{"NUMANodeID":0,"SocketID":11,"CoreID":11,"UncoreCacheID":11},"12":{"NUMANodeID":0,"SocketID":12,"CoreID":12,"UncoreCacheID":12},"13":{"NUMANodeID":0,"SocketID":13,"CoreID":13,"UncoreCacheID":13},"14":{"NUMANodeID":0,"SocketID":14,"CoreID":14,"UncoreCacheID":14},"15":{"NUMANodeID":0,"SocketID":15,"CoreID":15,"UncoreCacheID":15},"16":{"NUMANodeID":0,"SocketID":16,"CoreID":16,"UncoreCacheID":16},"17":{"NUMANodeID":0,"SocketID":17,"CoreID":17,"UncoreCacheID":17},"18":{"NUMANodeID":0,"SocketID":18,"CoreID":18,"UncoreCacheID":18},"19":{"NUMANodeID":0,"SocketID":19,"CoreID":19,"UncoreCacheID":19},"2":{"NUMANodeID":0,"SocketID":2,"CoreID":2,"UncoreCacheID":2},"20":{"NUMANodeID":0,"SocketID":20,"CoreID":20,"UncoreCacheID":20},"21":{"NUMANodeID":0,"SocketID":21,"CoreID":21,"UncoreCacheID":21},"22":{"NUMANodeID":0,"SocketID":22,"CoreID":22,"UncoreCacheID":22},"23":{"NUMANodeID":0,"SocketID":23,"CoreID":23,"UncoreCacheID":23},"3":{"NUMANodeID":0,"SocketID":3,"CoreID":3,"UncoreCacheID":3},"4":{"NUMANodeID":0,"SocketID":4,"CoreID":4,"UncoreCacheID":4},"5":{"NUMANodeID":0,"SocketID":5,"CoreID":5,"UncoreCacheID":5},"6":{"NUMANodeID":0,"SocketID":6,"CoreID":6,"UncoreCacheID":6},"7":{"NUMANodeID":0,"SocketID":7,"CoreID":7,"UncoreCacheID":7},"8":{"NUMANodeID":0,"SocketID":8,"CoreID":8,"UncoreCacheID":8},"9":{"NUMANodeID":0,"SocketID":9,"CoreID":9,"UncoreCacheID":9}}}
      Mar 03 13:31:15 w0 kubenswrapper[3387]: I0303 13:31:15.349746    3387 policy_static.go:145] "Static policy created with configuration" options={"FullPhysicalCPUsOnly":true,"DistributeCPUsAcrossNUMA":false,"AlignBySocket":false,"DistributeCPUsAcrossCores":false,"StrictCPUReservation":true,"PreferAlignByUncoreCacheOption":false} cpuGroupSize=1
      Mar 03 13:31:15 w0 kubenswrapper[3387]: I0303 13:31:15.349776    3387 policy_static.go:182] "Reserved CPUs not available for exclusive assignment" reservedSize=4 reserved="0-1,12-13" reservedPhysicalCPUs="0-1,12-13"
      ...
      Mar 03 13:31:15 w0 kubenswrapper[3387]: I0303 13:31:15.390130    3387 manager.go:324] Recovery completed
      Mar 03 13:31:15 w0 kubenswrapper[3387]: I0303 13:31:15.396295    3387 kubelet_network_linux.go:49] "Initialized iptables rules." protocol="IPv4"
      Mar 03 13:31:15 w0 kubenswrapper[3387]: I0303 13:31:15.397904    3387 kubelet_node_status.go:413] "Setting node annotation to enable volume controller attach/detach"
      Mar 03 13:31:15 w0 kubenswrapper[3387]: I0303 13:31:15.398420    3387 kubelet_node_status.go:736] "Recording event message for node" node="w0" event="NodeHasSufficientMemory"
      Mar 03 13:31:15 w0 kubenswrapper[3387]: I0303 13:31:15.398461    3387 kubelet_node_status.go:736] "Recording event message for node" node="w0" event="NodeHasNoDiskPressure"
      Mar 03 13:31:15 w0 kubenswrapper[3387]: I0303 13:31:15.398479    3387 kubelet_node_status.go:736] "Recording event message for node" node="w0" event="NodeHasSufficientPID"
      Mar 03 13:31:15 w0 kubenswrapper[3387]: I0303 13:31:15.398981    3387 cpu_manager.go:222] "Starting CPU manager" policy="static"
      Mar 03 13:31:15 w0 kubenswrapper[3387]: I0303 13:31:15.398992    3387 cpu_manager.go:223] "Reconciling" reconcilePeriod="5s"
      Mar 03 13:31:15 w0 kubenswrapper[3387]: I0303 13:31:15.399007    3387 state_mem.go:36] "Initialized new in-memory state store"
      Mar 03 13:31:15 w0 kubenswrapper[3387]: I0303 13:31:15.399127    3387 state_mem.go:88] "Updated default CPUSet" cpuSet="2-11,14-23"
      Mar 03 13:31:15 w0 kubenswrapper[3387]: I0303 13:31:15.399139    3387 state_mem.go:96] "Updated CPUSet assignments" assignments={}
      Mar 03 13:31:15 w0 kubenswrapper[3387]: I0303 13:31:15.399153    3387 state_checkpoint.go:136] "State checkpoint: restored state from checkpoint"
      Mar 03 13:31:15 w0 kubenswrapper[3387]: I0303 13:31:15.399162    3387 state_checkpoint.go:137] "State checkpoint: defaultCPUSet" defaultCpuSet="2-11,14-23"
      Mar 03 13:31:15 w0 kubenswrapper[3387]: E0303 13:31:15.399196    3387 policy_static.go:195] "Static policy invalid state, please drain node and remove policy state file" err="current set of available CPUs \"0-23\" doesn't match with CPUs in state \"2-11,14-23\""
      Mar 03 13:31:15 w0 kubenswrapper[3387]: E0303 13:31:15.399205    3387 cpu_manager.go:239] "Policy start error" err="current set of available CPUs \"0-23\" doesn't match with CPUs in state \"2-11,14-23\""
      Mar 03 13:31:15 w0 kubenswrapper[3387]: E0303 13:31:15.399214    3387 kubelet.go:1706] "Failed to start ContainerManager" err="start cpu manager error: current set of available CPUs \"0-23\" doesn't match with CPUs in state \"2-11,14-23\""
      Mar 03 13:31:15 w0 systemd[1]: kubelet.service: Main process exited, code=exited, status=1/FAILURE
      Mar 03 13:31:15 w0 systemd[1]: kubelet.service: Failed with result 'exit-code'.
      Mar 03 13:31:25 w0 systemd[1]: kubelet.service: Scheduled restart job, restart counter is at 2.
      Mar 03 13:31:25 w0 systemd[1]: Stopped Kubernetes Kubelet.
      Mar 03 13:31:25 w0 systemd[1]: Starting Kubernetes Kubelet...
      Mar 03 13:31:25 w0 kubenswrapper[3417]: Flag --container-runtime-endpoint has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
      ...
      etc...
      

      The node still had not recovered after 30 minutes:

      [root@w0 ~]# date
      Tue Mar  3 14:03:33 UTC 2026
      [root@w0 ~]# journalctl -b 0 | grep -c 'Policy start error" err="current set of available CPUs \\"0-23\\"'
      186
      
      [root@w0 ~]# journalctl -f
      ...
      Mar 03 14:04:14 w0 kubenswrapper[9049]: I0303 14:04:14.428085    9049 cpu_manager.go:222] "Starting CPU manager" policy="static"
      Mar 03 14:04:14 w0 kubenswrapper[9049]: I0303 14:04:14.428112    9049 cpu_manager.go:223] "Reconciling" reconcilePeriod="5s"
      Mar 03 14:04:14 w0 kubenswrapper[9049]: I0303 14:04:14.428144    9049 state_mem.go:36] "Initialized new in-memory state store"
      Mar 03 14:04:14 w0 kubenswrapper[9049]: I0303 14:04:14.428419    9049 state_mem.go:88] "Updated default CPUSet" cpuSet="2-11,14-23"
      Mar 03 14:04:14 w0 kubenswrapper[9049]: I0303 14:04:14.428447    9049 state_mem.go:96] "Updated CPUSet assignments" assignments={}
      Mar 03 14:04:14 w0 kubenswrapper[9049]: I0303 14:04:14.428478    9049 state_checkpoint.go:136] "State checkpoint: restored state from checkpoint"
      Mar 03 14:04:14 w0 kubenswrapper[9049]: I0303 14:04:14.428498    9049 state_checkpoint.go:137] "State checkpoint: defaultCPUSet" defaultCpuSet="2-11,14-23"
      Mar 03 14:04:14 w0 kubenswrapper[9049]: E0303 14:04:14.428569    9049 policy_static.go:195] "Static policy invalid state, please drain node and remove policy state file" err="current set of available CPUs \"0-23\" doesn't match with CPUs in state \"2-11,14-23\""
      Mar 03 14:04:14 w0 kubenswrapper[9049]: E0303 14:04:14.428589    9049 cpu_manager.go:239] "Policy start error" err="current set of available CPUs \"0-23\" doesn't match with CPUs in state \"2-11,14-23\""
      Mar 03 14:04:14 w0 kubenswrapper[9049]: E0303 14:04:14.428608    9049 kubelet.go:1706] "Failed to start ContainerManager" err="start cpu manager error: current set of available CPUs \"0-23\" doesn't match with CPUs in state \"2-11,14-23\""
      Mar 03 14:04:14 w0 systemd[1]: kubelet.service: Main process exited, code=exited, status=1/FAILURE
      Mar 03 14:04:14 w0 systemd[1]: kubelet.service: Failed with result 'exit-code'.
      
      

              Francesco Romani (fromani@redhat.com)
              David Marchand (rhn-support-dmarchan)
              Niranjan Mallapadi Raghavendra Rao