OpenShift Bugs / OCPBUGS-41920

MCPs report wrong number of nodes when we move nodes from one custom MCP to another custom MCP


    • Moderate
    • None
    • 3
    • False
    • Release Note Not Required
    • In Progress

      Description of problem:

      When a node is moved from one custom MCP to another custom MCP, the MCPs report the wrong number of nodes.
      
      For example, in the state below the worker-perf MCP reports a MACHINECOUNT of 2 even though only one node still carries the worker-perf role:
      
      $ oc get mcp,nodes
      NAME                                                                     CONFIG                                                         UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
      machineconfigpool.machineconfiguration.openshift.io/master               rendered-master-c8d23b071e1ccf6cf85c7f1b31c0def6               True      False      False      3              3                   3                     0                      142m
      machineconfigpool.machineconfiguration.openshift.io/worker               rendered-worker-36ee1fdc485685ac9c324769889c3348               True      False      False      1              1                   1                     0                      142m
      machineconfigpool.machineconfiguration.openshift.io/worker-perf          rendered-worker-perf-6b5fbffac62c3d437e307e849c44b556          True      False      False      2              2                   2                     0                      24m
      machineconfigpool.machineconfiguration.openshift.io/worker-perf-canary   rendered-worker-perf-canary-6b5fbffac62c3d437e307e849c44b556   True      False      False      1              1                   1                     0                      7m52s
      
      NAME                                             STATUS   ROLES                       AGE    VERSION
      node/ip-10-0-13-228.us-east-2.compute.internal   Ready    worker,worker-perf-canary   138m   v1.30.4
      node/ip-10-0-2-250.us-east-2.compute.internal    Ready    control-plane,master        145m   v1.30.4
      node/ip-10-0-34-223.us-east-2.compute.internal   Ready    control-plane,master        144m   v1.30.4
      node/ip-10-0-35-61.us-east-2.compute.internal    Ready    worker,worker-perf          136m   v1.30.4
      node/ip-10-0-79-232.us-east-2.compute.internal   Ready    control-plane,master        144m   v1.30.4
      node/ip-10-0-86-124.us-east-2.compute.internal   Ready    worker                      139m   v1.30.4
      
      
      
      After 20 to 30 minutes, the MCPs start reporting the right number of nodes.
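
To put a number on how long the stale count persists, the pool status can be polled until it reaches the expected value. This is a minimal sketch, not part of the original report: `wait_for_count` is a hypothetical helper name, the timeout and interval defaults are arbitrary, and it assumes the count of interest is `.status.machineCount` (the MACHINECOUNT column).

```shell
#!/usr/bin/env bash
# Sketch: poll an MCP until its reported machine count reaches the
# expected value, then print how long that took.
# wait_for_count is a hypothetical helper, not an oc subcommand.
wait_for_count() {
  local pool=$1 expected=$2 timeout=${3:-3600} interval=${4:-30}
  local start now count
  start=$(date +%s)
  while :; do
    # MACHINECOUNT column of "oc get mcp"
    count=$(oc get mcp "$pool" -o jsonpath='{.status.machineCount}')
    now=$(date +%s)
    if [ "$count" = "$expected" ]; then
      echo "$pool reports $count after $((now - start))s"
      return 0
    fi
    if [ $((now - start)) -ge "$timeout" ]; then
      echo "$pool still reports $count after ${timeout}s" >&2
      return 1
    fi
    sleep "$interval"
  done
}
```

For the scenario in this report, it would be started right after the node is relabelled, for example `wait_for_count worker-perf 1`.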
      
          

      Version-Release number of selected component (if applicable):
      IPI on AWS version:

      $ oc get clusterversion
      NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.17.0-0.nightly-2024-09-13-040101   True        False         124m    Cluster version is 4.17.0-0.nightly-2024-09-13-040101

          

      How reproducible:
      Always

          

      Steps to Reproduce:

          1. Create a MCP
          
           oc create -f - << EOF
      apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfigPool
      metadata:
        name: worker-perf
      spec:
        machineConfigSelector:
          matchExpressions:
            - {
               key: machineconfiguration.openshift.io/role,
               operator: In,
               values: [worker,worker-perf]
              }
        nodeSelector:
          matchLabels:
            node-role.kubernetes.io/worker-perf: ""
      EOF
      
          
          2. Add 2 nodes to the MCP
          
         $ oc label node $(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath="{.items[0].metadata.name}") node-role.kubernetes.io/worker-perf=
         $ oc label node $(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath="{.items[1].metadata.name}") node-role.kubernetes.io/worker-perf=
      
          3. Create another MCP
          oc create -f - << EOF
      apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfigPool
      metadata:
        name: worker-perf-canary
      spec:
        machineConfigSelector:
          matchExpressions:
            - {
               key: machineconfiguration.openshift.io/role,
               operator: In,
               values: [worker,worker-perf,worker-perf-canary]
              }
        nodeSelector:
          matchLabels:
            node-role.kubernetes.io/worker-perf-canary: ""
      EOF
      
          4. Move one node from the MCP created in step 1 to the MCP created in step 3
          $ oc label node $(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath="{.items[0].metadata.name}") node-role.kubernetes.io/worker-perf-canary=
          $ oc label node $(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath="{.items[0].metadata.name}") node-role.kubernetes.io/worker-perf-
          
          
          

      Actual results:

      The worker-perf pool is not reporting the right number of nodes. It continues reporting 2 nodes even though one of them was moved to the worker-perf-canary MCP.
      $ oc get mcp,nodes
      NAME                                                                     CONFIG                                                         UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
      machineconfigpool.machineconfiguration.openshift.io/master               rendered-master-c8d23b071e1ccf6cf85c7f1b31c0def6               True      False      False      3              3                   3                     0                      142m
      machineconfigpool.machineconfiguration.openshift.io/worker               rendered-worker-36ee1fdc485685ac9c324769889c3348               True      False      False      1              1                   1                     0                      142m
      machineconfigpool.machineconfiguration.openshift.io/worker-perf          rendered-worker-perf-6b5fbffac62c3d437e307e849c44b556          True      False      False      2              2                   2                     0                      24m
      machineconfigpool.machineconfiguration.openshift.io/worker-perf-canary   rendered-worker-perf-canary-6b5fbffac62c3d437e307e849c44b556   True      False      False      1              1                   1                     0                      7m52s
      
      NAME                                             STATUS   ROLES                       AGE    VERSION
      node/ip-10-0-13-228.us-east-2.compute.internal   Ready    worker,worker-perf-canary   138m   v1.30.4
      node/ip-10-0-2-250.us-east-2.compute.internal    Ready    control-plane,master        145m   v1.30.4
      node/ip-10-0-34-223.us-east-2.compute.internal   Ready    control-plane,master        144m   v1.30.4
      node/ip-10-0-35-61.us-east-2.compute.internal    Ready    worker,worker-perf          136m   v1.30.4
      node/ip-10-0-79-232.us-east-2.compute.internal   Ready    control-plane,master        144m   v1.30.4
      node/ip-10-0-86-124.us-east-2.compute.internal   Ready    worker                      139m   v1.30.4
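
The mismatch in the output above can also be checked mechanically by comparing the machine count each pool reports with the number of nodes that carry its role label. This is a minimal sketch under stated assumptions: the function names are hypothetical, and it assumes each custom pool selects nodes with a single `node-role.kubernetes.io/<pool>` label, as in the reproduction steps. It is not meaningful for the default `worker` pool, whose label is also present on nodes that belong to custom pools.

```shell
#!/usr/bin/env bash
# Sketch: compare the machine count an MCP reports with the number of
# nodes that carry its role label. Function names are illustrative only.

# Machine count as reported in the pool status (MACHINECOUNT column).
reported_count() {
  oc get mcp "$1" -o jsonpath='{.status.machineCount}'
}

# Number of nodes carrying the pool's role label. Assumes the pool's
# nodeSelector is node-role.kubernetes.io/<pool>.
labelled_count() {
  oc get nodes -l "node-role.kubernetes.io/$1" -o name | wc -l | tr -d ' '
}

# Print one line per pool; return non-zero if any pool is out of sync.
check_pools() {
  local rc=0 pool reported actual
  for pool in "$@"; do
    reported=$(reported_count "$pool")
    actual=$(labelled_count "$pool")
    if [ "$reported" = "$actual" ]; then
      echo "$pool: OK ($reported)"
    else
      echo "$pool: MISMATCH (reported=$reported, labelled=$actual)"
      rc=1
    fi
  done
  return "$rc"
}
```

It would be called with the custom pools from this report, for example `check_pools worker-perf worker-perf-canary`.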
      
      
          

      Expected results:

      MCPs should always report the right number of nodes. After the move, worker-perf and worker-perf-canary should each report a MACHINECOUNT of 1.
          

      Additional info:

      This is very similar to another issue:
      https://bugzilla.redhat.com/show_bug.cgi?id=2090436
      which was discussed in this Slack conversation:
      https://redhat-internal.slack.com/archives/C02CZNQHGN8/p1653479831004619
          

              djoshy David Joshy
              sregidor@redhat.com Sergio Regidor de la Rosa
              Sergio Regidor de la Rosa Sergio Regidor de la Rosa