OpenShift Bugs / OCPBUGS-77922

CAPI v1.11: MachineDeployment complete calculation flakes when updating


      Description of problem:

      On CAPI 1.11 we have observed a delay in the updating of a MachineDeployment's status.Replicas while a rolling update is in progress.
      This causes the completeness calculation for a MachineDeployment to briefly report complete when it actually is not.
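The shape of the problem can be illustrated with a minimal sketch of a replica-count completeness check. This is a simplified stand-in modeled on the status fields shown in the logs below, not the actual CAPI or HyperShift code; the type and function names are hypothetical.

```go
package main

import "fmt"

// machineDeploymentStatus mirrors the counters visible in the log excerpt
// below; it is a simplified stand-in, not the real CAPI type.
type machineDeploymentStatus struct {
	Replicas          int32
	UpdatedReplicas   int32
	AvailableReplicas int32
}

// complete reports completeness purely from the status counters: all
// replicas updated, all available, and no surplus (old) machines counted.
func complete(specReplicas int32, s machineDeploymentStatus) bool {
	return s.UpdatedReplicas == specReplicas &&
		s.Replicas == specReplicas &&
		s.AvailableReplicas == specReplicas
}

func main() {
	spec := int32(2)

	// Transient state from the logs: UpdatedReplicas is already 2, but
	// status.Replicas has not yet been bumped to 3 to account for the old
	// machine that is still deleting, so the check wrongly reports true.
	stale := machineDeploymentStatus{Replicas: 2, UpdatedReplicas: 2, AvailableReplicas: 2}
	fmt.Println(complete(spec, stale))

	// One reconcile later, status.Replicas catches up to 3 and the check
	// correctly reports false again.
	caughtUp := machineDeploymentStatus{Replicas: 3, UpdatedReplicas: 2, AvailableReplicas: 2}
	fmt.Println(complete(spec, caughtUp))
}
```

A check of this form is only correct if status.Replicas is updated in the same reconcile pass that bumps UpdatedReplicas; if the counters lag each other, as observed on v1.11, the check flickers.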

       

      We need to investigate what is causing this difference in behavior between CAPI 1.10 and 1.11.

       

      Version-Release number of selected component (if applicable):

      CAPA v2.10.0 / CAPI v1.11

      How reproducible:

      Always

       

      Steps to Reproduce:

      1. Build Hypershift and the Hypershift operator with CAPI 1.11 (e.g. from the bump PR).
      2. Install hypershift using the new operator image in a management cluster.
      3. Run the TestNodePool/HostedCluster0/Main/TestRollingUpgrade e2e test on the management cluster with the CAPI 1.11 operator.

      Actual results:

      MachineDeploymentComplete() temporarily flakes, returning true while machines are still updating or deleting.

      Expected results:

      The MachineDeploymentComplete function should return true only once old machines are completely removed, as it does in CAPI v1.10.

       

      Debug information

      Logs of the flake in MachineDeploymentComplete():

      {"level":"info","ts":"2026-03-03T13:08:09Z","msg":"Machine deployment completion:","name":"node-pool-lh767-test-rolling-upgrade","complete":false,"status.Replicas":2,"spec.Replicas":2,"status.AvailableReplicas":2,"status.UpdatedReplicass":1} {"level":"info","ts":"2026-03-03T13:08:09Z","msg":"Machine deployment completion:","name":"node-pool-lh767-test-rolling-upgrade","complete":true,"status.Replicas":2,"spec.Replicas":2,"status.AvailableReplicas":2,"status.UpdatedReplicass":2} {"level":"info","ts":"2026-03-03T13:08:09Z","msg":"Machine deployment completion:","name":"node-pool-lh767-test-rolling-upgrade","complete":true,"status.Replicas":2,"spec.Replicas":2,"status.AvailableReplicas":2,"status.UpdatedReplicass":2} {"level":"info","ts":"2026-03-03T13:08:09Z","msg":"Machine deployment completion:","name":"node-pool-lh767-test-rolling-upgrade","complete":false,"status.Replicas":3,"spec.Replicas":2,"status.AvailableReplicas":2,"status.UpdatedReplicass":2} 

      Here status.UpdatedReplicas goes from 1 to 2 while status.Replicas stays at 2 (instead of increasing to 3) for a couple of reconcile cycles, so the deployment is briefly reported complete.

      On the next reconcile cycle, status.Replicas rises to 3 and completeness correctly returns to false, exposing the flake.
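The four log entries above can be replayed through the same kind of counter-based check to show where the false "complete" appears. As before, the type and function here are hypothetical simplifications of the real code, with the counter values taken directly from the log excerpt.

```go
package main

import "fmt"

// mdStatus holds the counters from one reconcile snapshot in the logs above.
type mdStatus struct {
	Replicas          int32
	UpdatedReplicas   int32
	AvailableReplicas int32
}

// complete is the simplified counter-based completeness check.
func complete(spec int32, s mdStatus) bool {
	return s.UpdatedReplicas == spec && s.Replicas == spec && s.AvailableReplicas == spec
}

func main() {
	spec := int32(2)
	// The four reconcile snapshots from the log excerpt.
	snapshots := []mdStatus{
		{Replicas: 2, UpdatedReplicas: 1, AvailableReplicas: 2}, // mid-rollout: false (correct)
		{Replicas: 2, UpdatedReplicas: 2, AvailableReplicas: 2}, // flake: true, deleting machine not yet counted
		{Replicas: 2, UpdatedReplicas: 2, AvailableReplicas: 2}, // flake persists for another cycle
		{Replicas: 3, UpdatedReplicas: 2, AvailableReplicas: 2}, // Replicas catches up: false again
	}
	for i, s := range snapshots {
		fmt.Printf("cycle %d: complete=%v\n", i, complete(spec, s))
	}
}
```

This prints false, true, true, false across the four cycles: the middle two snapshots are exactly the window where a consumer polling completeness could be misled.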

      Printing all machines at the end of the test confirms there are indeed three (two updated to m5.xlarge, one m5.large still deleting):

      {
        "level": "info",
        "ts": "2026-03-03T14:08:11+01:00",
        "msg": "Checking AWS Machine",
        "Name": "node-pool-lh767-test-rolling-upgrade-4xbgp-bccqh",
        "InstanceType": "m5.xlarge",
        "Annotations": {
          "cluster.x-k8s.io/cloned-from-groupkind": "AWSMachineTemplate.infrastructure.cluster.x-k8s.io",
          "cluster.x-k8s.io/cloned-from-name": "node-pool-lh767-test-rolling-upgrade-5c70ccdf",
          "hypershift.openshift.io/nodePool": "e2e-clusters-4m25s/node-pool-lh767-test-rolling-upgrade",
          "sigs.k8s.io/cluster-api-provider-aws-last-applied-tags": "{\"expirationDate\":\"2026-03-03T16:48:41Z\",\"kubernetes.io/cluster/node-pool-lh767\":\"owned\",\"red-hat-clustertype\":\"rosa\",\"red-hat-managed\":\"true\"}",
          "sigs.k8s.io/cluster-api-provider-last-applied-tags-on-volumes": "{\"vol-05a7365b11d408e2b\":{\"expirationDate\":\"2026-03-03T16:48:41Z\",\"kubernetes.io/cluster/node-pool-lh767\":\"owned\",\"red-hat-clustertype\":\"rosa\",\"red-hat-managed\":\"true\"}}"
        }
      }
      {
        "level": "info",
        "ts": "2026-03-03T14:08:11+01:00",
        "msg": "Checking AWS Machine",
        "Name": "node-pool-lh767-test-rolling-upgrade-4xbgp-zl46p",
        "InstanceType": "m5.xlarge",
        "Annotations": {
          "cluster.x-k8s.io/cloned-from-groupkind": "AWSMachineTemplate.infrastructure.cluster.x-k8s.io",
          "cluster.x-k8s.io/cloned-from-name": "node-pool-lh767-test-rolling-upgrade-5c70ccdf",
          "hypershift.openshift.io/nodePool": "e2e-clusters-4m25s/node-pool-lh767-test-rolling-upgrade",
          "sigs.k8s.io/cluster-api-provider-aws-last-applied-tags": "{\"expirationDate\":\"2026-03-03T16:48:41Z\",\"kubernetes.io/cluster/node-pool-lh767\":\"owned\",\"red-hat-clustertype\":\"rosa\",\"red-hat-managed\":\"true\"}"
        }
      }
      {
        "level": "info",
        "ts": "2026-03-03T14:08:11+01:00",
        "msg": "Checking AWS Machine",
        "Name": "node-pool-lh767-test-rolling-upgrade-qkbsj-v4lgb",
        "InstanceType": "m5.large",
        "Annotations": {
          "cluster.x-k8s.io/cloned-from-groupkind": "AWSMachineTemplate.infrastructure.cluster.x-k8s.io",
          "cluster.x-k8s.io/cloned-from-name": "node-pool-lh767-test-rolling-upgrade-5bcb68aa",
          "hypershift.openshift.io/nodePool": "e2e-clusters-4m25s/node-pool-lh767-test-rolling-upgrade",
          "sigs.k8s.io/cluster-api-provider-aws-last-applied-tags": "{\"expirationDate\":\"2026-03-03T16:48:41Z\",\"kubernetes.io/cluster/node-pool-lh767\":\"owned\",\"red-hat-clustertype\":\"rosa\",\"red-hat-managed\":\"true\"}",
          "sigs.k8s.io/cluster-api-provider-last-applied-tags-on-volumes": "{\"vol-02bd3490c5e198a4c\":{\"expirationDate\":\"2026-03-03T16:48:41Z\",\"kubernetes.io/cluster/node-pool-lh767\":\"owned\",\"red-hat-clustertype\":\"rosa\",\"red-hat-managed\":\"true\"}}"
        }
      } 

              rh-ee-nbrubake Nolan Brubaker
              rh-ee-bclement Borja Clemente Castanera
              Yu Li Yu Li