-
Bug
-
Resolution: Unresolved
-
Normal
-
4.22
-
Low
Description of problem:
On CAPI 1.11 we have observed a delay in updating a MachineDeployment's replicas while performing a rolling update.
This causes the calculation of completeness for a machine deployment to briefly report complete when it actually is not.
We need to investigate what is causing this difference in behavior between CAPI 1.10 and 1.11.
Version-Release number of selected component (if applicable):
CAPA v2.10.0 / CAPI v1.11
How reproducible:
Always
Steps to Reproduce:
- Build Hypershift and the Hypershift operator with CAPI 1.11 (e.g. from the bump PR).
- Install Hypershift in a management cluster using the new operator image.
- Run the TestNodePool/HostedCluster0/Main/TestRollingUpgrade e2e test on the management cluster with the CAPI 1.11 operator.
Actual results:
MachineDeploymentComplete() flakes: it temporarily returns true while machines are still updating or deleting.
Expected results:
MachineDeploymentComplete() should return true only once the old machines are completely removed, as it does with CAPI v1.10.
Debug information
Logs of the flake in MachineDeploymentComplete():
{"level":"info","ts":"2026-03-03T13:08:09Z","msg":"Machine deployment completion:","name":"node-pool-lh767-test-rolling-upgrade","complete":false,"status.Replicas":2,"spec.Replicas":2,"status.AvailableReplicas":2,"status.UpdatedReplicass":1}
{"level":"info","ts":"2026-03-03T13:08:09Z","msg":"Machine deployment completion:","name":"node-pool-lh767-test-rolling-upgrade","complete":true,"status.Replicas":2,"spec.Replicas":2,"status.AvailableReplicas":2,"status.UpdatedReplicass":2}
{"level":"info","ts":"2026-03-03T13:08:09Z","msg":"Machine deployment completion:","name":"node-pool-lh767-test-rolling-upgrade","complete":true,"status.Replicas":2,"spec.Replicas":2,"status.AvailableReplicas":2,"status.UpdatedReplicass":2}
{"level":"info","ts":"2026-03-03T13:08:09Z","msg":"Machine deployment completion:","name":"node-pool-lh767-test-rolling-upgrade","complete":false,"status.Replicas":3,"spec.Replicas":2,"status.AvailableReplicas":2,"status.UpdatedReplicass":2}
Here we can see status.UpdatedReplicas going from 1 to 2 while status.Replicas stays at 2 (instead of increasing to 3) for a couple of reconcile cycles, so the deployment is reported as complete.
On the next reconcile cycle, status.Replicas goes up to 3 and complete flips back to false, which shows the flake.
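To make the sequence above easier to follow, here is a minimal Go sketch that replays the four logged snapshots through a completeness check. It assumes the check simply requires status.UpdatedReplicas, status.Replicas, and status.AvailableReplicas to all equal spec.Replicas. The snapshot type and isComplete function are hypothetical stand-ins for illustration, not the actual Hypershift or CAPI code.

```go
package main

import "fmt"

// snapshot holds the four replica counters printed in the log lines above.
// Hypothetical stand-in; this is not the CAPI MachineDeployment type.
type snapshot struct {
	SpecReplicas      int32
	StatusReplicas    int32
	AvailableReplicas int32
	UpdatedReplicas   int32
}

// isComplete assumes "complete" means every status counter equals spec.Replicas.
// The real MachineDeploymentComplete() may check more than this.
func isComplete(s snapshot) bool {
	return s.UpdatedReplicas == s.SpecReplicas &&
		s.StatusReplicas == s.SpecReplicas &&
		s.AvailableReplicas == s.SpecReplicas
}

func main() {
	// The four snapshots from the log, in order.
	observed := []snapshot{
		{SpecReplicas: 2, StatusReplicas: 2, AvailableReplicas: 2, UpdatedReplicas: 1},
		{SpecReplicas: 2, StatusReplicas: 2, AvailableReplicas: 2, UpdatedReplicas: 2},
		{SpecReplicas: 2, StatusReplicas: 2, AvailableReplicas: 2, UpdatedReplicas: 2},
		{SpecReplicas: 2, StatusReplicas: 3, AvailableReplicas: 2, UpdatedReplicas: 2},
	}
	for i, s := range observed {
		fmt.Printf("snapshot %d: complete=%v\n", i+1, isComplete(s))
	}
}
```

Under that assumption, the second and third snapshots evaluate to complete even though an old machine still exists, because status.Replicas has not yet been raised to 3. This is consistent with the suspicion that the delay lies in the status update rather than in the check itself.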
We can see from printing all machines at the end of the test that there are indeed 3 (2 updated to m5.xlarge, 1 deleting of type m5.large):
{
"level": "info",
"ts": "2026-03-03T14:08:11+01:00",
"msg": "Checking AWS Machine",
"Name": "node-pool-lh767-test-rolling-upgrade-4xbgp-bccqh",
"InstanceType": "m5.xlarge",
"Annotations": {
"cluster.x-k8s.io/cloned-from-groupkind": "AWSMachineTemplate.infrastructure.cluster.x-k8s.io",
"cluster.x-k8s.io/cloned-from-name": "node-pool-lh767-test-rolling-upgrade-5c70ccdf",
"hypershift.openshift.io/nodePool": "e2e-clusters-4m25s/node-pool-lh767-test-rolling-upgrade",
"sigs.k8s.io/cluster-api-provider-aws-last-applied-tags": "{\"expirationDate\":\"2026-03-03T16:48:41Z\",\"kubernetes.io/cluster/node-pool-lh767\":\"owned\",\"red-hat-clustertype\":\"rosa\",\"red-hat-managed\":\"true\"}",
"sigs.k8s.io/cluster-api-provider-last-applied-tags-on-volumes": "{\"vol-05a7365b11d408e2b\":{\"expirationDate\":\"2026-03-03T16:48:41Z\",\"kubernetes.io/cluster/node-pool-lh767\":\"owned\",\"red-hat-clustertype\":\"rosa\",\"red-hat-managed\":\"true\"}}"
}
}
{
"level": "info",
"ts": "2026-03-03T14:08:11+01:00",
"msg": "Checking AWS Machine",
"Name": "node-pool-lh767-test-rolling-upgrade-4xbgp-zl46p",
"InstanceType": "m5.xlarge",
"Annotations": {
"cluster.x-k8s.io/cloned-from-groupkind": "AWSMachineTemplate.infrastructure.cluster.x-k8s.io",
"cluster.x-k8s.io/cloned-from-name": "node-pool-lh767-test-rolling-upgrade-5c70ccdf",
"hypershift.openshift.io/nodePool": "e2e-clusters-4m25s/node-pool-lh767-test-rolling-upgrade",
"sigs.k8s.io/cluster-api-provider-aws-last-applied-tags": "{\"expirationDate\":\"2026-03-03T16:48:41Z\",\"kubernetes.io/cluster/node-pool-lh767\":\"owned\",\"red-hat-clustertype\":\"rosa\",\"red-hat-managed\":\"true\"}"
}
}
{
"level": "info",
"ts": "2026-03-03T14:08:11+01:00",
"msg": "Checking AWS Machine",
"Name": "node-pool-lh767-test-rolling-upgrade-qkbsj-v4lgb",
"InstanceType": "m5.large",
"Annotations": {
"cluster.x-k8s.io/cloned-from-groupkind": "AWSMachineTemplate.infrastructure.cluster.x-k8s.io",
"cluster.x-k8s.io/cloned-from-name": "node-pool-lh767-test-rolling-upgrade-5bcb68aa",
"hypershift.openshift.io/nodePool": "e2e-clusters-4m25s/node-pool-lh767-test-rolling-upgrade",
"sigs.k8s.io/cluster-api-provider-aws-last-applied-tags": "{\"expirationDate\":\"2026-03-03T16:48:41Z\",\"kubernetes.io/cluster/node-pool-lh767\":\"owned\",\"red-hat-clustertype\":\"rosa\",\"red-hat-managed\":\"true\"}",
"sigs.k8s.io/cluster-api-provider-last-applied-tags-on-volumes": "{\"vol-02bd3490c5e198a4c\":{\"expirationDate\":\"2026-03-03T16:48:41Z\",\"kubernetes.io/cluster/node-pool-lh767\":\"owned\",\"red-hat-clustertype\":\"rosa\",\"red-hat-managed\":\"true\"}}"
}
}
- relates to: OCPBUGS-77514 CAPI v1.11: Nodepool rolling update e2e test fails (POST)