-
Bug
-
Resolution: Unresolved
-
Major
-
None
-
4.21.0
Description of problem:
Performing a backup and restore with Velero/OADP while the guest cluster nodes stay alive leaves the NodePool reporting the condition ReachedIgnitionEndpoint=False even though the Nodes are healthy. In addition, the NodePoolAutorepairEnabledConditionType condition is not set on the NodePool, because it strictly depends on ReachedIgnitionEndpoint being True. As a consequence, MachineHealthChecks are not created and the nodePool.Spec.Management.AutoRepair feature stops working.
This is caused by the sequence of operations during restore:
1) Velero restores a token secret (named e.g. token-mgencur-hc1-us-east-1a-a28def22).
2) The NodePool controller (Hypershift Operator) immediately deletes it because the NodePool has not been restored yet.
3) Velero restores the NodePool.
4) The NodePool controller creates a new token secret that doesn't have the required annotation hypershift.openshift.io/ignition-reached: "True".
5) Since the annotation is now missing and the Nodes do not try to reach the ignition endpoint again (which would set the annotation), the NodePool controller never sets the ReachedIgnitionEndpoint condition to True.
See the sequence in the logs below.
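To confirm that step 4 has happened on a restored cluster, the annotation on the token secret can be inspected directly. The following is a minimal sketch, not part of the original report: the function name ignition_reached_annotation is hypothetical, and it assumes oc is logged in to the management cluster.

```shell
# Hypothetical helper (not from the report): prints the value of the
# hypershift.openshift.io/ignition-reached annotation on a token Secret,
# or "<missing>" when the annotation is absent.
# Arguments: $1 = control plane namespace, $2 = Secret name.
ignition_reached_annotation() {
  local namespace="$1"
  local secret_name="$2"
  local value
  # Dots inside the annotation key must be escaped in kubectl/oc JSONPath.
  value=$(oc get secret "$secret_name" -n "$namespace" \
    -o jsonpath='{.metadata.annotations.hypershift\.openshift\.io/ignition-reached}') || return 1
  echo "${value:-<missing>}"
}

# Example call, using the names from this report:
#   ignition_reached_annotation clusters-mgencur-hc1 token-mgencur-hc1-us-east-1a-a28def22
```

Based on the secrets shown under "Actual results", this would be expected to print True before the restore and <missing> afterwards.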
Logs:
Velero:
time="2026-03-02T13:17:54Z" level=debug msg="Creating token-mgencur-hc1-us-east-1a-a28def22" groupResource=secrets logSource="/workspace/pkg/restore/restore.go:1513" namespace=clusters-mgencur-hc1 original name=token-mgencur-hc1-us-east-1a-a28def22 restore=openshift-adp/mgencur-hc1-clusters-f5zxfh
Hypershift Operator:
{"level":"info","ts":"2026-03-02T13:17:54Z","msg":"removing secret as nodePool is missing","controller":"secret","controllerGroup":"","controllerKind":"Secret","Secret":{"name":"token-mgencur-hc1-us-east-1a-a28def22","namespace":"clusters-mgencur-hc1"},"namespace":"clusters-mgencur-hc1","name":"token-mgencur-hc1-us-east-1a-a28def22","reconcileID":"d50e1f62-1c82-47d4-a367-c44a36526ebe","secret":"clusters-mgencur-hc1/token-mgencur-hc1-us-east-1a-a28def22","nodePool":"clusters/mgencur-hc1-us-east-1a"}
Velero:
time="2026-03-02T13:18:03Z" level=debug msg="Creating mgencur-hc1-us-east-1a" groupResource=nodepools.hypershift.openshift.io logSource="/workspace/pkg/restore/restore.go:1513" namespace=clusters original name=mgencur-hc1-us-east-1a restore=openshift-adp/mgencur-hc1-clusters-f5zxfh
Version-Release number of selected component (if applicable):
4.21
How reproducible:
Always
Steps to Reproduce:
1. Run tests from https://github.com/openshift/hypershift/pull/7837 using "make test-backup-restore"
2. Check NodePool conditions after restore
Note: During the tests the Control Plane is shut down but the guest cluster keeps running, so the guest cluster Nodes are not restarted and do not try to reach the ignition endpoint again after the restore.
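Step 2 can be done with a JSONPath filter on the NodePool status. This is a sketch under assumptions: the function name nodepool_condition_status is hypothetical, and the NodePool is assumed to live in the clusters namespace, as the logs in this report suggest.

```shell
# Hypothetical helper (not from the report): prints the status ("True",
# "False", ...) of one NodePool condition, or nothing if the condition
# is not present on the NodePool.
# Arguments: $1 = NodePool namespace, $2 = NodePool name, $3 = condition type.
nodepool_condition_status() {
  local namespace="$1"
  local nodepool="$2"
  local condition="$3"
  oc get nodepool "$nodepool" -n "$namespace" \
    -o jsonpath="{.status.conditions[?(@.type==\"${condition}\")].status}"
}

# Example calls, using the names from this report:
#   nodepool_condition_status clusters mgencur-hc1-us-east-1a ReachedIgnitionEndpoint
#   nodepool_condition_status clusters mgencur-hc1-us-east-1a AllNodesHealthy
#   nodepool_condition_status clusters mgencur-hc1-us-east-1a AutorepairEnabled
```

With the bug present, the first call would be expected to print False and the last one to print nothing, since that condition is never set.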
Actual results:
Token secret before restore:
apiVersion: v1
data:
  additional-trust-bundle-hash: ODExYzlkYzU=
  config: <redacted>
  hc-configuration-hash: NTQ2NWI4MjU=
  message: UGF5bG9hZCBnZW5lcmF0ZWQgc3VjY2Vzc2Z1bGx5
  pull-secret-hash: ZWZlNjg3Nzg=
  reason: QXNFeHBlY3RlZA==
  release: cXVheS5pby9vcGVuc2hpZnQtcmVsZWFzZS1kZXYvb2NwLXJlbGVhc2UtbmlnaHRseUBzaGEyNTY6ZWMwODI0YzYwYTE0NjBkY2FhMmI0ZjJjMjYxNTk0MTE1YjBkYzMxYTRiNmIyYTI4NmJiOGYxYWU0Y2JhMzJlZA==
  token: ZTM5YzJmY2UtNjNkNC00OGM3LWE2NDctNTQ2Njk3ZjQ4Yzdm
immutable: false
kind: Secret
metadata:
  annotations:
    hypershift.openshift.io/ignition-config: "true"
    hypershift.openshift.io/ignition-reached: "True"
    hypershift.openshift.io/last-token-generation-time: "2026-03-02T12:47:28.171028131Z"
    hypershift.openshift.io/node-pool-upgrade-type: Replace
    hypershift.openshift.io/nodePool: clusters/mgencur-hc1-us-east-1a
  creationTimestamp: "2026-03-02T12:47:28Z"
  name: token-mgencur-hc1-us-east-1a-a28def22
  namespace: clusters-mgencur-hc1
  resourceVersion: "171215"
  uid: 608a2c40-0c73-4c90-93ac-deff4488ecc6
type: Opaque

Token secret after restore:
apiVersion: v1
data:
  additional-trust-bundle-hash: ODExYzlkYzU=
  config: <redacted>
  hc-configuration-hash: NTQ2NWI4MjU=
  pull-secret-hash: ZWZlNjg3Nzg=
  release: cXVheS5pby9vcGVuc2hpZnQtcmVsZWFzZS1kZXYvb2NwLXJlbGVhc2UtbmlnaHRseUBzaGEyNTY6ZWMwODI0YzYwYTE0NjBkY2FhMmI0ZjJjMjYxNTk0MTE1YjBkYzMxYTRiNmIyYTI4NmJiOGYxYWU0Y2JhMzJlZA==
  token: YTI5YTAzMDktNGYyNy00NThhLThhODctMGY1NTVhZmQ0Yzll
immutable: false
kind: Secret
metadata:
  annotations:
    hypershift.openshift.io/ignition-config: "true"
    hypershift.openshift.io/last-token-generation-time: "2026-03-02T13:18:03.322348332Z"
    hypershift.openshift.io/node-pool-upgrade-type: Replace
    hypershift.openshift.io/nodePool: clusters/mgencur-hc1-us-east-1a
  creationTimestamp: "2026-03-02T13:18:03Z"
  name: token-mgencur-hc1-us-east-1a-a28def22
  namespace: clusters-mgencur-hc1
  resourceVersion: "184428"
  uid: ee04c0de-f11b-4267-a5b6-4f5d0a2523e2
type: Opaque
Note the missing hypershift.openshift.io/ignition-reached annotation and the new timestamp in last-token-generation-time.
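Besides the annotation, the secret after restore also appears to lack the message and reason data keys that the original secret carried. Decoding the base64 values from the "before" secret shows what they held. The helper name decode_secret_value is hypothetical, and the sketch assumes a base64 implementation that accepts -d (as GNU coreutils does).

```shell
# Hypothetical helper (not from the report): decodes one base64-encoded
# Secret data value passed as the first argument.
decode_secret_value() {
  printf '%s' "$1" | base64 -d
}

# Values copied from the "message" and "reason" keys of the secret before restore.
decode_secret_value 'UGF5bG9hZCBnZW5lcmF0ZWQgc3VjY2Vzc2Z1bGx5'; echo   # Payload generated successfully
decode_secret_value 'QXNFeHBlY3RlZA=='; echo                           # AsExpected
```

Whether the absence of these two keys matters for the NodePool conditions is not established by this report; the sketch only makes the difference between the two secrets easier to read.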
Expected results:
The token secret is the same as before the restore. Velero/OADP should restore it, and it should keep its annotations.
Additional info:
The workaround is to annotate the Secret manually after the restore, once the NodePool reports AllNodesHealthy:
secret=$(oc get secret -n <control-plane-namespace> -o name | grep "token-${nodepool_name}")
oc annotate "$secret" -n <control-plane-namespace> hypershift.openshift.io/ignition-reached="True"
This also makes the AutorepairEnabled condition appear on the NodePool.
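The workaround can be wrapped so that the annotation is only applied when the NodePool actually reports AllNodesHealthy=True. This is a minimal sketch, not a supported procedure: the function name annotate_token_secret_if_healthy is hypothetical, and it assumes exactly one token secret matches the NodePool name.

```shell
# Hypothetical wrapper (not from the report) around the manual workaround.
# Arguments: $1 = NodePool namespace, $2 = NodePool name,
#            $3 = control plane namespace.
annotate_token_secret_if_healthy() {
  local nodepool_namespace="$1"
  local nodepool_name="$2"
  local control_plane_namespace="$3"
  local healthy secret

  # Only proceed when the NodePool reports AllNodesHealthy=True.
  healthy=$(oc get nodepool "$nodepool_name" -n "$nodepool_namespace" \
    -o jsonpath='{.status.conditions[?(@.type=="AllNodesHealthy")].status}') || return 1
  if [ "$healthy" != "True" ]; then
    echo "NodePool ${nodepool_name}: AllNodesHealthy is '${healthy:-unset}', skipping" >&2
    return 1
  fi

  # Assumption: exactly one token secret matches this NodePool name.
  secret=$(oc get secret -n "$control_plane_namespace" -o name \
    | grep "token-${nodepool_name}") || return 1

  oc annotate "$secret" -n "$control_plane_namespace" --overwrite \
    hypershift.openshift.io/ignition-reached="True"
}

# Example call, using the names from this report:
#   annotate_token_secret_if_healthy clusters mgencur-hc1-us-east-1a clusters-mgencur-hc1
```

The health check guards against marking ignition as reached for Nodes that never actually joined; --overwrite only makes the command safe to re-run.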