Type: Bug
Resolution: Unresolved
Priority: Major
Fix Version/s: None
Affects Version/s: 4.21.0
Component/s: HyperShift
Labels:
- no_core_payload
- triaged

Activity Type:
None
Blocked:
False
Blocked Reason:

None
Story Points:
None
Severity:
Important
Regression:
None

Target Backport Versions:
None
Target Version:

4.22.0
Release Blocker:
None
Sprint:
None

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Impact Score:

Release Note Status:
None
Release Note Type:
None
Release Note Text:
None

Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

Description of problem:

Performing a backup/restore using Velero/OADP while keeping the guest cluster nodes alive results in the NodePool reporting the condition ReachedIgnitionEndpoint=False even though the Nodes are healthy. In addition, the NodePoolAutorepairEnabledConditionType condition is not set on the NodePool, because it strictly depends on the ReachedIgnitionEndpoint condition being True. As a consequence, MachineHealthChecks are not created and the nodePool.Spec.Management.AutoRepair feature stops working.
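
The dependency chain described above can be illustrated with a small model. This is a simplified sketch in Python, not the HyperShift controller code (which is written in Go): the function name and the returned keys are hypothetical, and only the gating relationship is taken from this report.

```python
from typing import Dict


def autorepair_outcome(auto_repair_requested: bool,
                       reached_ignition_endpoint: bool) -> Dict[str, bool]:
    """Hypothetical model of the gating described in this report.

    The AutorepairEnabled condition is only set when ReachedIgnitionEndpoint
    is True, and MachineHealthChecks are only created in that case.
    """
    enabled = auto_repair_requested and reached_ignition_endpoint
    return {
        "autorepair_condition_set": enabled,
        "machine_health_check_created": enabled,
    }


# State after the restore: AutoRepair is requested in the NodePool spec,
# but the NodePool reports ReachedIgnitionEndpoint=False.
print(autorepair_outcome(auto_repair_requested=True,
                         reached_ignition_endpoint=False))
```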

This is caused by the sequence of operations during restore:

1) Velero restores a token secret (named e.g. token-mgencur-hc1-us-east-1a-a28def22).

2) The NodePool controller (HyperShift Operator) immediately deletes it because the NodePool has not been restored yet.

3) Velero restores the NodePool.

4) The NodePool controller creates a new token secret that doesn't have the required annotation hypershift.openshift.io/ignition-reached: "True".

5) Since the annotation is now missing and the Nodes do not try to reach the ignition endpoint again (which would set the annotation), the NodePool controller never sets the ReachedIgnitionEndpoint condition to True.
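
The five steps above can be replayed with a tiny in-memory model. This is only a sketch of the ordering problem, not the operator's implementation: the class and method names are hypothetical, and the secret is reduced to its annotations.

```python
from typing import Dict, Optional

IGNITION_REACHED = "hypershift.openshift.io/ignition-reached"
IGNITION_CONFIG = "hypershift.openshift.io/ignition-config"


class FakeManagementCluster:
    """Illustrative stand-in for the management cluster (hypothetical)."""

    def __init__(self) -> None:
        self.nodepool_exists = False
        self.token_secret_annotations: Optional[Dict[str, str]] = None

    def restore_token_secret(self, annotations: Dict[str, str]) -> None:
        # Step 1: Velero restores the token secret from the backup.
        self.token_secret_annotations = dict(annotations)
        # Step 2: the secret is removed because the NodePool is missing.
        if not self.nodepool_exists:
            self.token_secret_annotations = None

    def restore_nodepool(self) -> None:
        # Step 3: Velero restores the NodePool.
        self.nodepool_exists = True
        # Step 4: a new token secret is created without ignition-reached
        # (other annotations omitted for brevity).
        if self.token_secret_annotations is None:
            self.token_secret_annotations = {IGNITION_CONFIG: "true"}

    def reached_ignition_endpoint(self) -> bool:
        # Step 5: the condition follows the annotation; the running Nodes
        # never contact the ignition endpoint again, so nothing sets it.
        annotations = self.token_secret_annotations or {}
        return annotations.get(IGNITION_REACHED) == "True"


def simulate_restore() -> FakeManagementCluster:
    cluster = FakeManagementCluster()
    cluster.restore_token_secret({IGNITION_CONFIG: "true", IGNITION_REACHED: "True"})
    cluster.restore_nodepool()
    return cluster


cluster = simulate_restore()
print("token secret annotations:", cluster.token_secret_annotations)
print("ReachedIgnitionEndpoint:", cluster.reached_ignition_endpoint())
```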

See the sequence in logs below.

Logs:

Velero:
time="2026-03-02T13:17:54Z" level=debug msg="Creating token-mgencur-hc1-us-east-1a-a28def22" groupResource=secrets logSource="/workspace/pkg/restore/restore.go:1513" namespace=clusters-mgencur-hc1 original name=token-mgencur-hc1-us-east-1a-a28def22 restore=openshift-adp/mgencur-hc1-clusters-f5zxfh

Hypershift Operator:
{"level":"info","ts":"2026-03-02T13:17:54Z","msg":"removing secret as nodePool is missing","controller":"secret","controllerGroup":"","controllerKind":"Secret","Secret":{"name":"token-mgencur-hc1-us-east-1a-a28def22","namespace":"clusters-mgencur-hc1"},"namespace":"clusters-mgencur-hc1","name":"token-mgencur-hc1-us-east-1a-a28def22","reconcileID":"d50e1f62-1c82-47d4-a367-c44a36526ebe","secret":"clusters-mgencur-hc1/token-mgencur-hc1-us-east-1a-a28def22","nodePool":"clusters/mgencur-hc1-us-east-1a"}

Velero:
time="2026-03-02T13:18:03Z" level=debug msg="Creating mgencur-hc1-us-east-1a" groupResource=nodepools.hypershift.openshift.io logSource="/workspace/pkg/restore/restore.go:1513" namespace=clusters original name=mgencur-hc1-us-east-1a restore=openshift-adp/mgencur-hc1-clusters-f5zxfh
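
To confirm the ordering from the logs programmatically, the timestamps can be extracted and compared. The sketch below uses abbreviated copies of the three log lines above (fields not needed for the comparison are dropped); the helper names are hypothetical.

```python
import json
import re
from datetime import datetime, timezone
from typing import Tuple

# Abbreviated copies of the log lines quoted in this report.
VELERO_SECRET = ('time="2026-03-02T13:17:54Z" level=debug '
                 'msg="Creating token-mgencur-hc1-us-east-1a-a28def22" '
                 'groupResource=secrets namespace=clusters-mgencur-hc1')
OPERATOR_DELETE = ('{"level":"info","ts":"2026-03-02T13:17:54Z",'
                   '"msg":"removing secret as nodePool is missing",'
                   '"controller":"secret"}')
VELERO_NODEPOOL = ('time="2026-03-02T13:18:03Z" level=debug '
                   'msg="Creating mgencur-hc1-us-east-1a" '
                   'groupResource=nodepools.hypershift.openshift.io namespace=clusters')


def parse_ts(value: str) -> datetime:
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def velero_event(line: str) -> Tuple[datetime, str]:
    # Velero logs use key=value pairs; values may be double-quoted.
    fields = dict(re.findall(r'(\w+)=("[^"]*"|\S+)', line))
    return parse_ts(fields["time"].strip('"')), fields["msg"].strip('"')


def operator_event(line: str) -> Tuple[datetime, str]:
    # The HyperShift Operator line is a JSON object.
    record = json.loads(line)
    return parse_ts(record["ts"]), record["msg"]


events = [
    velero_event(VELERO_SECRET),
    operator_event(OPERATOR_DELETE),
    velero_event(VELERO_NODEPOOL),
]
for ts, msg in events:
    print(ts.isoformat(), msg)

# Time between the secret being deleted and the NodePool being restored.
gap = events[2][0] - events[1][0]
print("NodePool restored", gap.total_seconds(), "seconds after the secret was deleted")
```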

Version-Release number of selected component (if applicable):

    4.21

How reproducible:

    Always

Steps to Reproduce:

    1. Run tests from https://github.com/openshift/hypershift/pull/7837 using "make test-backup-restore"
    2. Check NodePool conditions after restore

Note: During the tests the Control Plane is shut down but the guest cluster keeps running, so the guest cluster Nodes are not restarted and do not try to reach the ignition endpoint again after the restore.
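
Step 2 can be scripted by reading the NodePool as JSON (for example from `oc get nodepool <name> -n clusters -o json`) and listing the relevant conditions. The sketch below is an assumption-laden example: the sample document is hypothetical and trimmed to `status.conditions`, and the helper name is made up for illustration.

```python
import json
from typing import Dict


def condition_statuses(nodepool_json: str) -> Dict[str, str]:
    """Map condition type to status for a NodePool given as a JSON string."""
    nodepool = json.loads(nodepool_json)
    conditions = nodepool.get("status", {}).get("conditions", [])
    return {c["type"]: c["status"] for c in conditions}


# Hypothetical, trimmed NodePool showing the state described in this report.
SAMPLE_NODEPOOL = json.dumps({
    "kind": "NodePool",
    "metadata": {"name": "mgencur-hc1-us-east-1a", "namespace": "clusters"},
    "status": {
        "conditions": [
            {"type": "AllNodesHealthy", "status": "True"},
            {"type": "ReachedIgnitionEndpoint", "status": "False"},
        ]
    },
})

statuses = condition_statuses(SAMPLE_NODEPOOL)
for wanted in ("AllNodesHealthy", "ReachedIgnitionEndpoint", "AutorepairEnabled"):
    print(wanted, "=", statuses.get(wanted, "<not set>"))
```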

Actual results:

Token secret before restore:

apiVersion: v1
data:
  additional-trust-bundle-hash: ODExYzlkYzU=
  config: <redacted>
  hc-configuration-hash: NTQ2NWI4MjU=
  message: UGF5bG9hZCBnZW5lcmF0ZWQgc3VjY2Vzc2Z1bGx5
  pull-secret-hash: ZWZlNjg3Nzg=
  reason: QXNFeHBlY3RlZA==
  release: cXVheS5pby9vcGVuc2hpZnQtcmVsZWFzZS1kZXYvb2NwLXJlbGVhc2UtbmlnaHRseUBzaGEyNTY6ZWMwODI0YzYwYTE0NjBkY2FhMmI0ZjJjMjYxNTk0MTE1YjBkYzMxYTRiNmIyYTI4NmJiOGYxYWU0Y2JhMzJlZA==
  token: ZTM5YzJmY2UtNjNkNC00OGM3LWE2NDctNTQ2Njk3ZjQ4Yzdm
immutable: false
kind: Secret
metadata:
  annotations:
    hypershift.openshift.io/ignition-config: "true"
    hypershift.openshift.io/ignition-reached: "True"
    hypershift.openshift.io/last-token-generation-time: "2026-03-02T12:47:28.171028131Z"
    hypershift.openshift.io/node-pool-upgrade-type: Replace
    hypershift.openshift.io/nodePool: clusters/mgencur-hc1-us-east-1a
  creationTimestamp: "2026-03-02T12:47:28Z"
  name: token-mgencur-hc1-us-east-1a-a28def22
  namespace: clusters-mgencur-hc1
  resourceVersion: "171215"
  uid: 608a2c40-0c73-4c90-93ac-deff4488ecc6
type: Opaque

Token secret after restore:
apiVersion: v1
data:
  additional-trust-bundle-hash: ODExYzlkYzU=
  config: <redacted>
  hc-configuration-hash: NTQ2NWI4MjU=
  pull-secret-hash: ZWZlNjg3Nzg=
  release: cXVheS5pby9vcGVuc2hpZnQtcmVsZWFzZS1kZXYvb2NwLXJlbGVhc2UtbmlnaHRseUBzaGEyNTY6ZWMwODI0YzYwYTE0NjBkY2FhMmI0ZjJjMjYxNTk0MTE1YjBkYzMxYTRiNmIyYTI4NmJiOGYxYWU0Y2JhMzJlZA==
  token: YTI5YTAzMDktNGYyNy00NThhLThhODctMGY1NTVhZmQ0Yzll
immutable: false
kind: Secret
metadata:
  annotations:
    hypershift.openshift.io/ignition-config: "true"
    hypershift.openshift.io/last-token-generation-time: "2026-03-02T13:18:03.322348332Z"
    hypershift.openshift.io/node-pool-upgrade-type: Replace
    hypershift.openshift.io/nodePool: clusters/mgencur-hc1-us-east-1a
  creationTimestamp: "2026-03-02T13:18:03Z"
  name: token-mgencur-hc1-us-east-1a-a28def22
  namespace: clusters-mgencur-hc1
  resourceVersion: "184428"
  uid: ee04c0de-f11b-4267-a5b6-4f5d0a2523e2
type: Opaque

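Note that the hypershift.openshift.io/ignition-reached annotation is missing and that last-token-generation-time has a new timestamp. The message and reason data keys are also absent, and the token, uid and creationTimestamp differ, which confirms that this is a newly created Secret rather than the restored one.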
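
The comparison of the two Secrets can be reproduced mechanically. The sketch below copies the annotations and data key names from the YAML above and reports what is missing or changed; it also decodes the two base64 data values that exist only in the original Secret.

```python
import base64

PREFIX = "hypershift.openshift.io/"

BEFORE_ANNOTATIONS = {
    PREFIX + "ignition-config": "true",
    PREFIX + "ignition-reached": "True",
    PREFIX + "last-token-generation-time": "2026-03-02T12:47:28.171028131Z",
    PREFIX + "node-pool-upgrade-type": "Replace",
    PREFIX + "nodePool": "clusters/mgencur-hc1-us-east-1a",
}
AFTER_ANNOTATIONS = {
    PREFIX + "ignition-config": "true",
    PREFIX + "last-token-generation-time": "2026-03-02T13:18:03.322348332Z",
    PREFIX + "node-pool-upgrade-type": "Replace",
    PREFIX + "nodePool": "clusters/mgencur-hc1-us-east-1a",
}
BEFORE_DATA_KEYS = {"additional-trust-bundle-hash", "config", "hc-configuration-hash",
                    "message", "pull-secret-hash", "reason", "release", "token"}
AFTER_DATA_KEYS = {"additional-trust-bundle-hash", "config", "hc-configuration-hash",
                   "pull-secret-hash", "release", "token"}

# Annotations present before the restore but absent afterwards.
missing_annotations = sorted(set(BEFORE_ANNOTATIONS) - set(AFTER_ANNOTATIONS))
# Annotations present in both Secrets whose value changed.
changed_annotations = sorted(
    key for key in BEFORE_ANNOTATIONS.keys() & AFTER_ANNOTATIONS.keys()
    if BEFORE_ANNOTATIONS[key] != AFTER_ANNOTATIONS[key]
)
missing_data_keys = sorted(BEFORE_DATA_KEYS - AFTER_DATA_KEYS)

# Base64 values of the two data keys that only the original Secret has.
decoded = {
    "message": base64.b64decode("UGF5bG9hZCBnZW5lcmF0ZWQgc3VjY2Vzc2Z1bGx5").decode(),
    "reason": base64.b64decode("QXNFeHBlY3RlZA==").decode(),
}

print("missing annotations:", missing_annotations)
print("changed annotations:", changed_annotations)
print("missing data keys:", missing_data_keys)
print("decoded values:", decoded)
```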

Expected results:

    The token secret is the same as before the restore. Velero/OADP should restore it, and it should keep its annotations.

Additional info:

    The workaround is to annotate the Secret manually after the restore, once the NodePool reports AllNodesHealthy:
secret=$(oc get secret -n <control-plane-namespace> -oname | grep "token-${nodepool_name}")
oc annotate "$secret" -n <control-plane-namespace> hypershift.openshift.io/ignition-reached="True"

This also makes the AutorepairEnabled condition appear on the NodePool.
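
For automation, the two workaround commands can be wrapped in a small script. The following is a minimal sketch, assuming `oc` is on the PATH and already logged in to the management cluster; the function names and the sample `oc get` output are hypothetical. It filters on the `secret/token-<nodepool>` prefix, which is slightly stricter than the `grep` in the commands above.

```python
import subprocess
from typing import List

ANNOTATION = "hypershift.openshift.io/ignition-reached"


def token_secrets(oc_names_output: str, nodepool_name: str) -> List[str]:
    """Pick token secrets of a NodePool from `oc get secret -o name` output."""
    prefix = f"secret/token-{nodepool_name}"
    return [line.strip() for line in oc_names_output.splitlines()
            if line.strip().startswith(prefix)]


def annotate_command(secret: str, namespace: str) -> List[str]:
    """Build the `oc annotate` command line for one secret."""
    return ["oc", "annotate", secret, "-n", namespace, f"{ANNOTATION}=True"]


def apply_workaround(namespace: str, nodepool_name: str) -> None:
    """Run the workaround against a live cluster (requires `oc`)."""
    listing = subprocess.run(
        ["oc", "get", "secret", "-n", namespace, "-o", "name"],
        check=True, capture_output=True, text=True,
    ).stdout
    for secret in token_secrets(listing, nodepool_name):
        subprocess.run(annotate_command(secret, namespace), check=True)


# Offline demonstration with hypothetical `oc get secret -o name` output;
# nothing is executed against a cluster here.
SAMPLE_OUTPUT = "secret/pull-secret\nsecret/token-mgencur-hc1-us-east-1a-a28def22\n"
for name in token_secrets(SAMPLE_OUTPUT, "mgencur-hc1-us-east-1a"):
    print(" ".join(annotate_command(name, "clusters-mgencur-hc1")))
```

Call `apply_workaround("clusters-mgencur-hc1", "mgencur-hc1-us-east-1a")` only after the NodePool reports AllNodesHealthy, as noted above.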

Links to:

openshift/hypershift#7851: OCPBUGS-77621: fix(nodepool): preserve ignition-reached annotation on token secret after restore

Assignee: Juan Manuel Parrilla Madrid

Reporter: Martin Gencur

QA Contact: Martin Gencur

Need Info From: None

Votes: 0

Watchers: 3

Created: 2026/03/03 5:54 AM

Updated: 2026/03/06 12:58 AM
