OpenShift Bugs: OCPBUGS-77621

NodePool after backup/restore has condition ReachedIgnitionEndpoint=False


      Description of problem:

      Performing a backup/restore with Velero/OADP while keeping the guest cluster nodes alive results in the NodePool reporting the condition ReachedIgnitionEndpoint=False even though the Nodes are healthy. In addition, the NodePoolAutorepairEnabledConditionType condition is not set on the NodePool because it strictly depends on the ReachedIgnitionEndpoint condition being True. As a consequence, MachineHealthChecks are not created and the nodePool.Spec.Management.AutoRepair feature stops working.

      This is caused by the sequence of operations during restore:

      1) Velero restores a token secret (named e.g. token-mgencur-hc1-us-east-1a-a28def22)

      2) NodePool controller (Hypershift Operator) immediately deletes it because there is no NodePool restored yet

      3) Velero restores the NodePool

      4) NodePool controller creates a new Token secret that doesn't have the required annotation hypershift.openshift.io/ignition-reached: "True". 

      5) Since the annotation is now missing and the Nodes do not try to reach the ignition endpoint again (which would set the annotation), the NodePool controller doesn't set the ReachedIgnitionEndpoint condition properly.

      See the sequence in logs below.
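
One way to confirm steps 4 and 5 on a live cluster is to read the annotation directly from the recreated token Secret. This is a minimal sketch and not part of the original report: the function names `ignition_reached_value` and `classify_ignition_reached` are hypothetical helpers, and the namespace and Secret name in the usage comment are the ones from this report.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of HyperShift) for inspecting a token Secret.

# Print the value of the ignition-reached annotation.
# Prints nothing if the annotation is absent.
ignition_reached_value() {
  local namespace="$1" secret="$2"
  oc get secret "$secret" -n "$namespace" \
    -o jsonpath='{.metadata.annotations.hypershift\.openshift\.io/ignition-reached}'
}

# Map the annotation value to a short diagnosis.
classify_ignition_reached() {
  case "$1" in
    True) echo "annotation present" ;;
    "")   echo "annotation missing" ;;
    *)    echo "unexpected value: $1" ;;
  esac
}

# Usage (values taken from this report):
#   value=$(ignition_reached_value clusters-mgencur-hc1 token-mgencur-hc1-us-east-1a-a28def22)
#   classify_ignition_reached "$value"
```

On a cluster affected by this bug, the second function would be expected to report the annotation as missing for the recreated Secret.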

      Logs:

      Velero:
      time="2026-03-02T13:17:54Z" level=debug msg="Creating token-mgencur-hc1-us-east-1a-a28def22" groupResource=secrets logSource="/workspace/pkg/restore/restore.go:1513" namespace=clusters-mgencur-hc1 original name=token-mgencur-hc1-us-east-1a-a28def22 restore=openshift-adp/mgencur-hc1-clusters-f5zxfh
      
      Hypershift Operator:
      {"level":"info","ts":"2026-03-02T13:17:54Z","msg":"removing secret as nodePool is missing","controller":"secret","controllerGroup":"","controllerKind":"Secret","Secret":{"name":"token-mgencur-hc1-us-east-1a-a28def22","namespace":"clusters-mgencur-hc1"},"namespace":"clusters-mgencur-hc1","name":"token-mgencur-hc1-us-east-1a-a28def22","reconcileID":"d50e1f62-1c82-47d4-a367-c44a36526ebe","secret":"clusters-mgencur-hc1/token-mgencur-hc1-us-east-1a-a28def22","nodePool":"clusters/mgencur-hc1-us-east-1a"}
      
      Velero:
      time="2026-03-02T13:18:03Z" level=debug msg="Creating mgencur-hc1-us-east-1a" groupResource=nodepools.hypershift.openshift.io logSource="/workspace/pkg/restore/restore.go:1513" namespace=clusters original name=mgencur-hc1-us-east-1a restore=openshift-adp/mgencur-hc1-clusters-f5zxfh 

      Version-Release number of selected component (if applicable):

          4.21

      How reproducible:

          Always

      Steps to Reproduce:

          1. Run tests from https://github.com/openshift/hypershift/pull/7837 using "make test-backup-restore"
          2. Check NodePool conditions after restore
      
      Note: During the tests the Control Plane is shut down but the guest cluster keeps running, so the guest cluster Nodes are not restarted and do not try to reach the ignition endpoint again after the restore.
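
For step 2, the relevant NodePool conditions can be printed in one pass. This is a sketch under assumptions: `nodepool_condition` and `report_nodepool_conditions` are hypothetical helper names, and the condition types queried are the three named in this report.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of HyperShift) for reading NodePool conditions.

# Print the status of a single condition type; prints nothing if the
# condition is not present on the NodePool.
nodepool_condition() {
  local namespace="$1" nodepool="$2" cond="$3"
  oc get nodepools.hypershift.openshift.io "$nodepool" -n "$namespace" \
    -o jsonpath="{.status.conditions[?(@.type==\"$cond\")].status}"
}

# Print "Type=Status" for each condition discussed in this report.
report_nodepool_conditions() {
  local namespace="$1" nodepool="$2" cond status
  for cond in AllNodesHealthy ReachedIgnitionEndpoint AutorepairEnabled; do
    status=$(nodepool_condition "$namespace" "$nodepool" "$cond")
    echo "${cond}=${status:-<not set>}"
  done
}

# Usage (values taken from this report):
#   report_nodepool_conditions clusters mgencur-hc1-us-east-1a
```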

      Actual results:

      Token secret before restore:
      
      apiVersion: v1
      data:
        additional-trust-bundle-hash: ODExYzlkYzU=
        config: <redacted>
        hc-configuration-hash: NTQ2NWI4MjU=
        message: UGF5bG9hZCBnZW5lcmF0ZWQgc3VjY2Vzc2Z1bGx5
        pull-secret-hash: ZWZlNjg3Nzg=
        reason: QXNFeHBlY3RlZA==
        release: cXVheS5pby9vcGVuc2hpZnQtcmVsZWFzZS1kZXYvb2NwLXJlbGVhc2UtbmlnaHRseUBzaGEyNTY6ZWMwODI0YzYwYTE0NjBkY2FhMmI0ZjJjMjYxNTk0MTE1YjBkYzMxYTRiNmIyYTI4NmJiOGYxYWU0Y2JhMzJlZA==
        token: ZTM5YzJmY2UtNjNkNC00OGM3LWE2NDctNTQ2Njk3ZjQ4Yzdm
      immutable: false
      kind: Secret
      metadata:
        annotations:
          hypershift.openshift.io/ignition-config: "true"
          hypershift.openshift.io/ignition-reached: "True"
          hypershift.openshift.io/last-token-generation-time: "2026-03-02T12:47:28.171028131Z"
          hypershift.openshift.io/node-pool-upgrade-type: Replace
          hypershift.openshift.io/nodePool: clusters/mgencur-hc1-us-east-1a
        creationTimestamp: "2026-03-02T12:47:28Z"
        name: token-mgencur-hc1-us-east-1a-a28def22
        namespace: clusters-mgencur-hc1
        resourceVersion: "171215"
        uid: 608a2c40-0c73-4c90-93ac-deff4488ecc6
      type: Opaque
      
      Token secret after restore:
      apiVersion: v1
      data:
        additional-trust-bundle-hash: ODExYzlkYzU=
        config: <redacted>
        hc-configuration-hash: NTQ2NWI4MjU=
        pull-secret-hash: ZWZlNjg3Nzg=
        release: cXVheS5pby9vcGVuc2hpZnQtcmVsZWFzZS1kZXYvb2NwLXJlbGVhc2UtbmlnaHRseUBzaGEyNTY6ZWMwODI0YzYwYTE0NjBkY2FhMmI0ZjJjMjYxNTk0MTE1YjBkYzMxYTRiNmIyYTI4NmJiOGYxYWU0Y2JhMzJlZA==
        token: YTI5YTAzMDktNGYyNy00NThhLThhODctMGY1NTVhZmQ0Yzll
      immutable: false
      kind: Secret
      metadata:
        annotations:
          hypershift.openshift.io/ignition-config: "true"
          hypershift.openshift.io/last-token-generation-time: "2026-03-02T13:18:03.322348332Z"
          hypershift.openshift.io/node-pool-upgrade-type: Replace
          hypershift.openshift.io/nodePool: clusters/mgencur-hc1-us-east-1a
        creationTimestamp: "2026-03-02T13:18:03Z"
        name: token-mgencur-hc1-us-east-1a-a28def22
        namespace: clusters-mgencur-hc1
        resourceVersion: "184428"
        uid: ee04c0de-f11b-4267-a5b6-4f5d0a2523e2
      type: Opaque

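      Note that the hypershift.openshift.io/ignition-reached annotation is missing and that last-token-generation-time has a new timestamp. The message and reason data keys are also gone, and the token value, uid, and creationTimestamp differ, which confirms that the Secret was recreated rather than restored.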
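
The `data` values in both Secrets are base64-encoded, so the two keys that exist only before the restore can be decoded to see what was lost. A minimal sketch; `decode_secret_value` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: decode one base64-encoded Secret data value.
decode_secret_value() {
  printf '%s' "$1" | base64 --decode
}

# Values copied from the "before restore" Secret in this report:
#   decode_secret_value UGF5bG9hZCBnZW5lcmF0ZWQgc3VjY2Vzc2Z1bGx5   # message
#   decode_secret_value QXNFeHBlY3RlZA==                           # reason
```

The message value decodes to "Payload generated successfully" and the reason value to "AsExpected".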

      Expected results:

          The token secret is the same as before the restore. Velero/OADP should restore it, and it should keep its annotations.

      Additional info:

          The workaround is to annotate the Secret manually after the restore, once the NodePool reports AllNodesHealthy:
      secret=$(oc get secret -n <control-plane-namespace> -oname | grep token-${nodepool_name})
      oc annotate $secret -n <control-plane-namespace> hypershift.openshift.io/ignition-reached="True"
      
      This also makes the AutorepairEnabled condition appear on the NodePool.
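
The manual workaround can be wrapped so that it refuses to annotate unless the NodePool is actually healthy. This is a sketch under assumptions, not a supported tool: `annotate_ignition_reached` is a hypothetical function name, and the `--overwrite` flag is added here so that re-running it is harmless.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the manual workaround.
# Arguments: NodePool namespace, control plane namespace, NodePool name.
# Annotates every matching token Secret, but only if AllNodesHealthy is True.
annotate_ignition_reached() {
  local np_namespace="$1" cp_namespace="$2" nodepool="$3" healthy secret

  healthy=$(oc get nodepools.hypershift.openshift.io "$nodepool" -n "$np_namespace" \
    -o jsonpath='{.status.conditions[?(@.type=="AllNodesHealthy")].status}')
  if [ "$healthy" != "True" ]; then
    echo "NodePool $nodepool is not AllNodesHealthy (status: ${healthy:-unknown})" >&2
    return 1
  fi

  # Same Secret lookup as the manual workaround: match on the token-<nodepool> prefix.
  oc get secret -n "$cp_namespace" -o name | grep "token-${nodepool}" |
    while read -r secret; do
      oc annotate "$secret" -n "$cp_namespace" --overwrite \
        hypershift.openshift.io/ignition-reached=True
    done
}

# Usage (values taken from this report):
#   annotate_ignition_reached clusters clusters-mgencur-hc1 mgencur-hc1-us-east-1a
```

Checking the health condition first matters because the annotation asserts that the Nodes reached the ignition endpoint; setting it on a NodePool whose Nodes are not healthy would hide a real problem.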

              jparrill@redhat.com Juan Manuel Parrilla Madrid
              mgencur@redhat.com Martin Gencur