OpenShift Autoscaling / AUTOSCALE-363

HCP token rotation causes unnecessary drift


    • Epic
    • Resolution: Unresolved
    • Normal
    • AutoNode
    • AutoNode token rotation drift
    • Product / Portfolio Work
    • In Progress
    • OCPSTRAT-2336 - [GA] AutoNode (Native Karpenter) with ROSA-HCP
    • 75% To Do, 25% In Progress, 0% Done

      Goal

      • Define and implement a fix for the bug where ignition token rotation in a hosted control plane triggers Karpenter Drift and an unnecessary node rollout.

      Why is this important?

      • This will prevent an unnecessary rollout each time a token rotates (roughly every 5.5 hours). Here are some internal details about this bug:
      • As background, Karpenter has a feature called Drift, a form of disruption: if certain fields on existing NodeClaims differ from their parent NodePool or EC2NodeClass, Karpenter marks those NodeClaims as "Drifted" and rolls out new NodeClaims and nodes with the new spec, as a form of cluster reconciliation.

      The EC2NodeClass has a field called "userData", which lets users pass what AWS calls "userData"[1] directly to the EC2 instances that Karpenter provisions, as input for bootstrapping logic. The field is a raw string, and Karpenter considers it for drift.
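The drift check described above can be pictured as a hash comparison. This is only a minimal sketch: the field names in `node_class` and the helpers `spec_hash` and `is_drifted` are hypothetical, and Karpenter's real implementation may select and hash fields differently.

```python
import hashlib
import json


def spec_hash(spec: dict) -> str:
    """Return a stable digest of a spec (keys sorted so ordering is irrelevant)."""
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_drifted(recorded_hash: str, current_spec: dict) -> bool:
    """A node is drifted when the hash recorded at launch no longer matches."""
    return recorded_hash != spec_hash(current_spec)


# Hypothetical, heavily simplified node class spec.
node_class = {"amiFamily": "Custom", "userData": "bootstrap-v1"}
recorded = spec_hash(node_class)  # assumed to be stored alongside the node at launch

unchanged = is_drifted(recorded, node_class)

# userData is a raw string, so any edit to it changes the hash.
node_class["userData"] = "bootstrap-v2"
changed = is_drifted(recorded, node_class)

print(unchanged, changed)
```

Because the whole raw string feeds the hash, the check cannot tell a meaningful bootstrap change from an incidental one.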

      As background on HyperShift: when the hypershift-operator creates guest clusters, it uses an "ignition-server" to serve CoreOS Ignition bootstrapping configuration. This is how, at least in ROSA, EC2 instances with CoreOS AMIs are bootstrapped with the correct userData for the CoreOS version pinned to the OCP release that the user created the guest cluster with. Requests to these ignition servers require authentication, and that is where the bearer tokens come in.

      HyperShift rotates these tokens every 5.5 hours for security reasons. Each rotation changes the userData field in the EC2NodeClass, so Karpenter treats it as drift. As a result, users' Karpenter nodes in AutoNode drift at this interval for no reason visible to them, since the rotation is completely transparent from the guest cluster admin's perspective.
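To make the failure mode concrete, here is a small sketch of a token rotation changing the userData digest. The shape of the ignition stub, the endpoint URL, and the token values are assumptions made for illustration; only the 5.5 hour interval comes from the description above.

```python
import hashlib
import json

ROTATION_INTERVAL_HOURS = 5.5  # interval stated in the description


def render_user_data(token: str) -> str:
    """Build a hypothetical ignition stub that embeds the bearer token."""
    stub = {
        "ignition": {
            "config": {
                "merge": [
                    {
                        # Placeholder endpoint, not a real ignition server.
                        "source": "https://ignition.example.invalid/ignition",
                        "httpHeaders": [
                            {"name": "Authorization", "value": f"Bearer {token}"}
                        ],
                    }
                ]
            }
        }
    }
    return json.dumps(stub, sort_keys=True)


def digest(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


before = digest(render_user_data("token-a"))
after = digest(render_user_data("token-b"))  # same config, rotated token

drifted = before != after
rollouts_per_day = 24 / ROTATION_INTERVAL_HOURS

print(drifted, round(rollouts_per_day, 2))
```

Nothing but the token differs between the two renderings, yet the digests differ, which at a 5.5 hour interval works out to a little over four unnecessary rollouts per day.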

      Solutions (from this spike):

      Solutions in order from hackiest to most complete:

      1. Disable token rotation altogether for the Karpenter hyperv1.NodePool.
      2. Carry a patch in our downstream aws-karpenter to ignore token rotation changes.
      3. Open an upstream RFE to allow non-driftable templating variables in the userData field, and implement it.

      1) Alberto notes that this is not ideal from the HyperShift side.

      2) This would work, but it is a brittle solution that does not encourage upstream collaboration, and we would have to maintain a carry patch.

      3) This is the best solution: it solves the issue upstream and introduces template variables, which are apparently a somewhat wanted feature upstream. However, it will take at least 4 or 5 sprints in order to:

      a) Get review, consensus, and approval on an RFE implementation (I already opened one here: https://github.com/aws/karpenter-provider-aws/pull/8357)

      b) Actually implement the solution upstream and get reviews, write tests, docs, etc.

      c) Downstream the fix
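The idea behind option 3 can be sketched as follows: drift is computed over the userData template, while variable values are substituted only when the instance is launched. The `${ignition_token}` placeholder syntax and the `NodeClassSketch` type are hypothetical and are not taken from the RFE; the real design may look quite different.

```python
import hashlib
from dataclasses import dataclass, field
from string import Template

# Hypothetical template; the placeholder name is an assumption.
USER_DATA_TEMPLATE = (
    '{"httpHeaders": [{"name": "Authorization", '
    '"value": "Bearer ${ignition_token}"}]}'
)


@dataclass
class NodeClassSketch:
    user_data_template: str
    variables: dict = field(default_factory=dict)

    def drift_hash(self) -> str:
        """Hash only the template, so variable values never cause drift."""
        return hashlib.sha256(self.user_data_template.encode("utf-8")).hexdigest()

    def rendered_user_data(self) -> str:
        """Substitute variables at launch time for the actual instance userData."""
        return Template(self.user_data_template).substitute(self.variables)


node_class = NodeClassSketch(USER_DATA_TEMPLATE, {"ignition_token": "token-a"})
hash_before = node_class.drift_hash()
rendered_before = node_class.rendered_user_data()

node_class.variables["ignition_token"] = "token-b"  # simulated rotation
hash_after = node_class.drift_hash()
rendered_after = node_class.rendered_user_data()

print(hash_before == hash_after, rendered_before != rendered_after)
```

New instances still receive the current token, but existing nodes are not marked as drifted because the template itself did not change.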

      For now I've decided to go with option 3. If we get close to GA, absolutely need a fix, and are not close to finishing 3), option 2 is relatively simple to implement as a stopgap until option 3 is ready.
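As a rough illustration of what the option 2 stopgap amounts to, the drift hash could be computed over userData with the bearer token masked out. This is a sketch under assumptions: the regular expression, the sample userData strings, and the helper names are hypothetical, and an actual carry patch would live in the downstream Go code rather than look like this.

```python
import hashlib
import re

# Matches "Bearer <token>" and captures the prefix so it can be kept.
BEARER_RE = re.compile(r"(Bearer\s+)[A-Za-z0-9._~+/=-]+")


def normalize(user_data: str) -> str:
    """Replace any bearer token with a fixed marker before hashing."""
    return BEARER_RE.sub(r"\1<redacted>", user_data)


def drift_hash(user_data: str) -> str:
    return hashlib.sha256(normalize(user_data).encode("utf-8")).hexdigest()


# Hypothetical userData fragments.
old = '{"name": "Authorization", "value": "Bearer abc123"}'
rotated = '{"name": "Authorization", "value": "Bearer xyz789"}'
edited = '{"name": "Authorization", "value": "Bearer xyz789", "extra": true}'

token_only_change = drift_hash(old) != drift_hash(rotated)
real_change = drift_hash(old) != drift_hash(edited)

print(token_only_change, real_change)
```

The brittleness noted for option 2 shows up here: the masking depends on the exact textual form of the token inside userData, so any change to that form would silently reintroduce the drift.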

              rh-ee-macao Max Cao