-
Epic
-
Resolution: Unresolved
-
Major
-
Custom Pool Optimization
-
To Do
-
OCPSTRAT-958 - Boot new workers directly into custom pool configuration
-
100% To Do, 0% In Progress, 0% Done
Spike Goal
- Find a path forward so that users who boot their machines directly into a custom pool using Ignition do not hit a race condition that can significantly increase provisioning time
- Evaluate the workaround: https://access.redhat.com/solutions/6957118
Background
We support (but don’t document?[1]) booting machines directly into a custom pool via the config/$CUSTOM_POOL Ignition endpoint.
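As an illustration of what booting via this endpoint looks like, here is a minimal sketch of a pointer Ignition config that merges the rendered config served at config/$CUSTOM_POOL. The host name, the pool name, and the helper name `pointer_ignition` are hypothetical. The port (22623) and the Ignition spec version are assumptions to check against the cluster in question. A real config would also need the cluster CA under `ignition.security.tls.certificateAuthorities`, which is omitted here.

```python
import json


def pointer_ignition(mcs_host: str, pool: str, port: int = 22623) -> dict:
    """Build a stub Ignition config that merges the config served for `pool`.

    Hypothetical helper: host, port and spec version are assumptions.
    """
    return {
        "ignition": {
            "version": "3.2.0",  # assumed spec version
            "config": {
                # Fetch the pool's rendered config from the config/<pool> endpoint.
                "merge": [{"source": f"https://{mcs_host}:{port}/config/{pool}"}]
            },
        }
    }


if __name__ == "__main__":
    # "ran-site-a" is a made-up custom pool name.
    print(json.dumps(pointer_ignition("api-int.example.com", "ran-site-a"), indent=2))
```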
The worker config that the custom pool inherits from means that kubelet will be started with --node-labels="node-role.kubernetes.io/worker", and kubelet will create a Node resource with that label.
It is presumed that something (a user, or MAO[2]) will notice the new Node resource and quickly apply the custom role label, ensuring that the machine remains in the custom machine config pool.
However, the MCO node controller may get there first, decide that the Node actually belongs in the default worker pool, notice that it’s not running the worker pool config, and request the MCD to apply the worker pool config.
Then, when the custom role label is applied to the node, this will be reversed and the machine will eventually revert back to the config it was initially booted with.
This is particularly problematic for bare metal machines with long reboot times. Losing this race can cost up to 4 reboots (~1 hour) to provision the machine.
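The ordering problem can be sketched as a toy model. This is a simplification, not the MCO's actual logic: pool membership is reduced to "custom role label present or not", each config change stands in for an MCD update, and all names (`Node`, `sync`, `provision`) are hypothetical. It counts config changes rather than reboots, so it does not reproduce the "up to 4 reboots" figure.

```python
from dataclasses import dataclass, field

WORKER_LABEL = "node-role.kubernetes.io/worker"


def role_label(pool: str) -> str:
    return f"node-role.kubernetes.io/{pool}"


def target_pool(labels: set, custom_pool: str) -> str:
    # Simplified assumption: the custom pool wins only if its role label is set.
    return custom_pool if role_label(custom_pool) in labels else "worker"


@dataclass
class Node:
    current_config: str  # pool whose config the machine is running
    labels: set = field(default_factory=lambda: {WORKER_LABEL})
    config_changes: int = 0


def sync(node: Node, custom_pool: str) -> None:
    """One node-controller pass: on mismatch, the MCD applies the pool's config."""
    wanted = target_pool(node.labels, custom_pool)
    if node.current_config != wanted:
        node.current_config = wanted
        node.config_changes += 1


def provision(custom_pool: str, controller_first: bool) -> Node:
    # The machine boots directly into the custom pool's config,
    # but kubelet registers the Node with only the worker label.
    node = Node(current_config=custom_pool)
    if controller_first:
        sync(node, custom_pool)  # race lost: node is moved to the worker config
    node.labels.add(role_label(custom_pool))  # user or MAO applies the label
    sync(node, custom_pool)
    return node


if __name__ == "__main__":
    print(provision("ran-site-a", controller_first=False).config_changes)
    print(provision("ran-site-a", controller_first=True).config_changes)
```

In this model, winning the race costs no config changes, while losing it costs two (custom to worker, then back to custom), and the node ends up on the config it booted with either way.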
[1] - https://github.com/openshift/machine-config-operator/blob/master/docs/custom-pools.md
[2] - How you can configure a MachineSet to ensure the correct role label gets copied across - https://github.com/openshift/machine-api-operator/blob/master/FAQ.md#which-annotations-and-labels-get-added-to-nodes
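For reference, the MachineSet configuration mentioned in [2] might look like the following sketch, which builds a merge patch adding the custom role label. The field path `spec.template.spec.metadata.labels` is our reading of the FAQ and should be verified there. The helper name and pool name are hypothetical. Note that MAO still applies the label after the Node is created, so on its own this does not close the race window described above.

```python
def machineset_role_label_patch(custom_pool: str) -> dict:
    """Build a merge patch that adds the custom role label to a MachineSet.

    Assumption: labels under spec.template.spec.metadata are copied to the Node.
    """
    return {
        "spec": {
            "template": {
                "spec": {
                    "metadata": {
                        # Role labels conventionally carry an empty value.
                        "labels": {f"node-role.kubernetes.io/{custom_pool}": ""}
                    }
                }
            }
        }
    }


if __name__ == "__main__":
    import json

    # "ran-site-a" is a made-up custom pool name.
    print(json.dumps(machineset_role_label_patch("ran-site-a"), indent=2))
```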
Why is this important?
- Booting directly into a custom pool using Ignition is a supported feature, and this race condition is unpredictable and undesirable behavior.
- Telco RAN use cases are just one situation where we can expect custom machine config pools to be very common - and will usually include site-specific custom pools. Provisioning windows for “far edge” sites in these use cases are typically quite small (for example, 4 hours), and this race condition can eat up a lot of that window.
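To put the numbers above together: a rough calculation, assuming the ~1 hour for 4 reboots splits evenly into about 15 minutes per reboot (an inference from the figures in this ticket, not a measurement) and using the 4-hour example window.

```python
def window_fraction_lost(reboots: int, minutes_per_reboot: float, window_hours: float) -> float:
    """Fraction of the provisioning window spent on extra reboots."""
    return (reboots * minutes_per_reboot) / (window_hours * 60)


if __name__ == "__main__":
    # 4 reboots at ~15 minutes each against a 4-hour window.
    lost = window_fraction_lost(reboots=4, minutes_per_reboot=15, window_hours=4)
    print(f"{lost:.0%} of the window")  # prints "25% of the window"
```

Under these assumptions, losing the race consumes about a quarter of the provisioning window for a single machine.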
Scenarios
- When booting a machine into a custom machine config pool using the config/$CUSTOM_POOL Ignition endpoint, the Node should automatically and immediately join that custom pool
- When booting a machine into a custom machine config pool using the config/$CUSTOM_POOL Ignition endpoint, all components must immediately agree that the custom pool is the correct pool
Acceptance Criteria
- CI - MUST be running successfully with tests automated
- Release Technical Enablement - Provide necessary release enablement details and documents.
- ...
Dependencies (internal and external)
- ...
Previous Work (Optional):
- https://issues.redhat.com/browse/KNIDEPLOY-4013
- https://github.com/openshift/enhancements/pull/717
- https://github.com/openshift/enhancements/pull/716
- https://mailman-int.corp.redhat.com/archives/aos-devel/2021-February/thread.html#00013
- https://github.com/openshift-kni/node-label-operator
Open questions:
- …
Done Checklist
- CI - CI is running, tests are automated and merged.
- Release Enablement <link to Feature Enablement Presentation>
- DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
- DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
- DEV - Downstream build attached to advisory: <link to errata>
- QE - Test plans in Polarion: <link or reference to Polarion>
- QE - Automated tests merged: <link or reference to automated tests>
- DOC - Downstream documentation merged: <link to meaningful PR>
- is related to
-
MCO-650 support booting into custom machine config pools
- To Do
-
OCPSTRAT-958 Boot new workers directly into custom pool configuration
- New
- relates to
-
RFE-2667 Boot nodes directly into custom machineconfig pools during autoscaling
- Backlog