[RFE-6216] Boot image skew limits - Red Hat Issue Tracker

Type: Feature Request
Resolution: Unresolved
Priority: Undefined
Fix Version/s: None
Affects Version/s: openshift-4.18
Component/s: MCO
Labels: None

Blocked: False
Blocked Reason: None
Ready: False
Color Status: Not Selected

1. Proposed title of this feature request

Boot image skew limits

2. What is the nature and description of the request?

Provide a way to warn customers (and potentially block 4.(y+1) updates) when excessively old boot images are in use. Not all customers actively provision new machines, so while proactively warning customers whose MachineSets and similar resources reference outdated boot images would be nice, rejecting new machines at initial-Ignition-request time may be sufficient. Error messages can mention managedBootImages and link to the KCS about manually updating boot images, to help impacted customers unstick themselves.

3. Why does the customer need this? (List the business requirements here)

There are occasional issues when new clusters attempt to use old boot images (~~MCO-540~~, ~~MCO-519~~, ~~MCO-1212~~, ~~COS-1942~~). New features like ClusterImagePolicy also lead to machine-config server Ignition content that needs to be compatible with the boot image that's being asked to pivot to the new OS image. Currently the machine-config server is making compatibility calls based on the Ignition version in the request header. For example:

$ oc -n openshift-machine-config-operator logs -l k8s-app=machine-config-server --tail 1
I0828 22:22:41.449488       1 api.go:116] Pool worker requested by address:"10.0.146.248:57159" User-Agent:"Ignition/2.15.0" Accept-Header: "application/vnd.coreos.ignition+json;version=3.4.0, */*;q=0.1"
I0828 21:18:39.328816       1 api.go:116] Pool worker requested by address:"10.0.183.35:62744" User-Agent:"Ignition/2.15.0" Accept-Header: "application/vnd.coreos.ignition+json;version=3.4.0, */*;q=0.1"
I0828 22:28:43.677961       1 api.go:116] Pool worker requested by address:"10.0.183.35:28692" User-Agent:"Ignition/2.15.0" Accept-Header: "application/vnd.coreos.ignition+json;version=3.4.0, */*;q=0.1"

So we know to serve that node 3.4.0-compatible Ignition. But "which Ignition version?" is only part of the compatibility exposure. It doesn't cover things like "will Podman understand the policy.json config knobs I'm setting?", and it doesn't cover things like "RHCOS 410.8.20190520.0? Nobody is shipping security patches for 4.1 RHCOS anymore".
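As a rough illustration of the header-based selection described above, here is a minimal Python sketch that pulls the Ignition spec version out of an Accept header like the ones in the log lines. The function name and parsing approach are assumptions for illustration, not the machine-config server's actual code.

```python
import re

# Hypothetical helper; not the machine-config server's implementation.
_IGNITION_MEDIA_TYPE = "application/vnd.coreos.ignition+json"
_VERSION_PARAM = re.compile(r"^version=(\d+)\.(\d+)\.(\d+)$")


def ignition_version_from_accept(accept_header):
    """Return the highest Ignition spec version named in an Accept header, or None."""
    best = None
    for media_range in accept_header.split(","):
        # Each media range looks like "type/subtype;param=value;...".
        parts = [p.strip() for p in media_range.split(";")]
        if parts[0].lower() != _IGNITION_MEDIA_TYPE:
            continue
        for param in parts[1:]:
            match = _VERSION_PARAM.match(param)
            if match:
                version = tuple(int(n) for n in match.groups())
                if best is None or version > best:
                    best = version
    return best


header = "application/vnd.coreos.ignition+json;version=3.4.0, */*;q=0.1"
print(ignition_version_from_accept(header))  # (3, 4, 0)
```

Note that the result only identifies the Ignition spec the client accepts. It says nothing about the rest of the boot image's userspace.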

Having a more robust check in the machine-config server would make for more accessible messaging. An alert like "you have a recent Ignition request with an incompatibly old boot image, please see..." is more actionable than the "hey, some of these Machines are failing to join the cluster, good luck rooting around in their serial console output" that we generate today when we serve an old boot image some new Ignition it can't handle.
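A minimal sketch of what such a guard might look like at initial-Ignition-request time. It assumes, purely for illustration, that the Ignition binary version in the User-Agent stands in for boot image age. The floor value, the function name, and the choice of signal are all hypothetical and are not specified by this request.

```python
import re

# Hypothetical floor for illustration only; this RFE does not define one.
MINIMUM_IGNITION = (2, 16, 0)

_USER_AGENT = re.compile(r"^Ignition/(\d+)\.(\d+)\.(\d+)")


def check_boot_image_skew(user_agent, minimum=MINIMUM_IGNITION):
    """Return (allowed, message) for an initial Ignition request."""
    match = _USER_AGENT.match(user_agent)
    if match is None:
        # Fail open: an unparseable User-Agent says nothing about boot image age.
        return True, "could not parse User-Agent %r; skew check skipped" % user_agent
    requested = tuple(int(n) for n in match.groups())
    if requested < minimum:
        return False, (
            "boot image is too old (Ignition %s, minimum %s); "
            "see managedBootImages or the KCS on manually updating boot images"
            % (".".join(map(str, requested)), ".".join(map(str, minimum)))
        )
    return True, "boot image is within the supported skew"


allowed, message = check_boot_image_skew("Ignition/2.15.0")
print(allowed, message)
```

The point of returning a message alongside the decision is that the same text could feed an alert or an event, so the admin sees the cause without reading serial console output.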

And besides being more accessible to cluster admins, having documented skew guards here would allow component teams to understand when they could reliably use new features that older RHCOS might not be familiar with (OCPBUGS-38809).

Work like managedBootImages, new as a tech preview in 4.16, can help reduce skew, but only in clusters where it is enabled. And in some disconnected/restricted-network or bare-metal-y situations, enabling new boot images requires mirroring and admin work that the cluster is unlikely to be able to automate. So this kind of skew guard would be useful even in a world where managedBootImages were GA for more cloud providers.

4. List any affected packages or components.

RHCOS/Ignition/MCO. Maybe HyperShift, which also handles new-Machine Ignition?

blocks

OCPNODE-2619 Move ClusterImagePolicy to v1 (In Progress)
OCPNODE-2690 Move ImagePolicy to v1 (In Progress)

relates to

OCPBUGS-38809 New nodes scaled using 4.5 base image cannot join the cluster if techpreview is enabled (Verified)

links to

enhancements#1698, Update bootimage management enhancement
