1. Proposed title of this feature request
Share images between OCP nodes to reduce network bandwidth needed for installing/upgrading/growing clusters, and reduce Quay.io load and bandwidth.
2. What is the nature and description of the request?
For OCP cluster installations and upgrades, the same images are pulled many times from Quay.io (for connected clusters). This imposes a high network bandwidth requirement on the customer side for the installation/upgrade to succeed, and also generates high bandwidth usage and load on Quay.io.
For node creation, images already present in other nodes in the same cluster are also pulled from Quay.io.
There are different possible approaches to implement this feature; some examples (several of which could be implemented in parallel) are:
- Deploy a "tiny" registry pod only for OCP "core" images (different from the internal registry):
  - During installation, that pod can be deployed on the bootstrap node. Once the bootstrap is deployed and running, it should pull all the images required for the installation in advance, to later provide them to the control plane and compute nodes (while the control plane and compute machines start to be created). That way, most of the images (if not all) could already be available in the bootstrap/registry when required by the rest of the machines.
  - That "tiny" registry could also act as a "cache registry" that pulls the OCP "core" images required by the nodes and keeps them whenever any node tries to pull one.
  - Synchronize the images from the bootstrap to the rest of the nodes (maybe with `skopeo copy` or other available ways), especially for images not already pulled by the nodes.
  - (optional to review) Maybe configure the "neverContactSource" option on the nodes (only for the OCP "core" images), so that only the registry pod pulls those images from the source (and caches them). However, if this option is enabled and the registry pod fails, it will have to be fixed before the install/upgrade can continue.
  - Before the bootstrap is deleted, move the "tiny" registry pod to one (or several) nodes, to keep serving the images to the rest of the nodes in the cluster.
  - During upgrades, start pulling images during the first upgrade steps (for example into the registry pod) and synchronize them to the rest of the nodes (even before the nodes need the images for the upgrade).
  - (optional to check) That "tiny" registry should maybe be disabled for "edge" nodes, for which pulling from Quay.io could be nearer and faster than pulling from the cluster. But they could still benefit from the image synchronization (especially for upgrades), if it is done in advance (before the node needs the images).
  - While the cluster is running, keep the images synchronized between nodes (maybe with `skopeo copy` or another available way).
  - When a new node is created, images can be synchronized during the first steps of the machine creation process, so they are available when needed; images not yet on the node could be pulled from the "tiny" registry pod in the cluster.
  - (optional) Depending on the "extra" disk size of the nodes, other Red Hat operator images could also be included in the "tiny" registry pod and in the image synchronization between nodes.
- Use something like the image "pre-caching" for SNO [1], via the machine running the installer, for cluster installations.
- Other approaches for the same goal could also be investigated.
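The bootstrap-hosted "tiny" registry and the node synchronization described above could be sketched roughly as follows. This is a non-authoritative sketch: the registry image, port, the `bootstrap.example.com:5000` address, the release tag, and the repository paths are illustrative assumptions, not part of this proposal, and the commands need a live environment to run.

```shell
# On the bootstrap node: run a plain registry to hold the OCP "core" images.
# (Image name, port, and hostname are assumptions for illustration.)
podman run -d --name core-image-cache -p 5000:5000 docker.io/library/registry:2

# Pre-pull the release payload from Quay.io into that local registry, so
# control plane and compute machines can fetch the images locally.
oc adm release mirror \
  --from=quay.io/openshift-release-dev/ocp-release:4.17.0-x86_64 \
  --to=localhost:5000/ocp/release

# On any other node: copy an image from the bootstrap registry directly into
# the node's local container storage (one possible "synchronize" mechanism).
skopeo copy --src-tls-verify=false \
  docker://bootstrap.example.com:5000/ocp/release:4.17.0-x86_64 \
  containers-storage:quay.io/openshift-release-dev/ocp-release:4.17.0-x86_64
```

The `containers-storage:` transport writes the image straight into CRI-O/podman local storage, so the node would not need to pull it again at install/upgrade time.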
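For the "neverContactSource" idea, OCP already exposes a similar knob: the `ImageDigestMirrorSet` API accepts `mirrorSourcePolicy: NeverContactSource`, which forbids falling back to the source registry when the mirror fails. A hedged config sketch (the mirror hostname and repository path are assumed placeholders for the proposed "tiny" registry):

```yaml
apiVersion: config.openshift.io/v1
kind: ImageDigestMirrorSet
metadata:
  name: ocp-core-image-mirror
spec:
  imageDigestMirrors:
  - source: quay.io/openshift-release-dev/ocp-v4.0-art-dev
    mirrors:
    - bootstrap.example.com:5000/ocp/release   # assumed "tiny" registry address
    mirrorSourcePolicy: NeverContactSource     # never fall back to Quay.io
```

This illustrates the trade-off noted above: with `NeverContactSource`, a failure of the mirror registry blocks image pulls until it is repaired.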
3. Why does the customer need this? (List the business requirements here)
Several customers have reported installation issues that were finally identified as insufficient network bandwidth to the internet. Recently, it was requested to document the minimal bandwidth required for installation (RFE-5699).
But the issues are not limited to installation: they also appear during upgrades and even new node creation, as the images are pulled by the upgrading/new nodes from Quay.io (in connected clusters).
There have also been reports of Quay.io issues that seem related to high load.
There are several improvements this RFE can help with:
- Bandwidth reduction on the customer side when installing/upgrading/adding new nodes in any OCP 4 cluster.
- Faster installs/upgrades/node creation.
- Quay.io bandwidth and load reduction.
- HCP clusters will also benefit from this for new node creation.
- Hibernated clusters will probably benefit from this when they are started again.
- Improved customer experience with OpenShift installation/upgrade/node creation.
Note: This feature could require increasing the minimal disk requirement for the nodes, but disk space is generally cheap and the benefits justify the extra usage. This feature could be disabled (and those customers would lose all the benefits above) on clusters whose nodes don't have enough disk space to hold all the payload images (including the current version's payload and, during upgrades, the new version's payload at the same time).
4. List any affected packages or components.
OCP Installer, OCP nodes.
[1] https://docs.openshift.com/container-platform/4.17/edge_computing/ztp-precaching-tool.html
Relates to: RFE-6404 Validation of all the images from the release image before install/upgrade of the OCP cluster.