Feature
Resolution: Unresolved
Blocker
Kubernetes-native Infrastructure
3% To Do, 0% In Progress, 97% Done
Telco 5G Core
BAFO: 7.1 Kata Containers support.
SUMMARY OF GOAL
To provide OpenShift customers the ability to run any workload that runs in RHEL today, with little to no friction from a capabilities perspective. Most users are expected to be able to run sandboxed workloads with runc, but there are known edge cases where the isolation provided by the kernel is not enough, or where it is intentionally disabled by a vendor or the end user. The following edge use cases would be made available with Kata Containers:
- Supporting kernels with different configurations/kernel modules - enables workloads that require custom kernel tuning (sysctl, scheduler changes, cache tuning, etc.) or custom kernel modules (out of tree, special arguments, etc.). See the Hitachi use case, slides 14 to 18. This would not, of course, address cases where the configuration is specifically intended to influence the container host's kernel.
- Exclusive access to hardware - workloads which expect/require exclusive access to hardware such as network cards, storage devices, ASICs, FPGAs, or other special devices like Nvidia GPUs. This may include SR-IOV based devices or even devices which aren’t SR-IOV capable.
- Super Privileged Containers - workloads that run in a super-privileged container use the standard OCI packaging format for the convenience it offers in providing all of the dependencies and configuration of an application, but require privileges beyond what is safe for a standard containerized Linux process (using runc). These could be workloads that do not work with the standard capabilities allowed by CRI-O, may require set-uid root binaries, or may even require complete root privilege to function correctly. Such workloads are not safe to run in a multi-tenant environment like OpenShift, and the default security context prevents you from running them. It is recognized that some workloads, which are in effect parts of the platform (e.g. fluentd) or third-party infrastructure solutions (e.g. Sysdig or Twistlock), will always require privileges. The principle of least privilege still applies to these workloads.
- Isolated multi-tenant code - supporting multiple untrusted users sharing the same OpenShift cluster, or running third-party workloads from multiple vendors, such as CNFs or enterprise applications. For example, two third-party CNF vendors will not want their custom settings interfering with each other's packet tuning or sysctl variables. Also, customers may not understand, have visibility into, or trust what these containers are doing, and so prefer to run them with a completely isolated kernel to proactively prevent noisy-neighbor problems (from a configuration perspective).
- Execution Environments for Function as a Service - each function runs in its own isolated environment, with its own resources and file system view. Use the same techniques as KVM/QEMU to provide security and separation at the infrastructure and execution levels. Execution environments are isolated from one another using several container technologies built into the Linux kernel. These technologies include (from the "Security Overview of AWS Lambda" whitepaper):
- cgroups – constrain resource access by limiting CPU, memory, disk throughput, and network throughput, per execution environment.
- namespaces – group process IDs, user IDs, network interfaces, and other resources managed by the Linux kernel. Each execution environment runs in a dedicated namespace.
- seccomp-bpf – limit the syscalls that can be used from within the execution environment.
- iptables and routing tables – isolate execution environments from each other.
- chroot – provide scoped access to the underlying filesystem.
- Isolated debug code - often, administrators need to delegate administrative control over the pods that an application developer has access to. This is common when the developer has SME knowledge that the administrator does not. This can include safely and securely delegating eBPF (which today requires CAP_SYS_ADMIN or CAP_BPF, giving a developer access to every process on the container host worker node), SystemTap, or even loading custom kernel modules.
- Isolated vulnerable code - addressing cases where the user is required to run a containerized workload that may have known vulnerabilities. This could be the result of a need to run legacy applications that are no longer maintained, customer roadmaps where fixing the issue will take time, etc.
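As a sketch of how a cluster user would opt one of the edge-case workloads above into the Kata runtime (the `kata` runtime class name, pod name, and image are illustrative assumptions; the actual names shipped by the operator may differ), a pod simply references the runtime class in its spec:

```yaml
# Hypothetical pod spec: runs the workload under Kata Containers
# instead of runc by naming an assumed "kata" RuntimeClass.
apiVersion: v1
kind: Pod
metadata:
  name: isolated-workload
spec:
  runtimeClassName: kata   # assumed name; provided by the operator
  containers:
  - name: app
    image: registry.example.com/legacy-app:latest  # placeholder image
```

Everything else about the pod definition stays the same, which is what keeps the "little to no friction" goal realistic for these workloads.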
This is a large multi-release effort with support planned as follows:
- OpenShift 4.6: Private preview for Verizon
- OpenShift 4.7: Public Technology Preview
- OpenShift 4.8: Public Technology Preview
USER STORIES
As an operator, I want to deploy workloads which disable kernel isolation, or for which I feel the kernel does not provide enough isolation.
REQUIREMENTS
These are the requirements specifically for 4.6:
- Bare metal support
- Operator installation
- Documentation
NON REQUIREMENTS
For 4.6, these are not requirements, but will likely be required by 4.8 or later:
- SR-IOV
- IPV6
REFERENCES
- Why is Red Hat investing in Kata Containers?
- OpenShift Enhancement Proposal
- What do we think is the value of the kata containers project?
ASSUMPTIONS
- OpenShift will provide a predefined CRI-O configuration for Kata so that users don't have to create it themselves
- OpenShift will provide a predefined RuntimeClass already configured to consume Kata Containers (e.g. SecureRuntime). It will be given a general name rather than a Kata-specific one, should we ever decide to change the underlying technology
- The customer will know how and when to use the pre-created class (e.g. SecureRuntime)
- This will work on CoreOS and RHEL 8 only
- No RHEL 7 support
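The predefined RuntimeClass assumed above might look roughly like the following (the `secure-runtime` name and `kata` handler are illustrative, following the assumption of a general, non-Kata-specific name; the actual object shipped by the operator may differ):

```yaml
# Hypothetical RuntimeClass: a general name (per the assumption above)
# mapping to the Kata runtime handler configured in CRI-O.
apiVersion: node.k8s.io/v1beta1
kind: RuntimeClass
metadata:
  name: secure-runtime   # general, non-Kata-specific name
handler: kata            # CRI-O runtime handler for Kata Containers
```

Keeping the user-facing name generic means the `handler` could later point at a different sandboxing technology without changing any pod specs.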
Feature Done Checklist
- CI - CI Job & Automated tests: <link to CI Job & automated tests>
- Release Enablement: <link to Feature Enablement Presentation>
- DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
- DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
- DEV - Downstream build attached to advisory: <link to errata>
- QE - Test plans in Polarion: <link or reference to Polarion>
- QE - Automated tests merged: <link or reference to automated tests>
- DOC - Downstream documentation merged: <link to meaningful PR>
Notes for Done Checklist:
- When adding links to the checklist above, with multiple teams contributing, select a meaningful reference for this Feature.
- Checklist added to each Feature in the description, to be filled out as phases are completed - tracking progress towards “Done” for the Feature.