
    • Kubernetes-native Infrastructure
    • Telco 5G Core

      BAFO: 7.1 Kata Containers support. 

      SUMMARY of GOAL

      To provide OpenShift customers the ability to run any workload which runs in RHEL today with little to no friction from a capabilities perspective. It is expected that most users' workloads will run fine in standard containers using runc, but there are known edge cases where the isolation provided by the kernel is not enough, or it is intentionally disabled by a vendor or the end user. The following edge use cases would be made available with Kata Containers:

      • Supporting kernels with different configurations/kernel modules - enables workloads which require custom kernel tuning (sysctl, scheduler changes, cache tuning, etc.) or custom kernel modules (out of tree, special arguments, etc.). See Hitachi use case, slides 14 to 18. This would of course not address cases where the configuration is specifically intended to influence the host kernel.
      • Exclusive access to hardware - workloads which expect/require exclusive access to hardware such as network cards, storage devices, ASICs, FPGAs, or other special devices like Nvidia GPUs. This may include SR-IOV based devices or even devices which aren’t SR-IOV capable.
      • Super Privileged Containers - Workloads which run in a super-privileged container use the standard OCI packaging format for the convenience it offers with providing all of the dependencies and configuration of an application, but require privileges beyond what is safe for a standard containerized Linux process (using runc). These could be workloads that do not work with the standard Capabilities allowed by CRI-O, may require set-uid root binaries, or may even require complete root privilege to function correctly. These application workloads are not safe to run in a multi-tenant environment like OpenShift, and the default security context prevents you from running them. It is recognized that some workloads, which are in effect parts of the platform (e.g. fluentd) or third party infrastructure solutions (e.g. sysdig or twistlock), will always require privileges. The principle of least privilege still applies to these workloads.
      • Isolated multi-tenant code - Supporting multiple untrusted users sharing the same OpenShift cluster, or running third party workloads from multiple vendors, such as CNFs or enterprise applications. For example, two third party CNF vendors will not want their custom settings interfering with each other's packet tuning or sysctl variables. Also, customers may not understand, have visibility into, or trust what these containers are doing, and so prefer to run them with a completely isolated kernel to proactively prevent noisy neighbor problems (from a configuration perspective).
      • Execution Environments for Function as a Service - Each function runs in its own isolated environment, with its own resources and file system view. Use the same techniques as KVM/Qemu to provide security and separation at the infrastructure and execution levels. Execution environments are isolated from one another using several container technologies built in to the Linux kernel. These technologies include (From Security overview of AWS Lambda whitepaper):
        • cgroups – Constrain resource access by limiting CPU, memory, disk
          throughput, and network throughput, per execution environment.
        • namespaces – Group process IDs, user IDs, network interfaces, and other
          resources managed by the Linux kernel. Each execution environment runs in a
          dedicated namespace.
        • seccomp-bpf – Limit the syscalls that can be used from within the execution
          environment.
        • iptables and routing tables – Isolate execution environments from each other.
        • chroot – Provide scoped access to the underlying filesystem.
      • Isolated debug code - Often, administrators need to delegate administrative control to the pods that an application developer has access to. This is common when the developer has SME knowledge that the administrator does not. This can include things like safely and securely delegating eBPF (which today requires CAP_SYS_ADMIN or CAP_BPF, giving a developer access to every process on the Container Host worker node), SystemTap, or even loading custom kernel modules.
      • Isolated vulnerable code - Addressing cases where the user is required to run a containerized workload which may have known vulnerabilities. This could be a result of the need to run legacy applications that are no longer maintained, customer roadmaps where the fix will take time, etc.
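
      For the use cases above, selecting the Kata runtime for a given workload is expected to be a per-Pod choice via the Kubernetes RuntimeClass mechanism. A minimal sketch of what that could look like; the runtime class name "kata" and the container image are illustrative assumptions, not the final shipped names:

```yaml
# Hypothetical Pod spec: the runtimeClassName value "kata" is an assumption;
# OpenShift may ship a more general name (e.g. SecureRuntime).
apiVersion: v1
kind: Pod
metadata:
  name: example-sandboxed-pod
spec:
  runtimeClassName: kata   # run this Pod's containers inside a Kata VM sandbox
  containers:
  - name: app
    image: registry.access.redhat.com/ubi8/ubi-minimal
    command: ["sleep", "infinity"]
```

      Everything else about the Pod spec stays unchanged; only the runtimeClassName field opts the workload into the isolated kernel.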

      This is a large multi-release effort with support planned as follows:

      • OpenShift 4.6: Private preview for Verizon
      • OpenShift 4.7: Public Technology Preview
      • OpenShift 4.8: Public Technology Preview

      USER STORIES

      As an operator, I want to deploy workloads which disable kernel isolation, or for which I feel the kernel does not provide enough isolation.

      REQUIREMENTS

      These are the requirements specifically for 4.6:

      • Bare metal support
      • Operator installation
      • Documentation
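
      Operator installation would presumably follow the standard OLM flow. A hedged sketch of a Subscription manifest; the operator package name, channel, catalog source, and namespace here are assumptions for illustration, not confirmed product names:

```yaml
# Hypothetical OLM Subscription: package name, channel, and namespace are
# illustrative assumptions, not confirmed product names.
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: kata-operator
  namespace: openshift-operators
spec:
  channel: stable
  name: kata-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
```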

      NON REQUIREMENTS

      For 4.6, these are not requirements, but will likely be required by 4.8 or later:

      • SR-IOV
      • IPV6

      REFERENCES

      ASSUMPTIONS

      • OpenShift will provide a predefined CRI configuration for Kata so that users don't have to create this themselves.
      • OpenShift will provide a predefined Runtime Class already configured to consume Kata Containers (ex. SecureRuntime). It will be a general name rather than a Kata-specific one, should we ever decide to change the underlying technology.
      • The customer will know how and when to use the pre-created (ex. SecureRuntime) class.
      • This will work on CoreOS and RHEL 8 only
      • No RHEL 7 support
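
      The predefined Runtime Class mentioned in the assumptions could look roughly like the following; the metadata.name "secure-runtime" and the handler value are illustrative assumptions, not the object OpenShift would actually ship:

```yaml
# Hypothetical RuntimeClass: name and handler values are assumptions;
# OpenShift would ship its own predefined object under a general name.
apiVersion: node.k8s.io/v1beta1
kind: RuntimeClass
metadata:
  name: secure-runtime
handler: kata   # must match the runtime handler configured in CRI-O
```

      The handler string ties the class to a runtime entry in the CRI-O configuration, which is why shipping both pieces pre-configured (per the assumptions above) spares users from editing node-level config.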

      Feature Done Checklist 

      • CI - CI Job & Automated tests: <link to CI Job & automated tests>
      • Release Enablement: <link to Feature Enablement Presentation> 
      • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
      • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
      • DEV - Downstream build attached to advisory: <link to errata>
      • QE - Test plans in Polarion: <link or reference to Polarion>
      • QE - Automated tests merged: <link or reference to automated tests>
      • DOC - Downstream documentation merged: <link to meaningful PR>

      Notes for Done Checklist

      • When adding links to the above checklist with multiple teams contributing, select a meaningful reference for this Feature.
      • Checklist added to each Feature in the description, to be filled out as phases are completed - tracking progress towards “Done” for the Feature.

              fatherlinux Scott McCarty