- Feature Request
- Resolution: Unresolved
Enable NVIDIA H100 GPU Support in Confidential Mode for OpenShift sandboxed containers
1. Description
This Request for Enhancement (RFE) outlines the need to evaluate and enable support for NVIDIA H100 GPUs operating in confidential mode within OpenShift sandboxed containers (Confidential Containers). We have a specific customer request for this capability with a target delivery in 2025.
The primary technical challenge is the significant gap between the Linux kernel version NVIDIA requires for H100 confidential computing and the version currently available in Red Hat Enterprise Linux (RHEL) 9.6. According to NVIDIA's documentation, both the host and the guest need kernel version 6.9 or later. OpenShift 4.16 / RHEL 9.6 is based on kernel version 5.14.
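The version gap can be expressed as a simple check. The following is a minimal sketch, not a supported tool: the function names and the sample release string are illustrative assumptions, and the 6.9 threshold is the figure quoted above. Because RHEL kernels carry backports, a version comparison like this is only a coarse first filter and does not prove that a required feature is present or absent.

```python
import platform
import re

# Minimum kernel version quoted in this RFE from NVIDIA's documentation.
REQUIRED = (6, 9)


def parse_kernel_release(release: str) -> tuple[int, int]:
    """Extract (major, minor) from a `uname -r` style release string."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def meets_minimum(release: str, required: tuple[int, int] = REQUIRED) -> bool:
    """Return True if the release is at least the required (major, minor)."""
    return parse_kernel_release(release) >= required


if __name__ == "__main__":
    running = platform.release()
    print(f"{running}: meets {REQUIRED[0]}.{REQUIRED[1]} minimum = {meets_minimum(running)}")
```

Comparing integer tuples rather than strings avoids the pitfall where "6.10" would sort below "6.9" lexicographically.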
Successfully enabling this feature will require substantial work on both the host and guest kernel environments, and potentially on other components of the stack. This RFE initiates the investigation to scope the full extent of this effort.
2. Current Status & Analysis
Our initial investigation has focused on the guest kernel requirements and has revealed critical dependencies that make this a non-trivial undertaking.
Guest Kernel Analysis (RHEL 9.6 - Kernel 5.14):
- Core Problem: The NVIDIA driver for H100 confidential computing relies on libspdm to establish a secure session with the GPU. This library now interfaces directly with the Linux Kernel Crypto API (LKCA) for cryptographic operations.
- Missing Dependencies: The RHEL 9.6 kernel (5.14) lacks the internal crypto APIs that libspdm depends on. Specifically, header files such as crypto/internal/ecc.h and crypto/sig.h are absent.
- Observed Failure: When attempting to load the NVIDIA driver in a RHEL 9.6-based guest, the process fails during the SPDM handshake. The kernel log (dmesg) shows the following fatal error:
libspdm_check_crypto_backend: Error - libspdm expects LKCA but found stubs!
NVRM: spdmContextInit_IMPL: SPDM cannot boot without proper crypto backend!
NVRM: _kgspEstablishSpdmSession: SPDM handshake with Responder failed.
- NVIDIA Confirmation: An engineer from NVIDIA confirmed that the ecdh_generic and ecdsa_generic modules are essential for completing the SPDM session establishment, further highlighting the missing cryptographic components in the 5.14 kernel.
- Conclusion: To support H100 confidential computing on the RHEL 9.6 guest kernel, a significant number of features and APIs from the crypto subsystem of a much newer kernel (6.9 or later) would need to be identified and backported to kernel 5.14. This is a complex and high-effort task.
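To make the observed failure easy to recognize across test runs, the kernel log can be scanned for the three messages quoted above. This is a minimal sketch under stated assumptions: the marker strings are copied from the log excerpt in this RFE, while the dictionary keys and function name are hypothetical labels chosen for illustration. Other driver versions may word these messages differently.

```python
# Substrings taken from the dmesg excerpt in this RFE.
# The keys are illustrative labels, not NVIDIA terminology.
SPDM_FAILURE_MARKERS = {
    "crypto_backend_stubs": "libspdm expects LKCA but found stubs",
    "spdm_context_init": "SPDM cannot boot without proper crypto backend",
    "spdm_handshake": "SPDM handshake with Responder failed",
}


def find_spdm_failures(log_text: str) -> list[str]:
    """Return the labels of all known SPDM failure markers present in the log."""
    return [
        label
        for label, marker in SPDM_FAILURE_MARKERS.items()
        if marker in log_text
    ]


if __name__ == "__main__":
    import subprocess

    # Reading dmesg usually requires elevated privileges on the guest.
    log = subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=False
    ).stdout
    print(find_spdm_failures(log) or "no known SPDM failure markers found")
```

An empty result only means these specific messages were not logged; it does not confirm that the SPDM session was established.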
Host Kernel Analysis:
- The specific requirements and effort needed to update the host kernel are currently unknown.
- It is reasonable to assume that the host kernel will require a similarly extensive set of backports or modifications to support the necessary features for confidential GPU pass-through. This part of the investigation is outstanding.
3. Next Steps
- Scope Host Kernel Requirements: Initiate a technical investigation to determine the full scope of work required to prepare the RHEL 9.6 host kernel for H100 confidential computing.
- Engage Kernel Team: Collaborate with the Red Hat Kernel team to perform a detailed analysis of the backporting effort for both guest and host kernels. The goal is to get an estimate of the complexity and timeline.
- Full Dependency Analysis: Work with NVIDIA to obtain a comprehensive list of all kernel and user-space dependencies required for this feature to function correctly.
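As a starting point for the dependency analysis, the guest-side findings from section 2 could be captured as a repeatable check against any candidate kernel. The sketch below is illustrative only. It assumes a kernel headers tree is available at a path supplied by the caller, and that registered ECDH/ECDSA algorithms appear in /proc/crypto with names beginning with "ecdh" and "ecdsa". Both the header list and the algorithm prefixes reflect only what is known so far and would need to be extended once NVIDIA provides the full list.

```python
from pathlib import Path

# Headers reported missing in section 2 (paths relative to the include root).
REQUIRED_HEADERS = ("crypto/internal/ecc.h", "crypto/sig.h")

# Assumed name prefixes for the algorithms provided by ecdh_generic / ecdsa_generic.
REQUIRED_ALGO_FAMILIES = ("ecdh", "ecdsa")


def missing_headers(include_root, headers=REQUIRED_HEADERS):
    """Return the required headers that are not present under include_root."""
    root = Path(include_root)
    return [h for h in headers if not (root / h).is_file()]


def registered_algorithms(proc_crypto_text):
    """Collect the values of all 'name : ...' fields from /proc/crypto content."""
    names = set()
    for line in proc_crypto_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "name":
            names.add(value.strip())
    return names


def missing_algo_families(proc_crypto_text, families=REQUIRED_ALGO_FAMILIES):
    """Return the families with no registered algorithm whose name starts with them."""
    names = registered_algorithms(proc_crypto_text)
    return [f for f in families if not any(n.startswith(f) for n in names)]
```

On a running guest, the second check could be fed with `Path("/proc/crypto").read_text()`. Algorithms from modules that are not loaded yet will not be listed there, so a reported gap should be confirmed after attempting to load the relevant modules.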