Support for Container Device Interface — NVIDIA Container Toolkit 1.17.3 documentation

About the Container Device Interface

As of the v1.12.0 release, the NVIDIA Container Toolkit includes support for generating Container Device Interface (CDI) specifications.

CDI is an open specification for container runtimes that abstracts what it means to provide access to a device, such as an NVIDIA GPU, and standardizes that access across container runtimes. Popular container runtimes can read and process the specification to ensure that a device is available in a container. CDI simplifies adding support for devices such as NVIDIA GPUs because the specification is applicable to all container runtimes that support CDI.

CDI also improves the compatibility of the NVIDIA container stack with certain features such as rootless containers.

Generating a CDI specification

Prerequisites

  • You installed either the NVIDIA Container Toolkit or the nvidia-container-toolkit-base package. The base package includes the container runtime and the nvidia-ctk command-line interface, but avoids installing the container runtime hook and transitive dependencies. The hook and dependencies are not needed on machines that use CDI exclusively. An example installation command follows this list.

  • You installed an NVIDIA GPU Driver.
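
For example, on an apt-based distribution with the NVIDIA repository already configured, you can install the base package on its own as follows (the package manager commands are illustrative; adjust for dnf- or zypper-based systems):

$ sudo apt-get update
$ sudo apt-get install -y nvidia-container-toolkit-base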

Procedure

Two common locations for CDI specifications are /etc/cdi/ and /var/run/cdi/. The contents of the /var/run/cdi/ directory are cleared on boot.

However, the path to create and use can depend on the container engine that you use.

  1. Generate the CDI specification file:

    $ sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
    

    The sample command uses sudo to ensure that the file at /etc/cdi/nvidia.yaml is created. You can omit the --output argument to print the generated specification to STDOUT.

    Example Output

    INFO[0000] Auto-detected mode as "nvml"
    INFO[0000] Selecting /dev/nvidia0 as /dev/nvidia0
    INFO[0000] Selecting /dev/dri/card1 as /dev/dri/card1
    INFO[0000] Selecting /dev/dri/renderD128 as /dev/dri/renderD128
    INFO[0000] Using driver version xxx.xxx.xx
    ...
    
  2. (Optional) Check the names of the generated devices:

    $ nvidia-ctk cdi list

    The following example output is for a machine with a single GPU that does not support MIG.

    INFO[0000] Found 9 CDI devices
    nvidia.com/gpu=all
    nvidia.com/gpu=0
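
You can also inspect the generated file directly. The following is a heavily truncated sketch of a generated specification; the cdiVersion, device names, and paths are illustrative and vary by host and toolkit version.

$ head /etc/cdi/nvidia.yaml
cdiVersion: 0.5.0
kind: nvidia.com/gpu
devices:
- name: "0"
  containerEdits:
    deviceNodes:
    - path: /dev/nvidia0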
    

Important

You must generate a new CDI specification after any of the following changes:

  • You change the device or CUDA driver configuration.

  • You use a location such as /var/run/cdi that is cleared on boot.

A configuration change can occur when MIG devices are created or removed, or when the driver is upgraded.
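
For example, after creating MIG devices you might regenerate the specification as follows. The MIG profile name below is illustrative; the available profiles depend on your GPU model.

$ sudo nvidia-smi mig -cgi 1g.5gb -C   # create a GPU instance and its compute instance (illustrative profile)
$ sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml   # regenerate the spec so the new MIG devices are included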

Running a Workload with CDI

Using CDI to inject NVIDIA devices can conflict with using the NVIDIA Container Runtime hook. This means that if a /usr/share/containers/oci/hooks.d/oci-nvidia-hook.json file exists, delete it or ensure that you do not run containers with the NVIDIA_VISIBLE_DEVICES environment variable set.
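
For example, to remove the hook file on hosts where it exists:

$ sudo rm /usr/share/containers/oci/hooks.d/oci-nvidia-hook.json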

The use of the CDI specification depends on the CDI-enabled container engine or CLI that you use. In the case of podman, for example, releases as of v4.1.0 include support for specifying CDI devices in the --device argument. Assuming that you generated a CDI specification as in the preceding section, running a container with access to all NVIDIA GPUs requires the following command:

$ podman run --rm --device nvidia.com/gpu=all --security-opt=label=disable ubuntu nvidia-smi -L

The preceding sample command should show the same output as running nvidia-smi -L on the host.

The CDI specification also contains references to individual GPUs or MIG devices. You can request these by specifying their names when launching a container, as in the following example:

$ podman run --rm \
    --device nvidia.com/gpu=0 \
    --device nvidia.com/gpu=1:0 \
    --security-opt=label=disable \
    ubuntu nvidia-smi -L

The preceding sample command requests the full GPU with index 0 and the first MIG device on GPU 1. The output should show only the UUIDs of the requested devices.

Using CDI with Non-CDI-Enabled Runtimes

To support runtimes that do not natively support CDI, you can configure the NVIDIA Container Runtime in cdi mode. In this mode, the NVIDIA Container Runtime does not inject the NVIDIA Container Runtime Hook into the incoming OCI runtime specification. Instead, the runtime itself performs the injection of the requested CDI devices.

The NVIDIA Container Runtime automatically uses cdi mode if you request devices by their CDI device names.

Using Docker as an example of a non-CDI-enabled runtime, the following command uses CDI to inject the requested devices into the container:

$ docker run --rm -ti --runtime=nvidia \
    -e NVIDIA_VISIBLE_DEVICES=nvidia.com/gpu=all \
      ubuntu nvidia-smi -L

The NVIDIA_VISIBLE_DEVICES environment variable indicates which devices to inject into the container and is explicitly set to nvidia.com/gpu=all.

Setting the CDI Mode Explicitly

You can force CDI mode by explicitly setting the nvidia-container-runtime.mode option in the NVIDIA Container Runtime config to cdi:

$ sudo nvidia-ctk config --in-place --set nvidia-container-runtime.mode=cdi

In this case, the NVIDIA_VISIBLE_DEVICES environment variable is still used to select the devices to inject into the container, but the nvidia-container-runtime.modes.cdi.default-kind option (with a default value of nvidia.com/gpu) is used to construct a fully qualified CDI device name when you specify only a device index such as 0 or 1, or the keyword all.

This means that if CDI mode is explicitly enabled, the following sample command has the same effect as specifying NVIDIA_VISIBLE_DEVICES=nvidia.com/gpu=all.

$ docker run --rm -ti --runtime=nvidia \
    -e NVIDIA_VISIBLE_DEVICES=all \
      ubuntu nvidia-smi -L
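
To return to the default behavior, you can set the mode back to auto, which selects between the legacy and cdi modes automatically:

$ sudo nvidia-ctk config --in-place --set nvidia-container-runtime.mode=auto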