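How to Set Up OpenCL for NVIDIA and AMD GPUs on Linux and Docker

OpenCL is short for "Open Computing Language". It is a programming framework for accelerated computing that works across many platforms, which is why it is often described as a cross-platform compute language. You can write a program in OpenCL and run it on a variety of devices, including CPUs, GPUs, FPGAs and more.

In this guide I focus only on GPUs. I have worked with both NVIDIA and AMD GPUs, and I will show you the simplest way to get them running with OpenCL.

Although I use Ubuntu as the host system, the Docker sections apply to any other Linux distribution.

Prerequisites
  • An NVIDIA or AMD graphics card
  • Ubuntu Linux 20.04.2 LTS Desktop/Server, 64-bit
  • Docker (for application-specific use cases)

So, let's get into the details!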

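Setting up OpenCL for NVIDIA GPUs

I will first show you how to make sure OpenCL runs on your main Ubuntu desktop/server. After that, I will show you how to run a Docker container with an NVIDIA GPU for the same purpose.

Running OpenCL on the host system

On a fresh Ubuntu system, you first need to install the proprietary NVIDIA driver and CUDA. The latter gives you the OpenCL framework that is bundled with it. Finally, install the clinfo program to confirm that OpenCL is installed correctly; it displays the OpenCL specifications of your NVIDIA GPU in detail. Let's see how:

Check the recommended driver

Use the ubuntu-drivers devices command to get the name of the recommended driver: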

iborg@iborg-Nitro-AN515-52:~$ ubuntu-drivers devices
== /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0 ==
modalias : pci:v000010DEd00001C8Csv00001025sd00001265bc03sc00i00
vendor   : NVIDIA Corporation
model    : GP107M [GeForce GTX 1050 Ti Mobile]
driver   : nvidia-driver-460 - distro non-free recommended
driver   : nvidia-driver-418-server - distro non-free
driver   : nvidia-driver-390 - distro non-free
driver   : nvidia-driver-450-server - distro non-free
driver   : nvidia-driver-465 - distro non-free
driver   : nvidia-driver-460-server - distro non-free
driver   : xserver-xorg-video-nouveau - distro free builtin

Above, note that the recommended driver is nvidia-driver-460.
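
If you would rather not read the list by eye, the recommended entry can be picked out with a small filter. This is only a sketch: the function name recommended_driver is my own, and it assumes the output keeps the "driver : name - ... recommended" layout shown above.

```shell
# Reads `ubuntu-drivers devices` output on stdin and prints the package name
# from the first "driver" line tagged "recommended".
# (recommended_driver is a hypothetical helper name, not part of any tool.)
recommended_driver() {
  awk '!found && $1 == "driver" && /recommended/ { print $3; found = 1 }'
}

# Usage on a real system:
#   ubuntu-drivers devices | recommended_driver
```

On the machine above this would print nvidia-driver-460, which you can then pass to apt.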

Install all the necessary packages

So, let's install the recommended driver along with CUDA and the clinfo package mentioned earlier in this section:

sudo apt install nvidia-driver-460 nvidia-cuda-toolkit clinfo

After installing these three packages, reboot your Ubuntu desktop/server.

Verify your OpenCL configuration

iborg@iborg-Nitro-AN515-52:~$ clinfo
Number of platforms                               1
  Platform Name                                   NVIDIA CUDA
  Platform Vendor                                 NVIDIA Corporation
  Platform Version                                OpenCL 1.2 CUDA 9.1.84
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer
  Platform Extensions function suffix             NV

  Platform Name                                   NVIDIA CUDA
Number of devices                                 1
  Device Name                                     GeForce GTX 1050 Ti
  Device Vendor                                   NVIDIA Corporation
  Device Vendor ID                                0x10de
  Device Version                                  OpenCL 1.2 CUDA
  Driver Version                                  390.143
  Device OpenCL C Version                         OpenCL C 1.2 
  Device Type                                     GPU
  Device Topology (NV)                            PCI-E, 01:00.0
  Device Profile                                  FULL_PROFILE
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Max compute units                               6
  Max clock frequency                             1620MHz
  Compute Capability (NV)                         6.1
  Device Partition                                (core)
    Max number of sub-devices                     1
    Supported partition types                     None
  Max work item dimensions                        3
  Max work item sizes                             1024x1024x64
  Max work group size                             1024
  Preferred work group size multiple              32
  Warp size (NV)                                  32
  Preferred / native vector sizes                 
    char                                                 1 / 1       
    short                                                1 / 1       
    int                                                  1 / 1       
    long                                                 1 / 1       
    half                                                 0 / 0        (n/a)
    float                                                1 / 1       
    double                                               1 / 1        (cl_khr_fp64)
  Half-precision Floating-point support           (n/a)
  Single-precision Floating-point support         (core)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  Yes
  Double-precision Floating-point support         (cl_khr_fp64)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
  Address bits                                    64, Little-Endian
  Global memory size                              4236312576 (3.945GiB)
  Error Correction support                        No
  Max memory allocation                           1059078144 (1010MiB)
  Unified memory for Host and Device              No
  Integrated memory (NV)                          No
  Minimum alignment for any data type             128 bytes
  Alignment of base address                       4096 bits (512 bytes)
  Global Memory cache type                        Read/Write
  Global Memory cache size                        98304 (96KiB)
  Global Memory cache line size                   128 bytes
  Image support                                   Yes
    Max number of samplers per kernel             32
    Max size for 1D images from buffer            134217728 pixels
    Max 1D or 2D image array size                 2048 images
    Max 2D image size                             16384x32768 pixels
    Max 3D image size                             16384x16384x16384 pixels
    Max number of read image args                 256
    Max number of write image args                16
  Local memory type                               Local
  Local memory size                               49152 (48KiB)
  Registers per block (NV)                        65536
  Max number of constant args                     9
  Max constant buffer size                        65536 (64KiB)
  Max size of kernel argument                     4352 (4.25KiB)
  Queue properties                                
    Out-of-order execution                        Yes
    Profiling                                     Yes
  Prefer user sync for interop                    No
  Profiling timer resolution                      1000ns
  Execution capabilities                          
    Run OpenCL kernels                            Yes
    Run native kernels                            No
    Kernel execution timeout (NV)                 Yes
  Concurrent copy and kernel execution (NV)       Yes
    Number of async copy engines                  2
  printf() buffer size                            1048576 (1024KiB)
  Built-in kernels                                
  Device Extensions                               cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  NVIDIA CUDA
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   Success [NV]
  clCreateContext(NULL, ...) [default]            Success [NV]
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  No platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  No platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  Invalid device type for platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  No platform

ICD loader properties
  ICD loader Name                                 OpenCL ICD Loader
  ICD loader Vendor                               OCL Icd free software
  ICD loader Version                              2.2.11
  ICD loader Profile                              OpenCL 2.1

Note that only the platform name here is "NVIDIA CUDA". CUDA and OpenCL are still two different things.
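
To see this at a glance, you can pair each platform name with its version string, which is where the actual OpenCL version appears. The following is a rough sketch; clinfo_platforms is a name I made up, and it assumes the "Platform Name" and "Platform Version" labels look like the ones in the output above.

```shell
# Reads clinfo output on stdin and prints "name | version" for each platform.
# (clinfo_platforms is a hypothetical helper name.)
clinfo_platforms() {
  awk '
    /^ *Platform Name /    { sub(/^ *Platform Name +/, "");    name = $0 }
    /^ *Platform Version / { sub(/^ *Platform Version +/, ""); print name " | " $0 }
  '
}

# Usage on a real system:
#   clinfo | clinfo_platforms
```

For the output above, this gives a single line pairing "NVIDIA CUDA" with "OpenCL 1.2 CUDA 9.1.84".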

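That's it! You can now run OpenCL applications with your NVIDIA GPU on the host system!

OpenCL on Docker for NVIDIA GPUs

Now that you have OpenCL up and running on the bare-metal system, let's see how to set it up in a Docker container!

Install the NVIDIA container runtime

Here, you additionally have to install the nvidia-container-runtime package.

To be able to install it, you first have to add the repository details. If Curl is not yet installed on your system, install it first.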

sudo apt install curl
curl -s -L https://nvidia.github.io/nvidia-container-runtime/gpgkey | \
  sudo apt-key add -
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-container-runtime/$distribution/nvidia-container-runtime.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-runtime.list
sudo apt update
sudo apt install nvidia-container-runtime
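
The distribution variable in the commands above is built by sourcing /etc/os-release and joining its ID and VERSION_ID fields; the result selects which repository list gets downloaded. Here is a small sketch of the same idea as a reusable function (the name distribution_id is my own), which is handy for checking the value before you run the curl line:

```shell
# Prints "$ID$VERSION_ID" from an os-release style file.
# Defaults to /etc/os-release; pass an absolute path to read another file.
# The parentheses run the body in a subshell so the sourced variables
# do not leak into your session.
# (distribution_id is a hypothetical helper name.)
distribution_id() (
  . "${1:-/etc/os-release}"
  printf '%s%s\n' "$ID" "$VERSION_ID"
)

# Usage on a real system:
#   distribution_id
```

On Ubuntu 20.04 this prints ubuntu20.04.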

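Create the Dockerfile

You need to replicate everything you did on the host system in a fresh image, so that you can use it to launch custom OpenCL applications in a container (more on that later).

Create a new directory for your NVIDIA GPU OpenCL project and move into it: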

mkdir nvidia-opencl
cd nvidia-opencl

Create the following Dockerfile with your favorite text editor (Vim, Nano or any other) and save it:

FROM ubuntu:20.04
ARG DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get -y upgrade \
  && apt-get install -y \
    apt-utils \
    unzip \
    tar \
    curl \
    xz-utils \
    ocl-icd-libopencl1 \
    opencl-headers \
    clinfo \
    ;

RUN mkdir -p /etc/OpenCL/vendors && \
    echo "libnvidia-opencl.so.1" > /etc/OpenCL/vendors/nvidia.icd
ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility
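
The echo line in the Dockerfile above is what lets the ICD loader (ocl-icd-libopencl1) find NVIDIA's OpenCL library: the loader reads the .icd files under /etc/OpenCL/vendors and loads the library named inside each one. If clinfo later reports zero platforms, a quick look at those files is a good first check. A minimal sketch, with list_icds being a name I chose for illustration:

```shell
# Prints "file -> library" for every .icd file in a vendors directory.
# Defaults to /etc/OpenCL/vendors.
# (list_icds is a hypothetical helper name.)
list_icds() {
  dir=${1:-/etc/OpenCL/vendors}
  for f in "$dir"/*.icd; do
    [ -e "$f" ] || continue   # skip the unexpanded pattern when no file matches
    printf '%s -> %s\n' "${f##*/}" "$(head -n 1 "$f")"
  done
}

# Usage inside the container:
#   list_icds
```

For the image built from this Dockerfile, the expected entry is nvidia.icd pointing at libnvidia-opencl.so.1.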

Build the Dockerfile

Now that you have the Dockerfile you need to get started, let's build it. I am naming the image nvidia-opencl:

docker build -t nvidia-opencl .

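Launch the OpenCL container

With the new image you just built, it's time to launch a new OpenCL container!

First, allow your Linux username on the local machine to connect to the X Window display with the following command: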

xhost +local:username

With the following command, you can now drop straight into the shell of a local container based on the image you just created:

docker run --rm -it --gpus all -v /tmp/.X11-unix:/tmp/.X11-unix -e DISPLAY=$DISPLAY nvidia-opencl

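Verify your OpenCL configuration on Docker

Now that you are in the container shell, you can run the clinfo command to verify your OpenCL configuration, just as you did on the bare-metal host system: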

root@7b39b04c019f:/# clinfo
Number of platforms                               1
  Platform Name                                   NVIDIA CUDA
  Platform Vendor                                 NVIDIA Corporation
  Platform Version                                OpenCL 1.2 CUDA 9.1.84
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer
  Platform Extensions function suffix             NV

  Platform Name                                   NVIDIA CUDA
Number of devices                                 1
  Device Name                                     GeForce GTX 1050 Ti
  Device Vendor                                   NVIDIA Corporation
  Device Vendor ID                                0x10de
  Device Version                                  OpenCL 1.2 CUDA
  Driver Version                                  390.143
  Device OpenCL C Version                         OpenCL C 1.2 
  Device Type                                     GPU
  Device Topology (NV)                            PCI-E, 01:00.0
  Device Profile                                  FULL_PROFILE
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Max compute units                               6
  Max clock frequency                             1620MHz
  Compute Capability (NV)                         6.1
  Device Partition                                (core)
    Max number of sub-devices                     1
    Supported partition types                     None
    Supported affinity domains                    (n/a)
  Max work item dimensions                        3
  Max work item sizes                             1024x1024x64
  Max work group size                             1024
  Preferred work group size multiple              32
  Warp size (NV)                                  32
  Preferred / native vector sizes                 
    char                                                 1 / 1       
    short                                                1 / 1       
    int                                                  1 / 1       
    long                                                 1 / 1       
    half                                                 0 / 0        (n/a)
    float                                                1 / 1       
    double                                               1 / 1        (cl_khr_fp64)
  Half-precision Floating-point support           (n/a)
  Single-precision Floating-point support         (core)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  Yes
  Double-precision Floating-point support         (cl_khr_fp64)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
  Address bits                                    64, Little-Endian
  Global memory size                              4236312576 (3.945GiB)
  Error Correction support                        No
  Max memory allocation                           1059078144 (1010MiB)
  Unified memory for Host and Device              No
  Integrated memory (NV)                          No
  Minimum alignment for any data type             128 bytes
  Alignment of base address                       4096 bits (512 bytes)
  Global Memory cache type                        Read/Write
  Global Memory cache size                        98304 (96KiB)
  Global Memory cache line size                   128 bytes
  Image support                                   Yes
    Max number of samplers per kernel             32
    Max size for 1D images from buffer            134217728 pixels
    Max 1D or 2D image array size                 2048 images
    Max 2D image size                             16384x32768 pixels
    Max 3D image size                             16384x16384x16384 pixels
    Max number of read image args                 256
    Max number of write image args                16
  Local memory type                               Local
  Local memory size                               49152 (48KiB)
  Registers per block (NV)                        65536
  Max number of constant args                     9
  Max constant buffer size                        65536 (64KiB)
  Max size of kernel argument                     4352 (4.25KiB)
  Queue properties                                
    Out-of-order execution                        Yes
    Profiling                                     Yes
  Prefer user sync for interop                    No
  Profiling timer resolution                      1000ns
  Execution capabilities                          
    Run OpenCL kernels                            Yes
    Run native kernels                            No
    Kernel execution timeout (NV)                 Yes
  Concurrent copy and kernel execution (NV)       Yes
    Number of async copy engines                  2
  printf() buffer size                            1048576 (1024KiB)
  Built-in kernels                                (n/a)
  Device Extensions                               cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  NVIDIA CUDA
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   Success [NV]
  clCreateContext(NULL, ...) [default]            Success [NV]
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  No platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  No platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  Invalid device type for platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  No platform

ICD loader properties
  ICD loader Name                                 OpenCL ICD Loader
  ICD loader Vendor                               OCL Icd free software
  ICD loader Version                              2.2.11
  ICD loader Profile                              OpenCL 2.1
root@7b39b04c019f:/#

What does this mean? It means you can now run any OpenCL application inside this container! You only need to modify the Dockerfile accordingly.
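
If you want to run the same verification from a script instead of an interactive shell, one option is to count the devices that clinfo reports. This is a sketch under the assumption that the "Number of devices" lines keep the layout shown above; opencl_device_count is a name I made up.

```shell
# Reads clinfo output on stdin and prints the total number of devices,
# summed over all "Number of devices" lines (one per platform).
# (opencl_device_count is a hypothetical helper name.)
opencl_device_count() {
  awk '/^Number of devices/ { total += $NF } END { print total + 0 }'
}

# Usage on the host, without entering the container shell:
#   docker run --rm --gpus all nvidia-opencl clinfo | opencl_device_count
```

A result of 0 means the container cannot see the GPU, which usually points at the container runtime or the --gpus flag rather than at OpenCL itself.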

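You can also work with Python applications that need an OpenCL backend. Check out my earlier article, which serves as a handy companion to this one; you may want to go through it and experiment with the Dockerfiles.

Setting up OpenCL for AMD GPUs

I will first show you how to make sure OpenCL runs on your main Ubuntu desktop/server. After that, I will show you how to run a Docker container with an AMD GPU for the same purpose.

Running OpenCL on the host system

On a fresh Ubuntu system, you first need to download the "AMDGPU driver" from the AMD support page. For a future-proof configuration, once you have the installation archive (tar.xz) you only need to install OpenCL for both legacy and newer AMD GPUs.

Finally, install the clinfo program to confirm that OpenCL is installed correctly; it displays the OpenCL specifications of your AMD GPU in detail. The whole process can be a bit more involved than expected, though. Let's see how.

Download the AMDGPU driver with Curl

Browse the AMD support page and download the relevant driver with Curl. Make sure Curl is installed.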

sudo apt install curl
curl -e https://drivers.amd.com/drivers/linux -O https://drivers.amd.com/drivers/linux/amdgpu-pro-21.10-1247438-ubuntu-20.04.tar.xz

Installation, anomalies and their workarounds

Extract the archive:

tar -Jxvf amdgpu-pro-21.10-1247438-ubuntu-20.04.tar.xz

Move into the new directory:

cd amdgpu-pro-21.10-1247438-ubuntu-20.04

Now, I will install OpenCL for both legacy and newer GPUs:

./amdgpu-install --opencl=legacy,rocr --headless --no-dkms

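For a complete overview of its usage, you can run ./amdgpu-install -h to learn how the script works. It is similar to a command's man page. The --headless option requests OpenCL support only, and --no-dkms tells the script not to install the amdgpu-dkms and amdgpu-dkms-firmware packages into the kernel. You do not need those.

For a long time, it has been observed that even when you specify the --no-dkms option, the script does not honor it and goes ahead and installs those unnecessary packages anyway. On top of that, when I let amdgpu-dkms install and modify the kernel configuration, the system refused to reboot or shut down! This happened after I received a kernel update from the Ubuntu repositories.

In that situation, this is what I did:

I manually installed the following packages, all present in the extracted directory, with dpkg -i package-name.deb: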

amdgpu-pin_21.10-1247438_all.deb
amdgpu-core_21.10-1247438_all.deb
amdgpu-pro-core_21.10-1247438_all.deb
libdrm-amdgpu-common_1.0.0-1247438_all.deb
libdrm2-amdgpu_2.4.100-1247438_amd64.deb
libdrm-amdgpu-amdgpu1_2.4.100-1247438_amd64.deb
hsakmt-roct-amdgpu_1.0.9-1247438_amd64.deb
hsa-runtime-rocr-amdgpu_1.3.0-1247438_amd64.deb
comgr-amdgpu-pro_2.0.0-1247438_amd64.deb
hip-rocr-amdgpu-pro_21.10-1247438_amd64.deb
ocl-icd-libopencl1-amdgpu-pro_21.10-1247438_amd64.deb
clinfo-amdgpu-pro_21.10-1247438_amd64.deb
opencl-rocr-amdgpu-pro_21.10-1247438_amd64.deb
libllvm11.0-amdgpu_11.0-1247438_amd64.deb
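
Since this is a long list, it can help to confirm that every file is really present before you start installing, so that a typo does not leave you with a half-installed set. A minimal sketch; missing_debs is a hypothetical helper name, and the file names you pass are the ones listed above.

```shell
# Prints each file name from the argument list that is not present in
# directory $1, and returns non-zero if at least one is missing.
# (missing_debs is a hypothetical helper name.)
missing_debs() {
  dir=$1
  shift
  rc=0
  for pkg in "$@"; do
    if [ ! -f "$dir/$pkg" ]; then
      printf '%s\n' "$pkg"
      rc=1
    fi
  done
  return "$rc"
}

# Usage from inside the extracted directory (list shortened here):
#   missing_debs . amdgpu-pin_21.10-1247438_all.deb amdgpu-core_21.10-1247438_all.deb \
#     && echo "all files present"
```

Once nothing is reported as missing, install the packages with dpkg -i as described.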

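This made sure that amdgpu-dkms and amdgpu-dkms-firmware were avoided and the kernel was left untouched. Also note that I downloaded the older 21.10 driver even though the newer 21.30 release was available. The reason is that the latter refused to recognize my Radeon VII GPU, giving an "HSA Error" when I later ran clinfo: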

HSA Error: Incompatible kernel and userspace, Vega 20 [Radeon VII] disabled. Upgrade amdgpu.
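
If you end up trying several driver releases, checking for this message by script saves some scrolling. The sketch below assumes the message starts a line with "HSA Error:" exactly as shown above; has_hsa_error is a name I made up, and because I am not sure which stream the message is printed on, the usage example merges stderr into stdout.

```shell
# Succeeds if the text on stdin contains a line starting with "HSA Error:".
# (has_hsa_error is a hypothetical helper name.)
has_hsa_error() {
  grep -q '^HSA Error:'
}

# Usage on a real system:
#   if clinfo 2>&1 | has_hsa_error; then
#     echo "This driver release does not work with the running kernel"
#   fi
```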

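After dealing with these anomalies, I was able to get clinfo to report my GPU correctly.

Install the clinfo package

Install the clinfo package just as you did earlier for the NVIDIA GPU: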

sudo apt install clinfo

Verify your OpenCL configuration

avimanyu@GizmoQuest-Computing-Lab:~$ clinfo
Number of platforms                               1
  Platform Name                                   AMD Accelerated Parallel Processing
  Platform Vendor                                 Advanced Micro Devices, Inc.
  Platform Version                                OpenCL 2.0 AMD-APP (3246.0)
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_icd cl_amd_event_callback 
  Platform Extensions function suffix             AMD

  Platform Name                                   AMD Accelerated Parallel Processing
Number of devices                                 1
  Device Name                                     gfx906:sramecc-:xnack-
  Device Vendor                                   Advanced Micro Devices, Inc.
  Device Vendor ID                                0x1002
  Device Version                                  OpenCL 2.0 
  Driver Version                                  3246.0 (HSA1.1,LC)
  Device OpenCL C Version                         OpenCL C 2.0 
  Device Type                                     GPU
  Device Board Name (AMD)                         Vega 20 [Radeon VII]
  Device Topology (AMD)                           PCI-E, 0a:00.0
  Device Profile                                  FULL_PROFILE
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Max compute units                               60
  SIMD per compute unit (AMD)                     4
  SIMD width (AMD)                                16
  SIMD instruction width (AMD)                    1
  Max clock frequency                             1801MHz
  Graphics IP (AMD)                               9.0
  Device Partition                                (core)
    Max number of sub-devices                     60
    Supported partition types                     None
    Supported affinity domains                    (n/a)
  Max work item dimensions                        3
  Max work item sizes                             1024x1024x1024
  Max work group size                             256
  Preferred work group size (AMD)                 256
  Max work group size (AMD)                       1024
  Preferred work group size multiple              64
  Wavefront width (AMD)                           64
  Preferred / native vector sizes                 
    char                                                 4 / 4       
    short                                                2 / 2       
    int                                                  1 / 1       
    long                                                 1 / 1       
    half                                                 1 / 1        (cl_khr_fp16)
    float                                                1 / 1       
    double                                               1 / 1        (cl_khr_fp64)
  Half-precision Floating-point support           (cl_khr_fp16)
    Denormals                                     No
    Infinity and NANs                             No
    Round to nearest                              No
    Round to zero                                 No
    Round to infinity                             No
    IEEE754-2008 fused multiply-add               No
    Support is emulated in software               No
  Single-precision Floating-point support         (core)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  Yes
  Double-precision Floating-point support         (cl_khr_fp64)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
  Address bits                                    64, Little-Endian
  Global memory size                              17163091968 (15.98GiB)
  Global free memory (AMD)                        16760832 (15.98GiB)
  Global memory channels (AMD)                    128
  Global memory banks per channel (AMD)           4
  Global memory bank width (AMD)                  256 bytes
  Error Correction support                        No
  Max memory allocation                           14588628168 (13.59GiB)
  Unified memory for Host and Device              No
  Shared Virtual Memory (SVM) capabilities        (core)
    Coarse-grained buffer sharing                 Yes
    Fine-grained buffer sharing                   Yes
    Fine-grained system sharing                   No
    Atomics                                       No
  Minimum alignment for any data type             128 bytes
  Alignment of base address                       1024 bits (128 bytes)
  Preferred alignment for atomics                 
    SVM                                           0 bytes
    Global                                        0 bytes
    Local                                         0 bytes
  Max size for global variable                    14588628168 (13.59GiB)
  Preferred total size of global vars             17163091968 (15.98GiB)
  Global Memory cache type                        Read/Write
  Global Memory cache size                        16384 (16KiB)
  Global Memory cache line size                   64 bytes
  Image support                                   Yes
    Max number of samplers per kernel             26287
    Max size for 1D images from buffer            134217728 pixels
    Max 1D or 2D image array size                 8192 images
    Base address alignment for 2D image buffers   256 bytes
    Pitch alignment for 2D image buffers          256 pixels
    Max 2D image size                             16384x16384 pixels
    Max 3D image size                             16384x16384x8192 pixels
    Max number of read image args                 128
    Max number of write image args                8
    Max number of read/write image args           64
  Max number of pipe args                         16
  Max active pipe reservations                    16
  Max pipe packet size                            1703726280 (1.587GiB)
  Local memory type                               Local
  Local memory size                               65536 (64KiB)
  Local memory syze per CU (AMD)                  65536 (64KiB)
  Local memory banks (AMD)                        32
  Max number of constant args                     8
  Max constant buffer size                        14588628168 (13.59GiB)
  Preferred constant buffer size (AMD)            16384 (16KiB)
  Max size of kernel argument                     1024
  Queue properties (on host)                      
    Out-of-order execution                        No
    Profiling                                     Yes
  Queue properties (on device)                    
    Out-of-order execution                        Yes
    Profiling                                     Yes
    Preferred size                                262144 (256KiB)
    Max size                                      8388608 (8MiB)
  Max queues on device                            1
  Max events on device                            1024
  Prefer user sync for interop                    Yes
  Number of P2P devices (AMD)                     0
  P2P devices (AMD)                               <printDeviceInfo:147: get number of CL_DEVICE_P2P_DEVICES_AMD : error -30>
  Profiling timer resolution                      1ns
  Profiling timer offset since Epoch (AMD)        0ns (Thu Jan  1 05:30:00 1970)
  Execution capabilities                          
    Run OpenCL kernels                            Yes
    Run native kernels                            No
    Thread trace supported (AMD)                  No
    Number of async queues (AMD)                  8
    Max real-time compute queues (AMD)            8
    Max real-time compute units (AMD)             60
  printf() buffer size                            4194304 (4MiB)
  Built-in kernels                                (n/a)
  Device Extensions                               cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program 

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  No platform
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   No platform
  clCreateContext(NULL, ...) [default]            No platform
  clCreateContext(NULL, ...) [other]              Success [AMD]
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  Success (1)
    Platform Name                                 AMD Accelerated Parallel Processing
    Device Name                                   gfx906:sramecc-:xnack-
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  Success (1)
    Platform Name                                 AMD Accelerated Parallel Processing
    Device Name                                   gfx906:sramecc-:xnack-
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  Success (1)
    Platform Name                                 AMD Accelerated Parallel Processing
    Device Name                                   gfx906:sramecc-:xnack-

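So, you can now run OpenCL applications with your AMD GPU on the host system!

OpenCL on Docker for AMD GPUs

How about doing the same through a Docker container? Let's see how much it differs from the NVIDIA GPU setup.

Create the Dockerfile

Create a new directory for your AMD GPU OpenCL project and move into it: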

mkdir amd-opencl
cd amd-opencl

Create the following Dockerfile with your favorite text editor (Vim, Nano or any other) and save it:

FROM ubuntu:20.04
ARG DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get -y upgrade \
  && apt-get install -y \
    initramfs-tools \
    apt-utils \
    unzip \
    tar \
    curl \
    xz-utils \
    ocl-icd-libopencl1 \
    opencl-headers \
    clinfo \
    ;

ARG AMD_DRIVER=amdgpu-pro-21.10-1247438-ubuntu-20.04.tar.xz
ARG AMD_DRIVER_URL=https://drivers.amd.com/drivers/linux
RUN mkdir -p /tmp/opencl-driver-amd
WORKDIR /tmp/opencl-driver-amd
RUN curl --referer $AMD_DRIVER_URL -O $AMD_DRIVER_URL/$AMD_DRIVER; \
    tar -Jxvf $AMD_DRIVER; \
    cd amdgpu-pro-*; \
    ./amdgpu-install --opencl=legacy,rocr --headless --no-dkms -y; \
    rm -rf /tmp/opencl-driver-amd;

RUN mkdir -p /etc/OpenCL/vendors && \
    echo "libamdocl64.so" > /etc/OpenCL/vendors/amdocl64.icd
RUN ln -s /usr/lib/x86_64-linux-gnu/libOpenCL.so.1 /usr/lib/libOpenCL.so
WORKDIR /

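I had to add the initramfs-tools package because amdgpu-dkms and amdgpu-dkms-firmware still get installed. I left it that way because the reboot and shutdown problem I mentioned earlier is irrelevant inside a container.

Alternatively, you can still use the dpkg -i approach in the Dockerfile.

Build the Dockerfile

Now that you have the Dockerfile you need to get started, let's build it. I am naming the image amd-opencl: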

docker build -t amd-opencl .

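Add your username to the video and render groups

For the AMD GPU Docker container to work properly, it is best to add your username to the video and render groups: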

sudo usermod -a -G video $LOGNAME
sudo usermod -a -G render $LOGNAME
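
Group changes made with usermod only take effect in new login sessions, so it can help to confirm the membership before launching the container. The following is a minimal sketch, not part of the original setup: missing_groups is a hypothetical helper name, and it relies only on the standard id -nG command.

```shell
#!/usr/bin/env bash
# Report which of the required groups are missing from a group list.
# Arguments: the user's space-separated group list, then the required groups.
# (missing_groups is a hypothetical helper, introduced here for illustration.)
missing_groups() {
    local have=" $1 "
    shift
    local missing=()
    local g
    for g in "$@"; do
        # Pad with spaces so that "video" does not match a name like "videoxyz".
        [[ "$have" == *" $g "* ]] || missing+=("$g")
    done
    echo "${missing[*]}"
}

# id -nG prints the group names of the given user.
current_groups="$(id -nG "${LOGNAME:-$(id -un)}")"
missing="$(missing_groups "$current_groups" video render)"
if [ -z "$missing" ]; then
    echo "OK: user is in the video and render groups"
else
    echo "Not yet a member of: $missing (log out and back in after usermod)"
fi
```

If the script still reports a missing group right after running usermod, logging out and back in (or rebooting) is usually what is needed.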

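Launch the OpenCL container

With the new image you just built, it's time to launch a new OpenCL container!

Use the following command to allow your Linux username on the local machine to connect to the X Window display:

xhost +local:username

With the following command, you can now drop straight into a shell in a local container based on the image you just created: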

docker run --rm -it --device=/dev/kfd --device=/dev/dri --group-add video --group-add render -v /tmp/.X11-unix:/tmp/.X11-unix -e DISPLAY=$DISPLAY amd-opencl
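
The docker run command above passes /dev/kfd and /dev/dri into the container, so it only works if both exist on the host. As a hedged pre-flight sketch (missing_paths is a hypothetical helper name, not something the driver or Docker provides), you could check for them first:

```shell
#!/usr/bin/env bash
# Print every path from the argument list that does not exist.
# (missing_paths is a hypothetical helper, introduced here for illustration.)
missing_paths() {
    local p
    for p in "$@"; do
        [ -e "$p" ] || echo "$p"
    done
}

# /dev/kfd and /dev/dri are the two device paths handed to docker run above.
missing="$(missing_paths /dev/kfd /dev/dri)"
if [ -n "$missing" ]; then
    echo "Missing on this host:"
    echo "$missing"
else
    echo "Both device paths exist; docker run can pass them through."
fi
```

If either path is reported as missing, the AMD driver on the host is most likely not loaded, and the container will not see the GPU either.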

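Verify your OpenCL configuration on Docker

Now that you are in the container shell, you can run the clinfo command to verify your OpenCL configuration, just as you did on the bare-metal host system: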

root@00ec73e147bc:/# clinfo
Platform Name                                   AMD Accelerated Parallel Processing
Number of devices                                 1
  Device Name                                     gfx906:sramecc-:xnack-
  Device Vendor                                   Advanced Micro Devices, Inc.
  Device Vendor ID                                0x1002
  Device Version                                  OpenCL 2.0 
  Driver Version                                  3246.0 (HSA1.1,LC)
  Device OpenCL C Version                         OpenCL C 2.0 
  Device Type                                     GPU
  Device Board Name (AMD)                         Device 66af
  Device Topology (AMD)                           PCI-E, 0a:00.0
  Device Profile                                  FULL_PROFILE
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Max compute units                               60
  SIMD per compute unit (AMD)                     4
  SIMD width (AMD)                                16
  SIMD instruction width (AMD)                    1
  Max clock frequency                             1801MHz
  Graphics IP (AMD)                               9.0
  Device Partition                                (core)
    Max number of sub-devices                     60
    Supported partition types                     None
    Supported affinity domains                    (n/a)
  Max work item dimensions                        3
  Max work item sizes                             1024x1024x1024
  Max work group size                             256
  Preferred work group size (AMD)                 256
  Max work group size (AMD)                       1024
  Preferred work group size multiple              64
  Wavefront width (AMD)                           64
  Preferred / native vector sizes                 
    char                                                 4 / 4       
    short                                                2 / 2       
    int                                                  1 / 1       
    long                                                 1 / 1       
    half                                                 1 / 1        (cl_khr_fp16)
    float                                                1 / 1       
    double                                               1 / 1        (cl_khr_fp64)
  Half-precision Floating-point support           (cl_khr_fp16)
    Denormals                                     No
    Infinity and NANs                             No
    Round to nearest                              No
    Round to zero                                 No
    Round to infinity                             No
    IEEE754-2008 fused multiply-add               No
    Support is emulated in software               No
  Single-precision Floating-point support         (core)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  Yes
  Double-precision Floating-point support         (cl_khr_fp64)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
  Address bits                                    64, Little-Endian
  Global memory size                              17163091968 (15.98GiB)
  Global free memory (AMD)                        16760832 (15.98GiB)
  Global memory channels (AMD)                    128
  Global memory banks per channel (AMD)           4
  Global memory bank width (AMD)                  256 bytes
  Error Correction support                        No
  Max memory allocation                           14588628168 (13.59GiB)
  Unified memory for Host and Device              No
  Shared Virtual Memory (SVM) capabilities        (core)
    Coarse-grained buffer sharing                 Yes
    Fine-grained buffer sharing                   Yes
    Fine-grained system sharing                   No
    Atomics                                       No
  Minimum alignment for any data type             128 bytes
  Alignment of base address                       1024 bits (128 bytes)
  Preferred alignment for atomics                 
    SVM                                           0 bytes
    Global                                        0 bytes
    Local                                         0 bytes
  Max size for global variable                    14588628168 (13.59GiB)
  Preferred total size of global vars             17163091968 (15.98GiB)
  Global Memory cache type                        Read/Write
  Global Memory cache size                        16384 (16KiB)
  Global Memory cache line size                   64 bytes
  Image support                                   Yes
    Max number of samplers per kernel             26287
    Max size for 1D images from buffer            134217728 pixels
    Max 1D or 2D image array size                 8192 images
    Base address alignment for 2D image buffers   256 bytes
    Pitch alignment for 2D image buffers          256 pixels
    Max 2D image size                             16384x16384 pixels
    Max 3D image size                             16384x16384x8192 pixels
    Max number of read image args                 128
    Max number of write image args                8
    Max number of read/write image args           64
  Max number of pipe args                         16
  Max active pipe reservations                    16
  Max pipe packet size                            1703726280 (1.587GiB)
  Local memory type                               Local
  Local memory size                               65536 (64KiB)
  Local memory syze per CU (AMD)                  65536 (64KiB)
  Local memory banks (AMD)                        32
  Max number of constant args                     8
  Max constant buffer size                        14588628168 (13.59GiB)
  Preferred constant buffer size (AMD)            16384 (16KiB)
  Max size of kernel argument                     1024
  Queue properties (on host)                      
    Out-of-order execution                        No
    Profiling                                     Yes
  Queue properties (on device)                    
    Out-of-order execution                        Yes
    Profiling                                     Yes
    Preferred size                                262144 (256KiB)
    Max size                                      8388608 (8MiB)
  Max queues on device                            1
  Max events on device                            1024
  Prefer user sync for interop                    Yes
  Number of P2P devices (AMD)                     0
  P2P devices (AMD)                               <printDeviceInfo:147: get number of CL_DEVICE_P2P_DEVICES_AMD : error -30>
  Profiling timer resolution                      1ns
  Profiling timer offset since Epoch (AMD)        0ns (Thu Jan  1 00:00:00 1970)
  Execution capabilities                          
    Run OpenCL kernels                            Yes
    Run native kernels                            No
    Thread trace supported (AMD)                  No
    Number of async queues (AMD)                  8
    Max real-time compute queues (AMD)            8
    Max real-time compute units (AMD)             60
  printf() buffer size                            4194304 (4MiB)
  Built-in kernels                                (n/a)
  Device Extensions                               cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  No platform
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   No platform
  clCreateContext(NULL, ...) [default]            No platform
  clCreateContext(NULL, ...) [other]              Success [AMD]
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  Success (1)
    Platform Name                                 AMD Accelerated Parallel Processing
    Device Name                                   gfx906:sramecc-:xnack-
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  Success (1)
    Platform Name                                 AMD Accelerated Parallel Processing
    Device Name                                   gfx906:sramecc-:xnack-
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  Success (1)
    Platform Name                                 AMD Accelerated Parallel Processing
    Device Name                                   gfx906:sramecc-:xnack-
root@00ec73e147bc:/#

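And that's how you can run OpenCL applications in an AMD GPU container!

Note that the xhost command is needed for both the NVIDIA and AMD GPU containers every time you want to run them from a new terminal.

If you happen to have multiple GPUs on a single system and want to be specific about which ones a container can use, you can do that too. Read on.

NVIDIA GPUs

Based on how clinfo reports NVIDIA GPU information, the GPUs are numbered 0, 1, 2, and so on in Docker. So, assuming you have three NVIDIA GPUs and want the container to see only GPU 0 (the first one), you must modify the corresponding command to: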
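
The full clinfo listing is long, so for a quick check it may be enough to count the devices it reports. The sketch below assumes the "Number of devices" line format shown in the output above; count_opencl_devices is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sum the "Number of devices" lines that clinfo prints (one per platform).
# Reads clinfo output on stdin and prints the total, or 0 if there are none.
# (count_opencl_devices is a hypothetical helper, introduced for illustration.)
count_opencl_devices() {
    awk '/^ *Number of devices/ { total += $NF } END { print total + 0 }'
}

# Only call clinfo if it is installed on this machine or container.
if command -v clinfo >/dev/null 2>&1; then
    echo "OpenCL devices found: $(clinfo | count_opencl_devices)"
else
    echo "clinfo is not installed here"
fi
```

Inside the container built above, this should report one device for a single-GPU system; a result of 0 suggests the device nodes were not passed through.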

docker run --rm -it --gpus device=0 -v /tmp/.X11-unix:/tmp/.X11-unix -e DISPLAY=$DISPLAY nvidia-opencl

AMD GPUs

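Similarly, based on how AMD GPU information is reported, the GPUs appear in Docker as /dev/dri/card0, /dev/dri/card1, /dev/dri/card2, and so on. So, assuming you have three AMD GPUs and want the container to see only the first one, use the following command instead: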

docker run --rm -it --device=/dev/kfd --device=/dev/dri/card0 --device=/dev/dri/renderD128 --group-add video --group-add render -v /tmp/.X11-unix:/tmp/.X11-unix -e DISPLAY=$DISPLAY amd-opencl

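In the command above, note that renderD128 corresponds to card0; both refer to the first AMD GPU. Along the same lines, renderD129 corresponds to card1 for the second AMD GPU, and so on. The renderD values are incremental, so for the third GPU, renderD130 corresponds to card2. You can learn more about these mappings by running the ls -l /dev/dri/by-path command.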
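
The card and renderD pairing described above can be turned into a small helper that builds the device flags for a given GPU index. This is only a sketch: amd_device_flags is a hypothetical name, and it assumes the cardN / renderD(128+N) numbering holds on your host, which is worth confirming with ls -l /dev/dri/by-path first.

```shell
#!/usr/bin/env bash
# Build the --device flags for the N-th AMD GPU (0-based index).
# Assumption: cardN pairs with renderD(128+N), as described in the text;
# verify this on your own system before relying on it.
# (amd_device_flags is a hypothetical helper, introduced for illustration.)
amd_device_flags() {
    local index="$1"
    echo "--device=/dev/kfd --device=/dev/dri/card${index} --device=/dev/dri/renderD$((128 + index))"
}

# Flags for the first and the third GPU.
amd_device_flags 0
amd_device_flags 2
```

The output can then be spliced into the run command, for example: docker run --rm -it $(amd_device_flags 1) --group-add video --group-add render amd-opencl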
