CUDA Toolkit Documentation v10.0.130

White Papers

Floating Point and IEEE 754
A number of issues related to floating point accuracy and compliance are
a frequent source of confusion on both CPUs and GPUs. The purpose of this
white paper is to discuss the most common issues related to NVIDIA GPUs
and to supplement the documentation in the CUDA C++ Programming Guide.
Incomplete-LU and Cholesky Preconditioned Iterative Methods
In this white paper we show how to use the
cuSPARSE and cuBLAS libraries to achieve a 2x speedup over CPU in the
incomplete-LU and Cholesky preconditioned iterative methods. We focus on
the Bi-Conjugate Gradient Stabilized and Conjugate Gradient iterative
methods, which can be used to solve large sparse nonsymmetric and
symmetric positive definite linear systems, respectively. We also
comment on the parallel sparse triangular solve, an essential
building block in these algorithms.

3. Building a Docker Image with CUDA 10.X and cuDNN 7.6

The project provides two kinds of Dockerfile: runtime and devel. Images tagged runtime are used to run already-built projects on the GPU, while images tagged devel are used to build projects from source with GPU support (the difference between them is described in detail in the corresponding repository).

To build a Docker image with CUDA 10.2 and cuDNN 7.6, run the following in a terminal from the project folder (-f: use the specified file as the Dockerfile for the build; -t: the image tag and its version; the trailing dot: the build context, meaning all files for the image are in the current directory):

sudo docker build -f Dockerfile_cuda10.2_runtime -t cuda10.2_cudnn7.6:runtime .

Ubuntu 19.10 is used as the base OS for the image.

For building Docker images with other CUDA versions, the project contains the corresponding Dockerfiles:

  • CUDA 10.0: Dockerfile_cuda10.0_runtime и Dockerfile_cuda10.0_devel
  • CUDA 10.1: Dockerfile_cuda10.1_runtime и Dockerfile_cuda10.1_devel
  • CUDA 10.2: Dockerfile_cuda10.2_runtime и Dockerfile_cuda10.2_devel

After a successful build, you can check that the image works as follows (--gpus all: give the container access to all GPUs on the host machine; -t: attach a terminal; -i: interactive mode; --rm: remove the container when it exits):

sudo docker run --gpus all -ti --rm cuda10.2_cudnn7.6:runtime nvidia-smi

The result should be the same as the nvidia-smi output shown in step 1.

Note: the size of the built Docker image with CUDA and cuDNN is 1.3-1.8 GB for runtime images and 3.1-3.8 GB for devel images.

Building a Custom Docker Image with CUDA 10.X and cuDNN 7.6

To build any custom Docker image with the CUDA and cuDNN libraries you need to:

  1. Change the base image from which the custom image is built to one of the images created earlier (a minimal sketch follows below).
  2. Make sure the source code of the project that runs in the Docker container detects and uses the GPU.
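
A minimal sketch of such a Dockerfile, assuming a Python project (the file names and dependencies are placeholders):

FROM cuda10.2_cudnn7.6:runtime

# Copy the project and install its dependencies (adjust to your project).
WORKDIR /app
COPY . /app
RUN apt-get update && apt-get install -y python3 python3-pip && \
    pip3 install -r requirements.txt

# The project code itself must detect and use the GPU (step 2 above).
CMD ["python3", "main.py"]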

For the Docker container to have access to the video card, pass the --gpus parameter when starting the image, for example:

sudo docker run --gpus all -ti --rm my_image:version

The --gpus all parameter gives the container access to all video cards on the host machine at once. To specify how many video cards to use, or which ones, pass --gpus 2 (use the first two video cards) or --gpus '"device=1,2"' (use the second and third cards) instead of all, for example:

sudo docker run --gpus '"device=1,2"' -ti --rm my_image:version

More details about the --gpus parameter can be found in the corresponding repository.

Miscellaneous

CUDA Samples
This document contains a complete listing of the code samples that are
included with the NVIDIA CUDA Toolkit. It describes each code sample,
lists the minimum GPU specification, and provides links to the source
code and white papers if available.
CUDA Demo Suite
This document describes the demo applications shipped with the CUDA Demo Suite.
CUDA on WSL
This guide is intended to help users
get started with using NVIDIA CUDA on Windows Subsystem for Linux (WSL 2).
The guide covers installation and running CUDA applications and containers
in this environment.
Multi-Instance GPU (MIG)
This edition of the user guide describes the Multi-Instance GPU feature of the NVIDIA A100 GPU.
CUDA Compatibility
This document describes CUDA Compatibility, including CUDA Enhanced Compatibility and CUDA Forward Compatible Upgrade.
CUPTI
The CUPTI-API. The CUDA Profiling Tools Interface (CUPTI)
enables the creation of profiling and tracing tools that target CUDA applications.
Debugger API
The CUDA debugger API.
GPUDirect RDMA
A technology introduced in Kepler-class GPUs and CUDA 5.0,
enabling a direct path for communication between the GPU and a third-party peer
device on the PCI Express bus when the devices share the same upstream
root complex using standard features of PCI Express. This document
introduces the technology and describes the steps necessary to enable a
GPUDirect RDMA connection to NVIDIA GPUs within the Linux device
driver model.
vGPU
vGPUs that support CUDA.

Software Requirements

The following tables highlight the compatibility of cuDNN versions with
the various supported OS versions.

Refer to the following tables to view the list of supported Linux versions for cuDNN.

Linux versions for cuDNN 8.0.5 release

Architecture   OS      Version       Kernel   GCC     Glibc
x86_64         RHEL    7.8           3.10.0   4.8.5   2.17
x86_64         RHEL    8.2           4.18     8.3.1   2.28
x86_64         Ubuntu  20.04         5.4.0    9.3.0   2.32
x86_64         Ubuntu  18.04.5 LTS   4.15.0   8.2.0   2.27
x86_64         Ubuntu  16.04.6 LTS   4.5.0    5.4.0   2.23
ppc64le        RHEL    8.2           4.18     8.3.1   2.28
AArch64 SBSA   RHEL    8             4.18     8.3.0   2.28
AArch64 SBSA   Ubuntu  18.04         4.15     8.3.0   2.27
AArch64        Ubuntu  18.04         4.15     7.3.1   2.27

Linux versions for cuDNN 8.0.4 release

Architecture   OS      Version       Kernel   GCC     Glibc
x86_64         RHEL    7.8           3.10.0   4.8.5   2.17
x86_64         RHEL    8.2           4.18     8.3.1   2.28
x86_64         Ubuntu  18.04.4 LTS   4.15.0   8.2.0   2.27
x86_64         Ubuntu  16.04.6 LTS   4.5.0    5.4.0   2.23
ppc64le        RHEL    8.2           4.18     8.3.1   2.28
ppc64le        Ubuntu  18.04.4 LTS   4.4.0    5.4.0   2.27
AArch64 SBSA   RHEL    8             4.18     8.3.0   2.28
AArch64 SBSA   Ubuntu  18.04         4.15     8.3.0   2.27
AArch64        Ubuntu  18.04

Linux versions for cuDNN 8.0.2 — 8.0.3 releases

Architecture   OS      Version       Kernel   GCC     Glibc
x86_64         RHEL    7.6           3.10.0   4.8.5   2.17
x86_64         RHEL    8.1           4.18     8.3.1   2.28
x86_64         Ubuntu  18.04.4 LTS   4.15.0   8.2.0   2.27
x86_64         Ubuntu  16.04.6 LTS   4.5.0    5.4.0   2.23
ppc64le        Ubuntu  18.04.4 LTS   4.4.0    5.4.0   2.27
ppc64le        RHEL    7.6
ppc64le        RHEL    8.1
AArch64        Ubuntu  18.04

Linux versions for cuDNN 8.0.0 — 8.0.1 Preview releases

Architecture   OS      Version       Kernel   GCC     Glibc
x86_64         RHEL    7.6           3.10.0   4.8.5   2.17
x86_64         RHEL    8.1           4.18     8.3.1   2.28
x86_64         Ubuntu  18.04.3 LTS   4.15.0   8.2.0   2.27
x86_64         Ubuntu  16.04.6 LTS   4.5.0    5.4.0   2.23
ppc64le        Ubuntu  18.04.3 LTS   4.4.0    5.4.0   2.27
ppc64le        RHEL    7.6
ppc64le        RHEL    8.1
AArch64        Ubuntu  18.04

Linux versions for cuDNN 7.6.4 — 7.6.5 releases

Architecture   OS      Version       Kernel   GCC     Glibc
x86_64         RHEL    7.6           3.10.0   4.8.5   2.17
x86_64         Ubuntu  18.04.3 LTS   4.15.0   8.2.0   2.27
x86_64         Ubuntu  16.04.6 LTS   4.5.0    5.4.0   2.23
ppc64le        Ubuntu  18.04.3 LTS   4.4.0    5.4.0   2.27
ppc64le        RHEL    7.6
AArch64        Ubuntu  18.04

Dependencies

Some CUDA Samples rely on third-party applications and/or libraries, or features provided by the CUDA Toolkit and Driver, to either build or execute. These dependencies are listed below.

If a sample has a third-party dependency that is available on the system, but is not installed, the sample will waive itself at build time.

Each sample’s dependencies are listed in its README’s Dependencies section.

Third-Party Dependencies

These third-party dependencies are required by some CUDA samples. If available, these dependencies are either installed on your system automatically, or are installable via your system’s package manager (Linux) or a third-party website.

DirectX

DirectX is a collection of APIs designed to allow development of multimedia applications on Microsoft platforms. For Microsoft platforms, NVIDIA's CUDA Driver supports DirectX. Several CUDA Samples for Windows demonstrate CUDA-DirectX interoperability; to build such samples, install Microsoft Visual Studio 2012 or higher, which provides the Microsoft Windows SDK for Windows 8.

OpenGL

OpenGL is a graphics library used for 2D and 3D rendering. On systems which support OpenGL, NVIDIA’s OpenGL implementation is provided with the CUDA Driver.

OpenGL ES

OpenGL ES is an embedded systems graphics library used for 2D and 3D rendering. On systems which support OpenGL ES, NVIDIA’s OpenGL ES implementation is provided with the CUDA Driver.

X11

X11 is a windowing system commonly found on *-nix style operating systems. X11 can be installed using your Linux distribution’s package manager, and comes preinstalled on Mac OS X systems.

EGL

EGL is an interface between Khronos rendering APIs (such as OpenGL, OpenGL ES or OpenVG) and the underlying native platform windowing system.

EGLSync

EGLSync is a set of EGL extensions that provide sync objects: synchronization primitives representing events whose completion can be tested or waited upon.

NVSCI

NvSci is a set of communication interface libraries out of which CUDA interops with NvSciBuf and NvSciSync. NvSciBuf allows applications to allocate and exchange buffers in memory. NvSciSync allows applications to manage synchronization objects which coordinate when sequences of operations begin and end.

CUDA Features

These CUDA features are needed by some CUDA samples. They are provided by either the CUDA Toolkit or CUDA Driver. Some features may not be available on your system.

CUFFT Callback Routines

CUFFT Callback Routines are user-supplied kernel routines that CUFFT will call when loading or storing data. These callback routines are only available on Linux x86_64 and ppc64le systems.

CUDA Dynamic Parallelism

CDP (CUDA Dynamic Parallelism) allows kernels to be launched from threads running on the GPU. CDP is only available on GPUs with SM architecture of 3.5 or above.
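
As a minimal sketch (file and kernel names are illustrative), a parent kernel can launch a child kernel directly from device code; CDP requires relocatable device code and linking against the device runtime:

#include <cstdio>

// Child kernel, launched from the GPU rather than the host.
__global__ void child(int parentBlock) {
    printf("child launched by parent block %d\n", parentBlock);
}

// Parent kernel: one thread per block launches a nested grid.
__global__ void parent() {
    if (threadIdx.x == 0) {
        child<<<1, 1>>>(blockIdx.x);
    }
}

int main() {
    parent<<<2, 32>>>();
    cudaDeviceSynchronize();   // wait for the parent and nested child grids
    return 0;
}

// Build (sketch): nvcc -arch=sm_35 -rdc=true cdp_demo.cu -lcudadevrt -o cdp_demo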

Multi-block Cooperative Groups

Multi Block Cooperative Groups (MBCG) extends Cooperative Groups and the CUDA programming model to express inter-thread-block synchronization. MBCG is available on GPUs with Pascal and higher architectures.
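
A minimal sketch of grid-wide synchronization (kernel and variable names are illustrative; build with something like nvcc -arch=sm_60 -rdc=true, and note that not every device supports cooperative launch):

#include <cstdio>
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Two-phase kernel: phase 2 may only start once every block finished phase 1.
__global__ void neighborSum(int* data, int* out, int n) {
    cg::grid_group grid = cg::this_grid();
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = i;                           // phase 1: every thread writes
    grid.sync();                                      // grid-wide barrier across all blocks
    if (i < n) out[i] = data[i] + data[(i + 1) % n];  // phase 2: read any phase-1 result
}

int main() {
    int n = 256;
    int *data, *out;
    cudaMallocManaged(&data, n * sizeof(int));
    cudaMallocManaged(&out, n * sizeof(int));
    void* args[] = { &data, &out, &n };
    // Grid-wide sync requires a cooperative launch instead of the <<<...>>> syntax;
    // the grid must be small enough for all blocks to be resident at once.
    cudaLaunchCooperativeKernel((void*)neighborSum, dim3(2), dim3(128), args, 0, 0);
    cudaDeviceSynchronize();
    printf("out[0] = %d\n", out[0]);                  // 0 + 1 = 1
    cudaFree(data); cudaFree(out);
    return 0;
}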

Multi-Device Cooperative Groups

Multi Device Cooperative Groups extends Cooperative Groups and the CUDA programming model enabling thread blocks executing on multiple GPUs to cooperate and synchronize as they execute. This feature is available on GPUs with Pascal and higher architecture.

CUSOLVER

The cuSOLVER library is a high-level package based on the cuBLAS and cuSPARSE libraries. It combines three separate libraries under a single umbrella, each of which can be used independently or in concert with other toolkit libraries. The intent of cuSOLVER is to provide useful LAPACK-like features, such as common matrix factorization and triangular solve routines for dense matrices, a sparse least-squares solver, and an eigenvalue solver. In addition, cuSOLVER provides a new refactorization library useful for solving sequences of matrices with a shared sparsity pattern.

NVJPEG

NVJPEG library provides high-performance, GPU accelerated JPEG decoding functionality for image formats commonly used in deep learning and hyperscale multimedia applications.

Stream Priorities

Stream Priorities allows the creation of streams with specified priorities. Stream Priorities is only available on GPUs with SM architecture of 3.5 or above.

Unified Virtual Memory

UVM (Unified Virtual Memory) enables memory that can be accessed by both the CPU and GPU without explicit copying between the two. UVM is only available on Linux and Windows systems.
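
For instance, a minimal sketch using managed memory (names illustrative):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void increment(int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main() {
    const int n = 256;
    int* data;
    // One allocation visible to both CPU and GPU; no explicit cudaMemcpy needed.
    cudaMallocManaged(&data, n * sizeof(int));
    for (int i = 0; i < n; ++i) data[i] = i;

    increment<<<(n + 127) / 128, 128>>>(data, n);
    cudaDeviceSynchronize();   // wait before the CPU touches the data again

    printf("data[0] = %d, data[255] = %d\n", data[0], data[255]);   // 1, 256
    cudaFree(data);
    return 0;
}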

16-bit Floating Point

FP16 is a 16-bit floating-point format. One bit is used for the sign, five bits for the exponent, and ten bits for the mantissa.
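
A minimal sketch using the cuda_fp16.h intrinsics (native FP16 arithmetic in device code requires SM 5.3 or higher; names illustrative):

#include <cstdio>
#include <cuda_fp16.h>

__global__ void haddKernel(const __half* a, const __half* b, __half* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = __hadd(a[i], b[i]);   // native half-precision add
}

int main() {
    const int n = 8;
    __half h_a[n], h_b[n], h_c[n];
    for (int i = 0; i < n; ++i) {
        h_a[i] = __float2half(1.5f);        // 1 sign, 5 exponent, 10 mantissa bits
        h_b[i] = __float2half(2.25f);
    }
    __half *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, n * sizeof(__half));
    cudaMalloc(&d_b, n * sizeof(__half));
    cudaMalloc(&d_c, n * sizeof(__half));
    cudaMemcpy(d_a, h_a, n * sizeof(__half), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, n * sizeof(__half), cudaMemcpyHostToDevice);
    haddKernel<<<1, n>>>(d_a, d_b, d_c, n);
    cudaMemcpy(h_c, d_c, n * sizeof(__half), cudaMemcpyDeviceToHost);
    printf("%f\n", __half2float(h_c[0]));   // expect 3.75
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}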

Turing Compatibility

This application note, Turing Compatibility Guide for CUDA
Applications
, is intended to help developers ensure that their
NVIDIA CUDA applications
will run on GPUs based on the NVIDIA Turing
Architecture. This document provides guidance to developers who are
already familiar with programming in CUDA C++ and want to make sure
that their software applications are compatible with Turing.

The NVIDIA CUDA C++ compiler, nvcc, can be used to
generate both architecture-specific cubin files and
forward-compatible PTX versions of each kernel. Each cubin
file targets a specific compute-capability version and is
forward-compatible only with GPU architectures of the same major
version number. For example, cubin files that target compute
capability 3.0 are supported on all compute-capability 3.x (Kepler)
devices but are not supported on compute-capability 5.x (Maxwell) or 6.x
(Pascal) devices. For this reason, to ensure forward compatibility
with GPU architectures introduced after the application has been
released, it is recommended that all applications include PTX versions
of their kernels.
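
For example, a build that embeds Volta and Turing cubins plus Turing PTX for forward compatibility might look like this (file names illustrative):

nvcc -gencode arch=compute_70,code=sm_70 \
     -gencode arch=compute_75,code=sm_75 \
     -gencode arch=compute_75,code=compute_75 \
     -o app app.cu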

Note: CUDA Runtime applications containing both cubin and PTX code for
a given architecture will automatically use the cubin by default,
keeping the PTX path strictly for forward-compatibility purposes.

Applications that already include PTX versions of their kernels
should work as-is on Turing-based GPUs. Applications that only support
specific GPU architectures via cubin files, however, will need to be
updated to provide Turing-compatible PTX or cubins.

The Turing architecture is based on Volta's Instruction Set Architecture (ISA 7.0),
extending it with new instructions. As a consequence, any binary that runs on Volta will be
able to run on Turing (forward compatibility), but a Turing binary will not be able to run on Volta.
Please note that Volta kernels using more than 64KB of shared memory (via the explicit opt-in,
see CUDA C++ Programming Guide) will not be able to launch on Turing, as they would exceed
Turing’s shared memory capacity.

Most applications compiled for Volta should run efficiently on Turing, except if the application
makes heavy use of the Tensor Cores, or if recompiling would allow use of new Turing-specific instructions.
Volta's Tensor Core instructions can reach only half of peak performance on Turing.
Recompiling explicitly for Turing is therefore recommended.

The first step is to check that Turing-compatible device code (at
least PTX) is compiled into the application. The following sections
show how to accomplish this for applications built with different CUDA
Toolkit versions.

CUDA applications built using CUDA Toolkit versions 2.1 through
8.0 are compatible with Turing as long as they are built to include
PTX versions of their kernels. To test that PTX JIT is working for
your application, you can do the following:

  • Download and install the latest driver from http://www.nvidia.com/drivers.
  • Set the environment variable
    CUDA_FORCE_PTX_JIT=1.
  • Launch your application.
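
In a shell this might look as follows (the application name is a placeholder):

export CUDA_FORCE_PTX_JIT=1
./my_cuda_app        # runs correctly only if PTX is embedded and JIT succeeds
unset CUDA_FORCE_PTX_JIT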

When starting a CUDA application for the first time with the above
environment flag, the CUDA driver will JIT-compile the PTX for each
CUDA kernel that is used into native cubin code.

If you set the environment variable above and then launch your
program and it works properly, then you have successfully verified
Turing compatibility.

Note: Be sure to unset the CUDA_FORCE_PTX_JIT environment variable
when you are done testing.

CUDA applications built using CUDA Toolkit 9.x are
compatible with Turing as long as they are built to include
kernels in either Volta-native cubin format or PTX format, or both.

CUDA applications built using CUDA Toolkit 10.0 are
compatible with Turing as long as they are built to include
kernels in Volta-native or Turing-native cubin format, or PTX format, or both.

Cross-compiling cuDNN Samples

This section describes how to cross-compile cuDNN
samples.

Follow the steps below to cross-compile cuDNN samples on NVIDIA DRIVE OS
Linux.

Procedure

  1. Download the Ubuntu package: cuda*ubuntu*_amd64.deb
  2. Download the cross compile package:
    cuda*-cross-aarch64*_all.deb
  3. Execute the following commands:
    sudo dpkg -i cuda*ubuntu*_amd64.deb
    sudo apt-get update
    sudo apt-get install cuda-toolkit-x-x -y
    sudo apt-get install cuda-cross-aarch64* -y

Procedure

  1. Download cuDNN Ubuntu package for your preferred CUDA Toolkit version:
    *libcudnn7-cross-aarch64_*.deb
  2. Download the cross compile package:
    libcudnn7-dev-cross-aarch64_*.deb
  3. Execute the following commands:
    sudo dpkg -i *libcudnn7-cross-aarch64_*.deb
    sudo dpkg -i libcudnn7-dev-cross-aarch64_*.deb

Procedure

  1. Copy the cudnn_samples_v7 directory to your home
    directory:
    $ cp -r /usr/src/cudnn_samples_v7 $HOME
  2. For each sample, execute the following commands:
    $ cd $HOME/cudnn_samples_v7/(each sample)
    $ make TARGET_ARCH=aarch64
    

CUDA API References

CUDA Runtime API
The CUDA runtime API. Fields in structures might appear in an order that is different from the order of declaration.
CUDA Driver API
The CUDA driver API. Fields in structures might appear in an order that is different from the order of declaration.
CUDA Math API
The CUDA math API.
cuBLAS
The cuBLAS library is an implementation of BLAS (Basic Linear Algebra Subprograms) on top of the NVIDIA CUDA runtime. It allows
the user to access the computational resources of NVIDIA Graphical Processing Unit (GPU), but does not auto-parallelize across
multiple GPUs.
NVBLAS
The NVBLAS library is a multi-GPUs accelerated drop-in BLAS (Basic Linear Algebra Subprograms) built on top of the NVIDIA
cuBLAS Library.
nvJPEG
The nvJPEG Library provides high-performance GPU accelerated JPEG
decoding functionality for image formats commonly used in deep learning and hyperscale
multimedia applications.
cuFFT
The cuFFT library user guide.
CUB
The user guide for CUB.
CUDA C++ Standard
The API reference for libcu++, the CUDA C++ standard library.
nvGRAPH
The nvGRAPH library user guide.
cuRAND
The cuRAND library user guide.
cuSPARSE
The cuSPARSE library user guide.
NPP
NVIDIA NPP is a library of functions for performing CUDA accelerated
processing. The initial set of functionality in the library focuses on
imaging and video processing and is widely applicable for developers in
these areas. NPP will evolve over time to encompass more of the compute
heavy tasks in a variety of problem domains. The NPP library is written
to maximize flexibility, while maintaining high performance.
NVRTC (Runtime Compilation)
NVRTC is a runtime compilation library for CUDA C++.
It accepts CUDA C++ source code in character string form and creates
handles that can be used to obtain the PTX.
The PTX string generated by NVRTC can be loaded by cuModuleLoadData and
cuModuleLoadDataEx, and linked with other modules by cuLinkAddData of
the CUDA Driver API.
This facility can often provide optimizations and performance not
possible in a purely offline static compilation.
Thrust
The Thrust getting started guide.
cuSOLVER
The cuSOLVER library user guide.

1. Installing the NVIDIA Video Card Driver

  • CUDA 10.0: driver version 410.48 or higher
  • CUDA 10.1: driver version 418.39 or higher
  • CUDA 10.2: driver version 440.33 or higher

The required driver version can be installed in one of the following ways:

  1. Use the install_nvidia-driver.sh script, which takes the driver version as an argument (if no argument is passed, the default version is installed):
sudo ./install_nvidia-driver.sh
  2. Run the following manually in a terminal:
sudo add-apt-repository -y ppa:graphics-drivers/ppa
sudo apt-get -y update
sudo apt-get -y install nvidia-driver-440

After installation, reboot the host machine. To check that the driver has been installed successfully, run nvidia-smi in a terminal. The result should look approximately like this:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.59       Driver Version: 440.59       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:08:00.0 Off |                  N/A |
| 39%   42C    P0    18W / 250W |      0MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Getting Started

Getting started with running CUDA on WSL requires you to complete these steps in order:

Install the latest builds from the Microsoft Windows Insider Program

  • Register for the Microsoft Windows Insider Program.

  • Install the latest Windows Insider build.

    Note:

    Ensure that you install Build version 20145 or higher.

    You can check your build version number by running winver via the Windows Run command.

  • Download the NVIDIA Driver from the download section on the CUDA on WSL page.
    Choose the appropriate driver depending on the type of NVIDIA GPU in your system (GeForce or Quadro).

  • Install the driver using the executable. This is the only driver you need to
    install.

  • The DirectX WSL driver is installed automatically along with the other driver
    components, so no additional action is needed for installation. This driver
    enables graphics on WSL 2.0 by supporting DX12 APIs. TensorFlow with DirectML
    support on WSL will get NVIDIA GPU hardware acceleration for training and
    inference workloads. There are no present display capabilities in WSL, hence
    the driver is oriented towards compute/machine-learning tasks. For some helpful
    examples, see https://docs.microsoft.com/en-us/windows/win32/direct3d12/gpu-tensorflow-wsl.

Note:

Do not install any Linux display driver in WSL. The Windows Display Driver
will install both the regular driver components for native Windows and for
WSL support.

Demos

Below are the demos within the demo suite.

deviceQuery

This application enumerates the properties of the CUDA devices present in the system and displays them in a human-readable format.

vectorAdd

This application is a very basic demo that implements element-by-element vector addition.

bandwidthTest

This application measures the memcopy bandwidth of the GPU and memcpy bandwidth across PCI-e. It is capable
of measuring device-to-device copy bandwidth, host-to-device copy bandwidth for pageable and page-locked memory, and device-to-host
copy bandwidth for pageable and page-locked memory.

Arguments:

Usage:  bandwidthTest ...
Test the bandwidth for device to host, host to device, and device to device transfers

Example:  measure the bandwidth of device to host pinned memory copies in the range 1024 Bytes to 102400 Bytes in 1024 Byte increments
./bandwidthTest --memory=pinned --mode=range --start=1024 --end=102400 --increment=1024 --dtoh
Option                 Explanation

--help                 Display this help menu
--csv                  Print results as a CSV
--device=all           Compute cumulative bandwidth on all the devices
--device=0,1,2,...,n   Specify a particular device to be used
--memory=pageable      Use pageable memory
--memory=pinned        Use non-pageable (page-locked) system memory
--mode=quick           Perform a quick measurement
--mode=range           Measure a user-specified range of values
--mode=shmoo           Perform an intense shmoo of a large range of values
--htod                 Measure host to device transfers
--dtoh                 Measure device to host transfers
--dtod                 Measure device to device transfers
--wc                   Allocate pinned memory as write-combined
--cputiming            Force CPU-based timing always

Range mode options

--start=<bytes>        Starting transfer size in bytes
--end=<bytes>          Ending transfer size in bytes
--increment=<bytes>    Increment size in bytes

busGrind

Provides detailed statistics about peer-to-peer memory bandwidth amongst GPUs present in the system, as well as pinned and
unpinned memory bandwidth.

Arguments:

Option   Explanation

-h       Print usage
-p       Enable or disable pinned memory tests (default on)
-u       Enable or disable unpinned memory tests (default off)
-e       Enable or disable P2P-enabled memory tests (default on)
-d       Enable or disable P2P-disabled memory tests (default off)
-a       Enable all tests
-n       Disable all tests

The order of parameters matters.
Examples:
    ./BusGrind -n -p 1 -e 1   Run all pinned and P2P tests
    ./BusGrind -n -u 1        Runs only unpinned tests
    ./BusGrind -a             Runs all tests (pinned, unpinned, p2p enabled, p2p disabled)

nbody

This demo does an efficient all-pairs simulation of a gravitational n-body system in CUDA. It scales the n-body simulation
across multiple GPUs in a single PC if available. Adding "-numbodies=num_of_bodies" to the command line allows users to
set the number of bodies for the simulation. Adding "-numdevices=N" to the command line causes the sample to use N devices
(if available) for the simulation. In this mode, the position and velocity data for all bodies are read from system memory using
"zero copy" rather than from device memory. For a small number of devices (4 or fewer) and a large enough number of bodies,
bandwidth is not a bottleneck, so we can achieve strong scaling across these devices.

Arguments:

Option            Explanation

-fullscreen       Run the n-body simulation in fullscreen mode
-fp64             Use double-precision floating-point values for the simulation
-hostmem          Store simulation data in host memory
-benchmark        Run a benchmark to measure performance
-numbodies=N      Number of bodies (>= 1) to run in the simulation
-device=d         Where d=0,1,2... is the CUDA device to use
-numdevices=i     Where i is the number of CUDA devices (> 0) to use for the simulation
-compare          Compare simulation results running once on the default GPU and once on the CPU
-cpu              Run the n-body simulation on the CPU
-tipsy=file.bin   Load a tipsy model file for simulation

oceanFFT

This is a graphical demo which simulates an ocean height field using the CUFFT library, and renders the result using OpenGL.

The following keys can be used to control the output:

Key   Function

w     Toggle wireframe

Building CUDA Samples

Windows

The Windows samples are built using the Visual Studio IDE. Solution files (.sln) are provided for each supported version of Visual Studio.

Complete solution files covering all samples exist in the parent directory of the repo, and each individual sample has its own set of solution files in its own directory.

To build or examine all the samples at once, use the complete solution files; to build or examine a single sample, use that sample's individual solution files.

Linux

The Linux samples are built using makefiles. To use the makefiles, change the current directory to the sample directory you wish to build and run make.

The samples makefiles can take advantage of certain options:

  • dbg=1: build with debug symbols

  • SMS="A B ...": override the SM architectures for which the sample will be built, where "A B ..." is a space-delimited list of SM architectures. For example, to generate SASS for SM 50 and SM 60, use SMS="50 60".


Forward-Compatible Upgrade Path

To meet the minimum requirements mentioned in Section 1.3, the upgrade path for CUDA
usually involves upgrades to both the CUDA Toolkit and NVIDIA driver as shown in the
figure below.

Figure 3. CUDA Upgrade Path

Starting with CUDA 10.0, NVIDIA introduced a new forward-compatible
upgrade path that allows the kernel mode components on the system to remain
untouched, while the CUDA driver is upgraded. See Figure 3. This allows the use
of newer toolkits on existing system installations, providing improvements and
features of the latest CUDA while minimizing the risks associated with new
driver deployments. This upgrade path is achieved through new packages provided
by CUDA.

Figure 4. Forward Compatibility Upgrade Path

CUDA and the C language:

  1. Function specifiers, which indicate how and from where functions are executed.
  2. Variable specifiers, which indicate the type of GPU memory to be used.
  3. GPU kernel launch specifiers.
  4. Built-in variables for identifying threads, blocks, and other parameters when executing code in a GPU kernel.
  5. Additional variable types.
  • __host__: executed on the CPU, called from the CPU (in principle this specifier can be omitted).
  • __global__: executed on the GPU, called from the CPU.
  • __device__: executed on the GPU, called from the GPU.
  • gridSize: the dimensions of the grid of blocks (dim3) allocated for the computation.
  • blockSize: the size of a block (dim3) allocated for the computation.
  • sharedMemSize: the amount of additional shared memory to allocate at kernel launch.
  • cudaStream: a cudaStream_t variable specifying the stream in which the call is made.
  • gridDim: the dimensions of the grid, of type dim3; lets you find out the size of the grid allocated for the current kernel call.
  • blockDim: the dimensions of a block, also of type dim3; lets you find out the size of the block allocated for the current kernel call.
  • blockIdx: the index of the current block in the GPU computation, of type uint3.
  • threadIdx: the index of the current thread in the GPU computation, of type uint3.
  • warpSize: the warp size, of type int (I have not yet tried using it myself).
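
A minimal sketch tying these pieces together, with a __global__ kernel launched via the <<<gridSize, blockSize, sharedMemSize, cudaStream>>> syntax (kernel and variable names are illustrative):

#include <cstdio>
#include <cuda_runtime.h>

// __global__: executed on the GPU, called from the CPU.
__global__ void scale(float* data, float k, int n) {
    // Built-in variables identify this thread within the launch.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= k;
}

int main() {
    const int n = 1024;
    float host[1024];
    for (int i = 0; i < n; ++i) host[i] = 1.0f;

    float* dev;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);

    dim3 blockSize(256);                                  // threads per block
    dim3 gridSize((n + blockSize.x - 1) / blockSize.x);   // blocks in the grid
    // Launch syntax: kernel<<<gridSize, blockSize, sharedMemSize, cudaStream>>>(...)
    scale<<<gridSize, blockSize, 0, 0>>>(dev, 2.0f, n);
    cudaDeviceSynchronize();

    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("host[0] = %f\n", host[0]);                    // expect 2.0
    cudaFree(dev);
    return 0;
}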

Quick guide

Test configuration

Testing was carried out on the following configuration:

Server:
Ubuntu 16.04, GeForce GTX 660

Client:
A virtual machine with Ubuntu 16.04 on a laptop without a discrete video card.

Obtaining rCUDA

This is the hardest step. Unfortunately, at the moment the only way to get your own copy of this framework is to fill out the request form on the official website. The developers promise to reply within 1-2 days, though; in my case the distribution arrived the same day.

Installing CUDA

First, the CUDA Toolkit must be installed on the server and on the client (even if the client has no NVIDIA video card). You can download it from the official site or use a repository; the main thing is to use a version no higher than 8. This example uses the .run installer from the official site.

Important! On the client, decline the installation of the NVIDIA driver. By default the CUDA Toolkit will be available at /usr/local/cuda/

Install the CUDA Samples; you will need them.

Installing rCUDA

Unpack the archive received from the developers into the home directory on the server and on the client.

These steps must be performed both on the server and on the client.

Client setup

Open a terminal on the client in which we will later run CUDA code. On the client side we need to "substitute" the standard CUDA libraries with the rCUDA libraries, by adding the corresponding paths to the LD_LIBRARY_PATH environment variable. We also need to specify the number of servers and their addresses (in my example there is one).
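
A sketch of what this setup might look like, assuming the archive was unpacked to ~/rCUDA and the server's address is 192.168.1.100; check the paths and variable names against the documentation shipped with your rCUDA distribution:

export LD_LIBRARY_PATH="$HOME/rCUDA/lib:$LD_LIBRARY_PATH"   # rCUDA libs shadow the CUDA ones
export RCUDA_DEVICE_COUNT=1                  # one remote GPU server
export RCUDA_DEVICE_0=192.168.1.100          # address of server 0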

Building and running

Let's try to build and run a few examples.

Example 1

Let's start simple, with deviceQuery, an example that just prints the parameters of the CUDA-compatible device, which in our case is the remote GTX 660.

Important! Without EXTRA_NVCCFLAGS=--cudart=shared it will not work.

Replace the path with the one you specified for the CUDA Samples when installing CUDA, then build and run the example.
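
A sketch of the build and run, assuming the samples were installed to the default path:

cd ~/NVIDIA_CUDA-8.0_Samples/1_Utilities/deviceQuery
make EXTRA_NVCCFLAGS=--cudart=shared
./deviceQuery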

If you did everything correctly, the result will look something like this:

The main thing we should see:

Excellent! We managed to build and run a CUDA application on a machine without a discrete video card, using a video card installed on a remote server.

Important! If the application output begins with lines like:

then you need to add the following lines to the file /etc/security/limits.conf on both the server and the client:

*    soft    memlock    unlimited
*    hard    memlock    unlimited

This grants all users (*) unlimited memory locking (memlock). Better still, replace * with the specific user that needs it, and choose less generous limits than unlimited.

Example 2

Now let's try something more interesting: testing an implementation of the dot product of vectors using shared memory and synchronization ("CUDA by Example" by J. Sanders and E. Kandrot, section 5.3.1).

In this example we compute the dot product of two vectors of dimension 33 * 1024, comparing the answer with the result obtained on the CPU.

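A condensed sketch of the kind of kernel being tested (not the verbatim book code): each block accumulates a partial dot product in shared memory and reduces it after __syncthreads(), and the host sums the per-block partials:

#include <cstdio>
#include <cuda_runtime.h>

const int N = 33 * 1024;
const int threadsPerBlock = 256;

__global__ void dot(const float* a, const float* b, float* partial) {
    __shared__ float cache[threadsPerBlock];
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    float temp = 0.0f;
    while (tid < N) {                        // grid-stride loop over both vectors
        temp += a[tid] * b[tid];
        tid += blockDim.x * gridDim.x;
    }
    cache[threadIdx.x] = temp;
    __syncthreads();                         // all partial sums written before reducing
    for (int i = blockDim.x / 2; i > 0; i /= 2) {   // binary reduction in shared memory
        if (threadIdx.x < i)
            cache[threadIdx.x] += cache[threadIdx.x + i];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        partial[blockIdx.x] = cache[0];      // one partial result per block
}

int main() {
    const int blocksPerGrid = 32;
    float *a, *b, *partial;
    cudaMallocManaged(&a, N * sizeof(float));
    cudaMallocManaged(&b, N * sizeof(float));
    cudaMallocManaged(&partial, blocksPerGrid * sizeof(float));
    for (int i = 0; i < N; ++i) { a[i] = (float)i; b[i] = 2.0f; }

    dot<<<blocksPerGrid, threadsPerBlock>>>(a, b, partial);
    cudaDeviceSynchronize();

    float gpu = 0.0f, cpu = 0.0f;
    for (int i = 0; i < blocksPerGrid; ++i) gpu += partial[i];   // finish on the CPU
    for (int i = 0; i < N; ++i) cpu += a[i] * b[i];
    printf("GPU %.1f vs CPU %.1f\n", gpu, cpu);
    return 0;
}
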
Build and run:

A result like this tells us that all is well:

Example 3

Let's run one more standard CUDA test, matrixMulCUBLAS (matrix multiplication).

What interests us:

Security

I did not find any mention of an authorization mechanism in the rCUDA documentation. I think the simplest thing to do at the moment is to open access to the required port (8308) only from a specific address.

With iptables this would look like the following:
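
A sketch, assuming the client's address is 192.168.1.50 (substitute your own):

sudo iptables -A INPUT -p tcp --dport 8308 -s 192.168.1.50 -j ACCEPT
sudo iptables -A INPUT -p tcp --dport 8308 -j DROP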

Beyond that, I leave the question of security outside the scope of this post.
