NVIDIA-DLI (HPC-2) Accelerating CUDA C++ Applications with Concurrent Streams

NVIDIA DLI course link:

Course Abstract

The concurrent overlap of GPU computation and the transfer of memory to and from the GPU can drastically improve the performance of CUDA applications. In this workshop you will learn to utilize CUDA Streams to perform copy/compute overlap in CUDA C++ applications by:

  • Learning the rules and syntax governing the use of concurrent CUDA Streams
  • Refactoring and optimizing an existing CUDA C++ application to use CUDA Streams and perform copy/compute overlap
  • Relying on the NVIDIA® Nsight™ Systems visual profiler timeline to observe improvement opportunities and the impact of the techniques covered in the workshop (see the example command below)

Upon completion, you will be able to build robust and efficient CUDA C++ applications that can leverage copy/compute overlap for significant performance gains.
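
For reference, a timeline report for Nsight Systems can be captured from the command line before opening it in the GUI. A typical invocation for an application binary (here assumed to be called ./cipher, after the workshop's cipher application) looks like:

nsys profile --stats=true -o cipher-report ./cipher

The generated report can then be opened in the Nsight Systems GUI to inspect how kernel execution and memory transfers line up on the timeline.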

Prerequisites

  • Professional experience programming CUDA C/C++ applications, including the use of the nvcc compiler, kernel launches, grid-stride loops, host-to-device and device-to-host memory transfers, and CUDA error handling.
  • Familiarity with the Linux command line.
  • Experience using Makefiles to compile C/C++ code.

Suggested Resources to Satisfy Prerequisites

Tools, Libraries, and Frameworks Used

  • CUDA C++
  • nvcc
  • Nsight Systems

Workshop Structure

  • Introduction
  • Using JupyterLab
  • Cipher Application Overview
  • Nsight Systems Setup
  • CUDA Streams
  • Kernel Launches in Non-Default Streams
  • Memory Copies in Non-Default Streams
  • Considerations for Copy/Compute Overlap

Introduction

Main Objective of this course:

  • Increase performance for single-node CUDA C/C++ applications using copy/compute overlap
  • Increase performance for single-node CUDA C/C++ applications using multiple GPUs
  • Increase performance for multi-node CUDA C/C++ applications using copy/compute overlap
  • Increase performance for multi-node CUDA C/C++ applications using multiple GPUs

Non-default stream creation, kernel launch, and destruction:

cudaStream_t stream;
cudaStreamCreate(&stream);                      // create a non-default stream
kernel<<<grid, block, 0, stream>>>(/* args */); // launch the kernel in that stream
cudaStreamDestroy(stream);                      // destroy the stream when no longer needed

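Building on the snippet above, the following is a minimal copy/compute overlap sketch (not from the course materials). It assumes a hypothetical grid-stride kernel doubleElements, a total size divisible by the number of streams, and pinned host memory allocated with cudaMallocHost (required for cudaMemcpyAsync to be truly asynchronous); error checking is omitted for brevity:

// Minimal copy/compute overlap sketch using multiple non-default streams.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void doubleElements(float *data, int n)
{
    // Grid-stride loop over this chunk (placeholder work).
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        data[i] *= 2.0f;
}

int main()
{
    const int N = 1 << 24;
    const int numStreams = 4;
    const int chunkSize = N / numStreams;   // assumes N % numStreams == 0

    float *h_data, *d_data;
    cudaMallocHost((void **)&h_data, N * sizeof(float)); // pinned host memory
    cudaMalloc((void **)&d_data, N * sizeof(float));
    for (int i = 0; i < N; ++i) h_data[i] = 1.0f;

    cudaStream_t streams[numStreams];
    for (int s = 0; s < numStreams; ++s)
        cudaStreamCreate(&streams[s]);

    // Each stream copies its chunk host-to-device, processes it, and copies it
    // back device-to-host. Work issued to different streams may overlap.
    for (int s = 0; s < numStreams; ++s) {
        const int offset = s * chunkSize;
        cudaMemcpyAsync(d_data + offset, h_data + offset,
                        chunkSize * sizeof(float), cudaMemcpyHostToDevice, streams[s]);
        doubleElements<<<32, 256, 0, streams[s]>>>(d_data + offset, chunkSize);
        cudaMemcpyAsync(h_data + offset, d_data + offset,
                        chunkSize * sizeof(float), cudaMemcpyDeviceToHost, streams[s]);
    }

    cudaDeviceSynchronize();                // wait for all streams to finish
    printf("h_data[0] = %f\n", h_data[0]);  // expect 2.0

    for (int s = 0; s < numStreams; ++s)
        cudaStreamDestroy(streams[s]);
    cudaFreeHost(h_data);
    cudaFree(d_data);
    return 0;
}

On the Nsight Systems timeline, the host-to-device copy for one chunk can then overlap with kernel execution and device-to-host copies for other chunks, which is the copy/compute overlap the workshop targets.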
https://evawyf.com/2021/08/03/nvidia-dli-concurrent-streams/

Author: Eva W.
Posted on: 2021-08-03
Updated on: 2021-08-13