NVIDIA-DLI (HPC-1) Fundamentals of Accelerated Computing C++
Source: NVIDIA Deep Learning Institute

  • Part 1: Accelerating Applications with CUDA C/C++
  • Part 2: Managing Accelerated Application Memory with CUDA C/C++ Unified Memory and nsys
  • Part 3: Asynchronous Streaming, and Visual Profiling for Accelerated Applications with CUDA C/C++

Part 1: Accelerating Applications with CUDA C/C++


Accelerated computing is replacing CPU-only computing as best practice. The litany of breakthroughs driven by accelerated computing, the ever-increasing demand for accelerated applications, programming conventions that ease writing them, and constant improvements in the hardware that supports them are all driving this inevitable transition.

At the center of accelerated computing’s success, both in terms of its impressive performance, and its ease of use, is the CUDA compute platform. CUDA provides a coding paradigm that extends languages like C, C++, Python, and Fortran, to be capable of running accelerated, massively parallelized code on the world’s most performant parallel processors: NVIDIA GPUs. CUDA accelerates applications drastically with little effort, has an ecosystem of highly optimized libraries for DNN, BLAS, graph analytics, FFT, and more, and also ships with powerful command line and visual profilers.
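As a minimal sketch of that paradigm (assuming a CUDA-capable GPU and the `nvcc` compiler; the kernel name is illustrative), a function annotated with `__global__` becomes a kernel that the host launches across many parallel threads:

```cuda
#include <cstdio>

// A kernel: runs on the GPU, executed once per thread.
__global__ void helloGPU()
{
    printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main()
{
    // Execution configuration: 2 blocks of 4 threads each (8 threads total).
    helloGPU<<<2, 4>>>();

    // Kernel launches are asynchronous; wait for the GPU to finish.
    cudaDeviceSynchronize();
    return 0;
}
```

Saved as `hello.cu`, this would be compiled and run with `nvcc -o hello hello.cu && ./hello`.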

CUDA supports many, if not most, of the world’s most performant applications in Computational Fluid Dynamics, Molecular Dynamics, Quantum Chemistry, Physics, and HPC.

Learning CUDA will enable you to accelerate your own applications. Accelerated applications perform much faster than their CPU-only counterparts, and make possible computations that would otherwise be prohibitively slow given the limited performance of CPU-only applications. In this lab you will receive an introduction to programming accelerated applications with CUDA C/C++, enough to begin accelerating your own CPU-only applications for performance gains, and to move into novel computational territory.


By the time you complete this lab, you will be able to:

  • Write, compile, and run C/C++ programs that both call CPU functions and launch GPU kernels.
  • Control parallel thread hierarchy using execution configuration.
  • Refactor serial loops to execute their iterations in parallel on a GPU.
  • Allocate and free memory available to both CPUs and GPUs.
  • Handle errors generated by CUDA code.
  • Accelerate CPU-only applications.
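Several of the objectives above can be sketched in one small program: a serial loop refactored into a kernel, Unified Memory shared by CPU and GPU, and a basic error check (a sketch, assuming a CUDA toolchain; the kernel name and sizes are illustrative):

```cuda
#include <cstdio>

__global__ void doubleElements(int *a, int n)
{
    // Each thread handles one iteration of what was a serial for-loop.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= 2;
}

int main()
{
    const int N = 1000;
    int *a;

    // Unified Memory: accessible from both host and device.
    cudaMallocManaged(&a, N * sizeof(int));
    for (int i = 0; i < N; ++i) a[i] = i;

    int threads = 256;
    int blocks  = (N + threads - 1) / threads;  // enough blocks to cover N
    doubleElements<<<blocks, threads>>>(a, N);

    // Check for launch errors, then synchronize (which surfaces
    // errors that occur during kernel execution).
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) printf("Error: %s\n", cudaGetErrorString(err));
    cudaDeviceSynchronize();

    printf("a[10] = %d\n", a[10]);  // expect 20
    cudaFree(a);
    return 0;
}
```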

Part 1 Tasks

Colab Pro

Part 2: Managing Accelerated Application Memory with CUDA C/C++ Unified Memory and nsys


The CUDA Best Practices Guide, a highly recommended followup to this and other CUDA fundamentals labs, recommends a design cycle called APOD: Assess, Parallelize, Optimize, Deploy. In short, APOD prescribes an iterative design process, where developers can apply incremental improvements to their accelerated application’s performance, and ship their code. As developers become more competent CUDA programmers, more advanced optimization techniques can be applied to their accelerated codebases.

This lab will support such a style of iterative development. You will be using the Nsight Systems command line tool nsys to quantitatively measure your application’s performance and to identify opportunities for optimization, after which you will apply incremental improvements before learning new techniques and repeating the cycle. As a point of focus, many of the techniques you will be learning and applying in this lab deal with the specifics of how CUDA’s Unified Memory works. Understanding Unified Memory behavior is a fundamental skill for CUDA developers, and serves as a prerequisite to many more advanced memory management techniques.


By the time you complete this lab, you will be able to:

  • Use the Nsight Systems command line tool (nsys) to profile accelerated application performance.
  • Leverage an understanding of Streaming Multiprocessors to optimize execution configurations.
  • Understand the behavior of Unified Memory with regard to page faulting and data migrations.
  • Use asynchronous memory prefetching to reduce page faults and data migrations for increased performance.
  • Employ an iterative development cycle to rapidly accelerate and deploy applications.
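The objectives above can be sketched together: a grid sized from the device’s Streaming Multiprocessor count, and asynchronous prefetching of Unified Memory before GPU and CPU access (a sketch, assuming a CUDA toolchain; the kernel name and the SM multiplier are illustrative):

```cuda
#include <cstdio>

__global__ void init(float *a, int n)
{
    // Grid-stride loop: works correctly for any grid size.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        a[i] = 1.0f;
}

int main()
{
    int device;
    cudaGetDevice(&device);

    // Size the grid from the number of SMs on the current device.
    int numSMs;
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device);

    const int N = 1 << 20;
    float *a;
    cudaMallocManaged(&a, N * sizeof(float));

    // Prefetch to the GPU before the kernel runs, avoiding on-demand
    // page faults and piecemeal migrations.
    cudaMemPrefetchAsync(a, N * sizeof(float), device);

    init<<<numSMs * 32, 256>>>(a, N);
    cudaDeviceSynchronize();

    // Prefetch back to the CPU before host access.
    cudaMemPrefetchAsync(a, N * sizeof(float), cudaCpuDeviceId);
    cudaDeviceSynchronize();

    printf("a[0] = %f\n", a[0]);
    cudaFree(a);
    return 0;
}
```

The effect of prefetching shows up in the profiler: running `nsys profile --stats=true ./app` before and after adding the prefetch calls should show fewer, larger Unified Memory migrations.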

Part 2 Tasks

Colab Pro

Part 3: Asynchronous Streaming, and Visual Profiling for Accelerated Applications with CUDA C/C++

Recommended Reading: NVIDIA Developer Blog


The CUDA toolkit ships with Nsight Systems, a powerful GUI application to support the development of accelerated CUDA applications. Nsight Systems generates a graphical timeline of an accelerated application, with detailed information about CUDA API calls, kernel execution, memory activity, and the use of CUDA streams.

In this lab, you will be using the Nsight Systems timeline to guide you in optimizing accelerated applications. Additionally, you will learn some intermediate CUDA programming techniques to support your work: unmanaged memory allocation and migration; pinning, or page-locking host memory; and non-default concurrent CUDA streams.
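Those three techniques can be sketched in one program: pinned host memory allocated with `cudaMallocHost`, manual device allocation with `cudaMalloc`, and chunked copy/compute work issued into non-default streams (a sketch, assuming a CUDA toolchain; the kernel name and chunk count are illustrative):

```cuda
#include <cstdio>

__global__ void scale(float *a, int n, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= s;
}

int main()
{
    const int N = 1 << 20, CHUNK = N / 4;
    float *h, *d;

    // Pinned (page-locked) host memory: required for cudaMemcpyAsync
    // transfers to be truly asynchronous.
    cudaMallocHost(&h, N * sizeof(float));
    cudaMalloc(&d, N * sizeof(float));  // unmanaged device allocation
    for (int i = 0; i < N; ++i) h[i] = 1.0f;

    // Four non-default streams: copies and kernels issued to different
    // streams may overlap with each other.
    cudaStream_t streams[4];
    for (int s = 0; s < 4; ++s) cudaStreamCreate(&streams[s]);

    for (int s = 0; s < 4; ++s) {
        int off = s * CHUNK;
        cudaMemcpyAsync(d + off, h + off, CHUNK * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        scale<<<(CHUNK + 255) / 256, 256, 0, streams[s]>>>(d + off, CHUNK, 2.0f);
        cudaMemcpyAsync(h + off, d + off, CHUNK * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();

    printf("h[0] = %f\n", h[0]);  // expect 2.0
    for (int s = 0; s < 4; ++s) cudaStreamDestroy(streams[s]);
    cudaFreeHost(h);
    cudaFree(d);
    return 0;
}
```

In the Nsight Systems timeline, the per-stream copy and kernel rows make the overlap (or lack of it) directly visible.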

At the end of this lab, you will be presented with an assessment: accelerate and optimize a simple n-body particle simulator, which will allow you to demonstrate the skills you have developed during this course. Those of you who accelerate the simulator while maintaining its correctness will be granted a certification as proof of your competency.


By the time you complete this lab you will be able to:

  • Use Nsight Systems to visually profile the timeline of GPU-accelerated CUDA applications.
  • Use Nsight Systems to identify, and exploit, optimization opportunities in GPU-accelerated CUDA applications.
  • Utilize CUDA streams for concurrent kernel execution in accelerated applications.
  • (Optional Advanced Content) Use manual device memory allocation, including allocating pinned memory, in order to asynchronously transfer data in concurrent CUDA streams.

Part 3 Tasks

Colab Pro


  1. Chapter 31. Fast N-Body Simulation with CUDA

  2. Mini-N-Body

    • nbody-orig : the original, unoptimized simulation (also for CPU)
    • nbody-soa : Conversion from array of structures (AOS) data layout to structure of arrays (SOA) data layout
    • nbody-flush : Flush denormals to zero (no code changes, just a command line option)
    • nbody-block : Cache blocking
    • nbody-unroll / nbody-align : platform specific final optimizations (loop unrolling in CUDA, and data alignment on MIC)
  3. CUDA examples:

    • Simple
    • Utilities
    • Graphics
    • Imaging
    • Finance
    • Simulations
    • Advanced
    • CUDA Libraries
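The AOS-to-SOA refactor behind nbody-soa can be sketched in plain C++ (the type and function names here are illustrative, not from the Mini-N-Body repository): with a structure of arrays, each field lives in its own contiguous array, so loops over a single field vectorize on the CPU and coalesce memory accesses on the GPU.

```cpp
#include <vector>

// Array of structures: one struct per body; iterating over x alone
// strides through memory in sizeof(BodyAOS) steps.
struct BodyAOS { float x, y, z, vx, vy, vz; };

// Structure of arrays: one contiguous array per field.
struct BodiesSOA {
    std::vector<float> x, y, z, vx, vy, vz;
    explicit BodiesSOA(std::size_t n)
        : x(n), y(n), z(n), vx(n), vy(n), vz(n) {}
};

// Position update over the SOA layout: loads of x and vx are contiguous.
void step(BodiesSOA &b, float dt)
{
    for (std::size_t i = 0; i < b.x.size(); ++i)
        b.x[i] += b.vx[i] * dt;
}
```

The same field-per-array idea carries over to the CUDA version, where neighboring threads then read neighboring addresses.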

Eva W.