CUDA Memory Access: Global, Zero-Copy, Unified

In this post we will look at different models of sharing memory between CPU and GPU devices with CUDA. It’s not directly related to C++, but as C++ developers we love to have a full control over memory usage, and for good reason – as we will soon see. Efficient data transfer between the host and the device can boost performance of your CUDA algorithms, and it’s important to understand various memory exchange patterns and their tradeoffs.

There’s no better place to start from than the classical vector addition task:

cudaMalloc(&a, …);
cudaMalloc(&b, …);
cudaMalloc(&c, …);

cudaMemcpy(a, …, cudaMemcpyHostToDevice);
cudaMemcpy(b, …, cudaMemcpyHostToDevice);

kernel<<<...>>>(a, b, c);

cudaMemcpy(…, c, cudaMemcpyDeviceToHost);

The cudaMalloc allocates a chunk of Device memory, and cudaMemcpy copies the memory from/to device. Let’s see the Timeline trace in NVIDIA Visual Profiler:


It nicely shows all the interesting activities: Runtime API is a “CPU view” of the process, Memory and Compute show the time it takes to copy the data and execute the kernel respectively. This view also shows very clearly that (at least for a simple kernel) the memory operations take significant amount of time in the overall process, much larger than the computation itself.

Back to the memory analysis, first thing to notice is that this call is synchronous. The execution of the kernel code waits until the copying is complete, and only after the completion of the kernel we start copying the result back. We can obviously do better than that!

More Parallelism!

Let’s see what level of parallelism is supported by our device. First, we need to query device properties:

cudaDeviceProp props;
cudaGetDeviceProperties(&props, deviceid);

The deviceOverlap field indicates whether the GPU supports copying data in parallel with kernel execution. But it’s declared as deprecated, and we prefer checking the asyncEngineCount value instead. 0 would mean no overlap, value of 1 – we can copy memory in parallel with execution, and 2 – the device has two DMA engines, and can copy memory to and from device in parallel, while executing kernels at the same time.

The GPU that I’m currently using (Quadro M2000M) has a single “async” DMA engine, my ultimate performance limit would be having the whole process take the time of 3 buffer copies, with no delays and no processing in between. In this simple example it does not look like a huge improvement, but, of cause, the benefits of parallelism would be different for different kernels.

Our strategy would be breaking the buffers to smaller chunks, such that while executing the kernel on chunk n we will be copying chunk n+1 in.

We will be replacing cudaMemcpy calls with cudaMemcpyAsync, so that we don’t block the host code until copy operation is done. This function introduces additional parameter: stream, we will have to cudaStreamCreate different streams so they could run in parallel.

But in order for this operation to become truly asynchronous, the host memory must be pinned. Pinned memory is not pageable and is promised to reside in the memory as it can’t be swapped to the disk. There are two ways to get hold of a pinned memory: allocate it with cudaHostAlloc (instead of the standard operators new / malloc), or pin existing memory with cudaHostRegister. We will do the latter (with cudaHostRegisterPortable flag).

// Initialize the streams
static const int NUM_STREAMS = 2;
cudaStream_t s[NUM_STREAMS];
for (int i = 0; i < NUM_STREAMS; i++) {
   cudaStreamCreate(s + i);

// Schedule copy/kernel operations
auto chunks = 8;
auto chunkSize = size / chunks;
auto chunkLen = chunkSize * sizeof(int);
for (decltype(chunks) i = 0; i < chunks; i++) {
   auto offset = chunkSize * i;
   auto stream = s[i % NUM_STREAMS];
   cudaMemcpyAsync(dev_a + offset, a + offset, chunkLen, cudaMemcpyHostToDevice, stream);
   cudaMemcpyAsync(dev_b + offset, b + offset, chunkLen, cudaMemcpyHostToDevice, stream);
   addKernel<<<...>>>(dev_c + offset, dev_a + offset, dev_b + offset);

// Copy the results – in one go
cudaMemcpyAsync(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost, s[0]);


The profiler shows nicely the interlacing of the copying + kernels on the two streams. In some applications, where the kernel processing time is longer than the copying, we may need to use more streams to achieve similar optimization. The number of chunks was selected to be 8 for illustration, in this case we could get the same performance with only two chunks (as the kernel processing is much faster than the memory operations), but in other scenarios having more chunks may be an advantage. We can still squeeze a bit of performance improvement from the last copy operation:

cudaMemcpyAsync(c, dev_c, size * sizeof(int) - chunkLen, cudaMemcpyDeviceToHost, s[0]);
auto lastChunkOffset = chunkSize * (chunks - 1);
cudaMemcpyAsync(c + lastChunkOffset, dev_c + lastChunkOffset, chunkLen, cudaMemcpyDeviceToHost, s[1]);


The total time for the whole process is 2.15 msec. On systems supporting simultaneous copying to/from device memory, we could save the last copy operation and copy the ready chunks in parallel, in that example we would also benefit from a larger number of chunks.

Last observation to make on this version of the code: the cudaHostRegister operation itself takes a very long time, almost 50 times the whole copy-process part, but we would assume that real applications would use more efficient memory management, like caching and reusing pinned memory pages.

Zero Copy

Next, we will attempt to use Zero-Copy approach, which uses mapped memory. The memory needs, again, to be pinned on the host, either with cudaHostAlloc or using cudaHostRegister (this time with cudaHostRegisterMapped flag). Next we will need to acquire device pointer which would refer the same memory with cudaHostGetDevicePointer and no explicit copying will be needed. The resulting code definitely looks simpler:

// First, pin the memory (or cudaHostAlloc instead)
cudaHostRegister(h_a, …, cudaHostRegisterMapped);
cudaHostRegister(h_b, …, cudaHostRegisterMapped);
cudaHostRegister(h_c, …, cudaHostRegisterMapped);

cudaHostGetDevicePointer(&a, h_a, 0);
cudaHostGetDevicePointer(&b, h_b, 0);
cudaHostGetDevicePointer(&c, h_c, 0);

kernel<<<...>>>(a, b, c);

// unpin/release host memory

The NVIDIA Visual Profiler does not give us meaningful insides to the kernel’s processing time breakdown, but it takes much longer – 70.1 msec (compared to 2.15 in previous version)


Unified Memory

Unified Memory not to be confused with Unified Virtual Addressing (UVA) feature – the latter means that devices with compute capability 2.0 and later do not need to call cudaHostGetDevicePointer after cudaHostAlloc, as the same address can be reused in the device memory space (you can even invoke cudaHostGetDevicePointer to validate that you are getting the same pointer back). Note that this is not the case for cudaHostRegister pinned memory. Otherwise UVA is an extension to a Zero-copy memory access.

Unified Memory is a feature that was introduced in CUDA 6, and at the first glimpse may look very similar to UVA – both the host and the device can use the same memory pointers. First difference is that Unified Memory does not require a non-pageable memory, and  works with “regular” paged memory. Unified memory is allocated by cudaMallocManaged API. The memory is mapped into same virtual address both on the CPU and the GPU but the page appears as available in the memory only on one of the devices. Let’s assume the memory page is currently mapped on the CPU. When a device code will try to read the memory, a page fault will occur, and the Unified Memory driver will then unmap the memory from the host, map it to the GPU and will do the actual copying of the data. You will see this process in the Visual Profiler as Data Migration. We start from straight-forward use of the unified memory:

cudaMallocManaged(&a, size);
cudaMallocManaged(&b, size);
cudaMallocManaged(&c, size);

// fill a and b with input data - memcpy

kernel<<<...>>>(a, b, c);

// access the memory on CPU – memcpy c to CPU memory

The result leaves space for improvement: total duration is 85 msec.


By zooming into the timeline, you can see a lot of individual data “migrations”:


Note that the Visual Profiles can show the migrations in 2 forms – show each instance separately, as in the image above, or show fixed-size segments in the timeline (the default mode), where the color of the segment would indicate how much time during this segment the data migration was happening.


According to “Unified Memory for CUDA Beginners” blog by Mark Harris (, highly recommended to read) we can cudaMemPrefetchAsync to load the memory to the GPU before invoking the kernel. I feel that it’s a bit of cheating – I don’t need Unified Memory to make explicit memory copies. And eventually I could not apply the trick – it’s not supported by my GPU (the cudaDevAttrConcurrentManagedAccess attribute is 0).

I also tried to modify the code to have less CUDA “threads” and have each thread handle multiple vector entries, hoping that this will play better with the memory migration mechanism, but it gave me very similar results.


Test Setup

I ran the tests on a Quadro M2000M GPU (for laptops), which is based on Maxwell architecture. I also tried them on GeForce GTX 1080i (newer Pascal architecture) and got very similar results, with all of the durations reduced by certain factor.

The kernel was extremely simple, and as the computational part gets more complicated, the relative performance loss of inefficient memory copies reduces, and convenience becomes significant factor in choosing the right approach.

Same goes for system complexity – as we start splitting the work wo multiple GPU devices, we will benefit more from not taking care for memory transfers by ourselves. I’ve seen many benchmarks that show very similar performance of Unified Memory and hand-crafted memory copies, so my tests may not be representative for many use cases.

More on Unified Memory

In my little test the GPU was accessing every cell in the input/output vectors. But consider an algorithm that only needs a sparse/selective data access, but we don’t know in advance which parts of the vectors will be requested – in the “classical” approach we would still need to copy all the data to the device’s Global memory, while the Unified memory mechanism would manage to only copy the relevant bits.

Another useful implication of Unified memory that kernels can operate on a data which is larger than the size of the device memory, you can imagine the Global device memory being a “cache” to the whole CPU memory. It’s true that a “cache fault” in this setup is costly, but there may be no better alternatives.

Unified Memory is expected to perform very badly when multiple devices access same memory page in parallel. The worst case I can imagine is having two integers reside in the same memory page, have CPU read one of them in a loop, while the GPU would access the second. It may look like each uses a separate variable, but in fact it will cause endless page faults on both sides and the page will be migrated back and forth.

There are more aspects to Unified Memory not covered here at all: functions like cudaMemAdvise, which come to help the driver select the best strategy, algorithms for automatic prediction of which pages are going to be requested next.


Manual memory copying: gives the best performance, requires extra work to make things run in parallel, different techniques may be needed depending on the features of your device, e.g. whether it supports concurrent copying to and from GPU memory (2 DMA engines). Asynchronous copy operations require pinned memory.

Zero copy: convenience comes with a price, both in performance and in the need to grow the non-pageable memory size. Advantage: existing memory, that may even come from an external code, can be pinned and used as zero-copy.

Unified memory: even more convenient than Zero-copy feature, especially on complex multi-GPU systems. But requires a special memory allocation, can’t “upgrade” pre-allocated memory to become Unified.

Considering all the test limitations discussed, I believe that certain conclusions can still be made – in many cases the optimal solution requires explicit memory management, and it’s an integral part of the overall algorithm design. As C++ developers, we are not surprised!


CUDA Runtime Templates

The CUDA C/C++ platform allows different programming modes for invoking code on a GPU device. Probably the more familiar and definitely simpler way is writing a single .cu file which contains both the kernel function and the host wrapper with “<<< >>>” invocation syntax. The NVCC compiler would transparently embed compiled device code and all the management routines in the C++ object file. In this scenario the host part of the code will be using the “CUDA Runtime API”.

Alternative approach would be using “CUDA Driver API”, which resembles in some way the OpenCL standard. With the Driver API the programmer needs to explicitly initialize CUDA context and load the compiled device code (PTX). All the “Runtime API” functionality is also available as “Driver API”, often providing more flexibility to the user in the latter (Driver API function names usually start with “cu” prefix). And this post will be focused on the Runtime Compilation (NVRTC) part of the Driver API.

There may be various reasons to prefer the RTC approach, I would mention a couple of them:

  • No need for NVCC compiler – it’s all plain C++ code
  • Runtime tuning of compilation flags. The “architecture” is obvious – we know the exact device we run on, but others can be also relevant
  • Runtime Selection of Template Parameters

The last bullet is the one I’d like to explain with more details. The “kernel” functions may have template parameters, and as it’s always the case with templates, only certain specific instantiations of the function will be compiled (based on the usage on the calling side), and we can’t change this in runtime. But with NVRTC we can! In later posts I’m planning to show some examples on why we may actually need this (one spoiler: loop unrolling), but for now to explain the machinery.

Let’s start with following file:

template<typename type, int size>
__global__ void setKernel(type* c, type val) {
   auto idx = threadIdx.x * size;

   #pragma unroll(size)
   for (auto i = 0; i < size; i++) {
      c[idx] = val;

We could as well name it .txt, but it’s convenient to use the standard CUDA extension and even compile it in the “usual” way with NVCC, just to validate the syntax. In the code above, we initialize all c array cells with same value val, while each CUDA “thread” is responsible for a range of indices in the array. The question why we can’t just have each thread handle a single cell is out of scope right now, here it’s just for making the template parameters more interesting.

The rest of the post will show how can we load the above CUDA code in runtime, and invoke a specific setKernel<float, 10> instantiation of template kernel. 

Let’s see how we compile and run this kernel. The full code example can be found at, and it uses number of convenient wrappers on top of Device API which can be found in same repository (the wrappers are in no way complete and mainly serve the needs of this example).

At first, we select a CUDA device – by looking for a not integrated GPU (meaning – not integrated with CPU), set it as “current” and create the CUDA Context.

CudaDevice dev = CudaDevice::FindByProperties(CudaDeviceProperties::ByIntegratedType(false));
CudaContext ctx{ dev };

// see the reference for: cudaChooseDevice, cuInit, cuCtxCreate

Next, we load the source code of the CUDA program:

rtc::Program prog("myprog", rtc::Code::FromFile("..\\examples\\"));
// nvrtcCreateProgram

Now comes the first templates-related part: before we compile the program, we need to register the template instantiations of the kernel functions we are going to use. In our case we would like to use setKernel<float, 10>. The template string itself is built using the rtc::Kernel helper class:

auto kernel = rtc::Kernel("setKernel").instantiate<float, std::integral_constant<int, 10>>();
// uses nvrtcGetTypeName<T> to convert a type to a string (“float” in this example)

// nvrtcAddNameExpression

The Kernel::instantiate method takes a special care for std::integral_constant template parameter, and instead of converting it to a string, will use the integer value as is (10 in the above example).

It’s the time to compile the kernel:

// nvrtcCompileProgram

Note the use of device properties to set the highest supported architecture string. We can also control the C++ standard which is assumed by the compiler (C++14 in our case).

The compiled program (PTX) is loaded into a Module:

rtc::Module module{ ctx, prog };
// nvrtcGetPTX, cuModuleLoadDataEx

In order to find our setKernel<float, 10> function in the Module, we need first to get the decorated name, which is achieved using nvrtcGetLoweredName:

kernel.init(module, prog);
// nvrtcGetLoweredName, cuModuleGetFunction

Finally, we can launch the kernel on the GPU device:

kernel.launchAndWait(Stream::Default(), grid, 0, ptr, 23.1f);
// cuLaunchKernel, and cuStreamSynchronize to wait for completion

Note that the parameters to cuLaunchKernel are given as void* array, so we first get the pointers to all the function parameters (which start from the 4th parameter to Kernel::launch) :

template<typename... ARGS>
static inline std::vector<void*> BuildArgs(const ARGS& ...args) {
   return{ const_cast<void*>(reinterpret_cast<const void*>(&args))... };


Let’s recap on the main points of interest:

  • First, using the Runtime Compilation in CUDA we combined all the goodies of C++ templates with the ability to make the instantiation decisions in production! and this can be a very powerful tool for certain scenarios.
  • In order to use template functions with RTC, we need to be careful in invoking the APIs in the correct order:
    • Build the template instantiation string (nvrtcGetTypeName<T>) and register it in the program (nvrtcAddNameExpression) before the compilation.
    • Find the decorated name in the program (nvrtcGetLoweredName) after the compilation. The name provided to this function must exactly match the name first used with nvrtcAddNameExpression.
  • One important downside of RTC is our inability to validate the types of the parameters to the kernel function automatically by the compiler, so these must be carefully constructed.

Specific usage examples will be shown in next posts. All the code examples and helper classes can be found at


CUDA Lambdas

Lambda closures are an integral part of modern C++, in CUDA code they can be used in different levels. At the very basic, they can be used inside the device code:

__global__ void setValueInnerLambda(cudacpp::DeviceVector<int> c, int val) {
   auto idx = threadIdx.x;
   auto op = [=](int& i) { i = val; };

Starting with CUDA 7.5 we can create __device__ lambda and use it as a parameter to __global__ kernel (NOTE: this and later examples require to add a special command-line compilation flag: –expt-extended-lambda):

template<typename T>
__global__ void applyKernel(cudacpp::DeviceVector<int> c, T op) {
   auto idx = threadIdx.x;

void testLambda(int val) {
   // ...
   auto op = [=] __device__ (int& v) { v = val; };
   applyKernel<<<1, size>>>(dev_c, op);
   // ...

The lambda succeeded to capture the val argument by value. In this example we used int& as parameter type, starting with CUDA 9.0 auto parameters are supported in generic lambdas.

CUDA 8.0 adds support for __device__ __host__ lambdas, which can be used both in the device and the host code – this can be useful, for example, when the decision on whether we use GPU or CPU version of the code is done based on certain runtime condition.

Another interesting use case is capturing a copy of this object using [*this] specification (while [=] will only copy the this pointer, but not the members).

struct AddValue {
   int _val;
   // ... init, etc.

   void doApplyKernel(cudacpp::DeviceMemoryT<int>& dev_c) {
      auto f = [*this] __device__ (int& i) { i = _val; };
      applyKernel<<<1, dev_c.sizeElements()>>>(dev_c, f); 

The above code creates the lambda function and uses it in the same scope. Instead, we may want to return the lambda from the class and use it externally, like this:

struct AddValue {
   // ... continuing from previous example

   auto getFunc() { 
      return [*this] __device__ (int& i) { 
         i = _val; 

Unfortunately, this produces a compilation error:
error : The enclosing parent function (“getFunc”) for an extended __device__ lambda must not have deduced return type

The first attempt could be returning std::function<void(int&)> instead, but we can’t convert __device__ lambda to std::function.

As alternative the CUDA package provides nvstd::function (not surprisingly, via the <nvfunctional> header) that is capable of holding __device__ functions. But here we face another limitation: we can’t initialize nvstd::function in the host code, and pass it to device function.  That brings us to the following possible solution:

struct AddValue {
   // ... continuing from previous example

   // NOTE: 'getFunc' now becomes a __device__ function
   nvstd::function<void(int&)> __device__ getFunc() {
      return [*this] __device__ (int& i) {
         i = _val;

template<typename T>
__global__ void applyKernelGetFunc(cudacpp::DeviceVector<int> c, T op) {
   auto idx = threadIdx.x;

void testLambda(int val) {
   // ...
   AddValue addF{ val };
   applyKernelGetFunc<<<1, size>>>(dev_c, addF);
   // ...

This works as expected, but it’s worth noting that the kernel code produced is much longer (in number of binary instructions) compared to “pure” lambda versions [measured with cudobjdump, see Virtual CUDA post for usage details]. Hence use such constructs with caution, the overhead may be too costly.

All the code examples can be found at:
In same repository you can find Visual Studio project files (VS2015, CUDA 9.1).

Virtual CUDA

One of the most basic features of C++ is a runtime polymorphism, achieved with use of virtual functions. Interestingly enough, those are supported also by CUDA device code, but with certain restrictions – well, one may argue that these restrictions make the whole feature totally obsolete… but let’s first see a simple example of what it looks like.

class IntProvider {
   virtual __device__ int getNumber() const = 0;

class SimpleIntProvider
   : public IntProvider
   int _i;
   __device__ __inline__ SimpleIntProvider(int i) : _i(i) {}
   __device__ int getNumber() const override { return _i; }

class SumIntProvider
   : public IntProvider
   int _a;
   int _b;
   __device__ __inline__ SumIntProvider(int a, int b) : _a(a), _b(b) {}
   __device__ int getNumber() const override { return _a + _b; }

__device__ __inline__ void putValue(int& to, const IntProvider& ip) {
   to = ip.getNumber();

__global__ void addKernel(cudacpp::DeviceVector<int> c, int val)
   auto idx = threadIdx.x;
   SimpleIntProvider sip{ val };
   //SumIntProvider sip{ val, 2 };
   putValue(c[idx], sip);

Here the putValue device function calls a virtual method of IntProvider. The compiler places the virtual table for the classes in the global or constant device memory.

The main limitation of this solution is that “It is not allowed to pass as an argument to a __global__ function an object of a class derived from virtual base classes” (see docs). Which means that the runtime polymorphism can’t cross the host|device bridge (!) And that makes me wonder what would be a real case to use virtual functions in the CUDA code, compared to templated solution.

class SimpleIntProvider {
   int _i; 
   __device__ __inline__ SimpleIntProvider(int i) : _i(i) {}
   __device__ __inline__ int getNumber() const { return _i; } 

class SumIntProvider { 
   int _a;
   int _b;
   __device__ __inline__ SumIntProvider(int a, int b) : _a(a), _b(b) {}
   __device__ __inline__ int getNumber() const { return _a + _b; }

template<typename T>
__device__ __inline__ void putValue(int& to, const T& ip) {
   to = ip.getNumber();

We would not be surprised to discover that the template version is more efficient compared to the first one. One useful tool to compare kernels is looking at the binary instructions – using cuobjdump

cuobjdump -sass

You will see blocks like this in the output:

 Function : _ZN10nonvirtual15addKernelDirectEN7cudacpp12DeviceVectorIiEEi
 .headerflags @"EF_CUDA_SM20 EF_CUDA_PTX_SM(EF_CUDA_SM20)"
 /*0000*/ MOV R1, c[0x1][0x100]; /* 0x2800440400005de4 */
 /*0008*/ S2R R0, SR_TID.X; /* 0x2c00000084001c04 */
 /*0010*/ MOV32I R5, 0x4; /* 0x1800000010015de2 */
 /*0018*/ MOV R2, c[0x0][0x30]; /* 0x28004000c0009de4 */
 /*0020*/ IMAD.U32.U32 R4.CC, R0, R5, c[0x0][0x20]; /* 0x200b800080011c03 */
 /*0028*/ IADD R2, R2, 0x2; /* 0x4800c00008209c03 */
 /*0030*/ IMAD.U32.U32.HI.X R5, R0, R5, c[0x0][0x24]; /* 0x208a800090015c43 */
 /*0038*/ ST.E [R4], R2; /* 0x9400000000409c85 */
 /*0040*/ EXIT; /* 0x8000000000001de7 */

BTW, same can be done if you pass an EXE as a parameter, but it’s harder to understand which assembly refers to which kernel. Without the deep understanding of these instructions, we will use the golden rule that less lines in assembly is better. Moreover, we can see that the template version of the function looks exactly as the “most hardcoded version” of:

__global__ void addKernelDirect(cudacpp::DeviceVector<int> c, int val) {
   auto idx = threadIdx.x;
   c[idx] = val + 2;

which, again, is a great proof that in the kernel-side code we can benefit from the beloved C++ zero-cost abstractions. Needless to say that the virtual version produces much longer binary code, not talking about the possibly extra cost in fetching the vtable.

So if anybody comes with a good use case for virtual __device__ functions, let me know!

Full code examples can be found at:

CUDA++ Indexing

The very basic example of a toy kernel code starts with the line:

int idx = threadIdx.x;

This simple line contains in itself a decent amount of wisdom. First, it knows that the kernel was invoked with a single block, which is not a very practical case, more often the index calculation will look more like:

int idx = (blockIdx.x * blockDim.x) + threadIdx.x

Usually a validation that the index is in a valid range will follow, as the data size may not perfectly fit the total number of threads in the grid. But this line still makes important assumptions about the indexing of the blocks and the threads, in particular it knows that the indices are one-dimensional. This cheatsheet shows calculation of an index for all 9 permutations of block/thread dimensions. Still it does not cover all the real-life cases, as all the 9 functions return a single int index. Let’s make more order here.

We first start from the type of data we need to process (more often, of the output of the processing) – it can be one-dimensional array, a two-dimensional image, a three-dimensional matrix. Next, we will decide how to index the data in the kernel – it might be a “flat” index as in the above example, but, for example, for 2D image it makes sense to assign each kernel thread an (x, y) pixel, so the index becomes 2D, etc.

Following step would be deciding the total number of threads per block, taking in consideration shared memory requirements, warp size, hard limit on number of threads etc. In the same time, we need to define the indexing for threads in the block: (1024, 1, 1) or (32, 32, 1) or maybe (16, 8, 8). Having determined that, we can now reason about number of blocks needed in the grid, based on the total data size.

Of cause, the kernel code must be perfectly aware of all these decisions, and this coupling is exactly what we would like to fix with C++ power. Jumping to the solution, the kernel code could look like that:

template<int DIM_BLOCKS, int DIM_THREADS, typename T>
__global__ void addKernel(DeviceVector<T> c, const DeviceVector<T> a, const DeviceVector<T> b) {
   auto idx = Index<1>::create<DIM_BLOCKS, DIM_THREADS>();
   if (idx.inRange(c.size())) {
      c[idx] = a[idx] + b[idx];
  • It’s templated by the dimensions of the grid, and uses that information to build the correct index
  • It uses Index<1> as we know we are referencing an array, which is 1D
  • inRange helper function makes sure we are within the limits of the data
  • For convenience, DeviceVector defines operator[] based on Index<1>, in addition to std::size_t version

Accordingly, on the host side the kernel may be invoked this way:

auto grid = CreateGrid(Size<1>{4}, vec_c.size());  // automatically decide the number of blocks
addKernel<grid.DimBlocks, grid.DimThreads, int><<<grid.blocks, grid.threads>>>(vec_c, vec_a, vec_b);

A working implementation of this sample code can be found on One interesting aspect of it is that automatic CUDA variables, such as blockIdx/threadIdx can be used in external classes (Index in this case) and not directly in the kernel block.

We define the Grid as:

template<int DIM_BLOCKS, int DIM_THREADS>
struct Grid {
   Size<DIM_BLOCKS> blocks;
   Size<DIM_BLOCKS> threads;
   constexpr static int DimBlocks = DIM_BLOCKS;
   constexpr static int DimThreads = DIM_THREADS;

Size refers to:

template<int DIM> struct Size;  // DIM is the number of dimensions

The Size is undefined for general case and is specialized for possible dimensions: 1/2/3. In addition, there’s a specialization for Size<0>, which is useful for examples like the original code we started with: when the kernel was invoked with a single block, we could just use threadIdx.x as an index. In the proposed solution, that would manifest as:

Grid<0, 1> grid({}, {vecSize});

And the Index class can do same “optimization” of ignoring the block index, effectively assuming it’s always 0.

The Index is created with one of the specializations of:

template<int DIM_BLOCKS, int DIM_THREADS>
__host__ __device__ __inline__ static Index create();

For example:

__host__ __device__ __inline__ static Index create<1, 1>() {
   // Note the use of built-in variables
   return { (blockIdx.x * blockDim.x) + threadIdx.x };

Currently a partial set of most useful version is implemented, but it can be clearly extended to support any combination of grid dimensions, as long as it makes sense to the application.

Hello, CUDA++

CUDA for C/C++ is NVIDIA’s parallel programming platform for its GPU processors. Although ++ is in the name, many CUDA examples that you find are mainly C-style code, especially when talking about the kernels. Some libraries wrap the host APIs in nicer C++ classes, like by Eyal Rozenberg, but the kernel side is much overlooked and I believe it’s a missed opportunity to create a safer and cleaner code.

Let’s look at the most basic “Hello World” example, which also happens to be part of a CUDA project template in the Visual Studio. The kernel code looks like this:

__global__ void addKernel(int *c, const int *a, const int *b) {
   int idx = threadIdx.x;
   c[idx] = a[idx] + b[idx];

In this example we are adding two vectors a and b, and store the result in c. The kernel is invoked by the host as:

addKernel<<<1, size>>>(dev_c, dev_a, dev_b);

Note that dev_c (and other parameters) are all of type int*, but these must be pointers in the device (GPU) memory space. The problem is that nothing will warn you during compile time if you will try to send host-memory pointers to the kernel.

Instead let’s create a specialized class to hold device pointers, and only allow it to be passed to the kernel.

template<typename T>
class DeviceVector {
   T *_p;
   explicit DeviceVector (T *p) : _p(p) {}
   DeviceVector() : _p(nullptr) {}

   // we can add a destructor which will cudaFree the memory
   __device__ __inline__ T& operator[](size_t i) { return _p[i]; }
   __device__ __inline__ const T& operator[](size_t i) const { return _p[i]; }
   __device__ __host__ __inline__ operator T* () { return _p; }

   static DeviceVector allocate(size_t sz) {
      DeviceVector res;
      cudaMalloc((void**)&res._p, sizeof(T) * sz);
      return res;

Now the kernel may be written as:

__global__ void addKernel(DeviceVector<int> c, const DeviceVector<int> a, const DeviceVector<int> b) {
   int idx = threadIdx.x;
   c[idx] = a[idx] + b[idx];

While the host code becomes this:

auto dev_a = DeviceVector<int>::allocate(size);
auto dev_b = DeviceVector<int>::allocate(size);
auto dev_c = DeviceVector<int>::allocate(size);
// fill up the initial data
addKernel<<<1, size>>>(dev_c, dev_a, dev_b);

And now, as expected, the compilation will fail if we’ll try to send unknown int pointer to the kernel.

Moreover, we can further generalize the kernel to become:

template< typename T >
__global__ void addKernel(DeviceVector<T> c, const DeviceVector<T> a, const DeviceVector<T> b) {
   int idx = threadIdx.x;
   c[idx] = a[idx] + b[idx];

And the invoking code remains the same, as regular template type deduction rules are perfectly applicable here as well.

Next time I will address the other issue of the thread index calculation – here we simply used threadIdx.x and were not required to check for the boundaries, but assume the launching code is changed to:

addKernel<<<2, size/2>>>(dev_c, dev_a, dev_b);

When spanning more than one parallel block, the index calculation in the kernel should be enhanced. But here again we can use Zero-cost C++ abstractions to avoid such inconsistencies.

PS. The code is available at: