In this post we will look at different models of sharing memory between CPU and GPU devices with CUDA. It’s not directly related to C++, but as C++ developers we love to have full control over memory usage, and for good reason – as we will soon see. Efficient data transfer between the host and the device can boost the performance of your CUDA algorithms, and it’s important to understand the various memory-exchange patterns and their tradeoffs.
There’s no better place to start than the classical vector-addition task:
cudaMalloc(&a, …);
cudaMalloc(&b, …);
cudaMalloc(&c, …);

cudaMemcpy(a, …, cudaMemcpyHostToDevice);
cudaMemcpy(b, …, cudaMemcpyHostToDevice);

kernel<<<...>>>(a, b, c);
cudaDeviceSynchronize();

cudaMemcpy(…, c, cudaMemcpyDeviceToHost);
The cudaMalloc call allocates a chunk of device memory, and cudaMemcpy copies memory to/from the device. Let’s look at the timeline trace in the NVIDIA Visual Profiler:
It nicely shows all the interesting activities: Runtime API is the “CPU view” of the process, while Memory and Compute show the time it takes to copy the data and to execute the kernel, respectively. This view also makes it very clear that (at least for a simple kernel) the memory operations take a significant amount of time in the overall process, much more than the computation itself.
Back to the memory analysis: the first thing to notice is that these copies are synchronous. The kernel does not start executing until the copying is complete, and only after the kernel finishes do we start copying the result back. We can obviously do better than that!
More Parallelism!
Let’s see what level of parallelism is supported by our device. First, we need to query device properties:
cudaDeviceProp props;
cudaGetDeviceProperties(&props, deviceid);
The deviceOverlap field indicates whether the GPU supports copying data in parallel with kernel execution. However, it is declared as deprecated, so we prefer to check the asyncEngineCount value instead: 0 means no overlap; 1 means we can copy memory in parallel with kernel execution; and 2 means the device has two DMA engines and can copy memory to and from the device in parallel, while executing kernels at the same time.
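As an illustration, a minimal capability check could look like this (device 0 is assumed; the comments restate the meaning of each value described above):
cudaDeviceProp props;
cudaGetDeviceProperties(&props, 0);
if (props.asyncEngineCount == 2) {
    // two DMA engines: copies to and from the device can overlap with kernels
} else if (props.asyncEngineCount == 1) {
    // one DMA engine: one copy direction at a time can overlap with kernel execution
} else {
    // no overlap between copies and kernel execution
}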
Since the GPU I’m currently using (a Quadro M2000M) has a single “async” DMA engine, my ultimate performance limit would be to have the whole process take the time of three buffer copies, with no delays and no processing in between. In this simple example that does not look like a huge improvement, but, of course, the benefits of parallelism differ from kernel to kernel.
Our strategy will be to break the buffers into smaller chunks, so that while the kernel executes on chunk n we are already copying chunk n+1 in.
We will replace the cudaMemcpy calls with cudaMemcpyAsync, so that the host code is not blocked until the copy operation is done. This function introduces an additional parameter, a stream; we will have to create several streams with cudaStreamCreate so that operations on them can run in parallel.
But in order for these operations to become truly asynchronous, the host memory must be pinned. Pinned memory is not pageable: it is guaranteed to stay resident in physical memory, since it cannot be swapped to disk. There are two ways to get hold of pinned memory: allocate it with cudaHostAlloc (instead of the standard new / malloc), or pin existing memory with cudaHostRegister. We will do the latter (with the cudaHostRegisterPortable flag).
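As a sketch of that pinning step (the buffer names and element count match the code below, but the exact sizes are an assumption):
// Pin the existing host allocations so the async copies can truly be asynchronous
cudaHostRegister(a, size * sizeof(int), cudaHostRegisterPortable);
cudaHostRegister(b, size * sizeof(int), cudaHostRegisterPortable);
cudaHostRegister(c, size * sizeof(int), cudaHostRegisterPortable);
// ... and once all the work is done:
cudaHostUnregister(a);
cudaHostUnregister(b);
cudaHostUnregister(c);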
// Initialize the streams
static const int NUM_STREAMS = 2;
cudaStream_t s[NUM_STREAMS];
for (int i = 0; i < NUM_STREAMS; i++) {
    cudaStreamCreate(s + i);
}

// Schedule copy/kernel operations
auto chunks = 8;
auto chunkSize = size / chunks;
auto chunkLen = chunkSize * sizeof(int);
for (decltype(chunks) i = 0; i < chunks; i++) {
    auto offset = chunkSize * i;
    auto stream = s[i % NUM_STREAMS];
    cudaMemcpyAsync(dev_a + offset, a + offset, chunkLen, cudaMemcpyHostToDevice, stream);
    cudaMemcpyAsync(dev_b + offset, b + offset, chunkLen, cudaMemcpyHostToDevice, stream);
    addKernel<<<...>>>(dev_c + offset, dev_a + offset, dev_b + offset);
}

// Copy the results – in one go
cudaDeviceSynchronize();
cudaMemcpyAsync(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost, s[0]);
The profiler nicely shows the interleaving of the copies and kernels on the two streams. In some applications, where the kernel processing time is longer than the copying, we may need more streams to achieve a similar optimization. The number of chunks was set to 8 for illustration; in this case we could get the same performance with only two chunks (as the kernel processing is much faster than the memory operations), but in other scenarios having more chunks may be an advantage. We can still squeeze a bit more performance out of the last copy operation:
cudaMemcpyAsync(c, dev_c, size * sizeof(int) - chunkLen, cudaMemcpyDeviceToHost, s[0]);
auto lastChunkOffset = chunkSize * (chunks - 1);
cudaMemcpyAsync(c + lastChunkOffset, dev_c + lastChunkOffset, chunkLen, cudaMemcpyDeviceToHost, s[1]);
The total time for the whole process is 2.15 msec. On systems that support simultaneous copying to and from device memory, we could drop the final bulk copy and instead copy each chunk back as soon as it is ready; in that case we would also benefit from a larger number of chunks.
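A sketch of how the loop could look on such a device – the per-chunk device-to-host copy is issued on the same stream right after the kernel, so stream ordering guarantees it runs only when that chunk is ready (the launch configuration is an assumption, since it was elided above):
for (decltype(chunks) i = 0; i < chunks; i++) {
    auto offset = chunkSize * i;
    auto stream = s[i % NUM_STREAMS];
    cudaMemcpyAsync(dev_a + offset, a + offset, chunkLen, cudaMemcpyHostToDevice, stream);
    cudaMemcpyAsync(dev_b + offset, b + offset, chunkLen, cudaMemcpyHostToDevice, stream);
    addKernel<<<blocks, threads, 0, stream>>>(dev_c + offset, dev_a + offset, dev_b + offset);
    // copy this chunk of the result back as soon as its kernel completes
    cudaMemcpyAsync(c + offset, dev_c + offset, chunkLen, cudaMemcpyDeviceToHost, stream);
}
cudaDeviceSynchronize();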
A last observation on this version of the code: the cudaHostRegister operation itself takes a very long time – almost 50 times as long as the whole copy-and-process part – but we can assume that real applications would use more efficient memory management, such as caching and reusing pinned memory pages.
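One way to do that is to pin a buffer once and keep it registered for as long as it is reused; a minimal sketch of such a helper (PinnedRegistration is a hypothetical name, not a CUDA API):
struct PinnedRegistration {
    void* ptr;
    PinnedRegistration(void* p, size_t bytes) : ptr(p) {
        // pay the registration cost once, up front
        cudaHostRegister(ptr, bytes, cudaHostRegisterPortable);
    }
    ~PinnedRegistration() {
        cudaHostUnregister(ptr);
    }
    PinnedRegistration(const PinnedRegistration&) = delete;
    PinnedRegistration& operator=(const PinnedRegistration&) = delete;
};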
Zero Copy
Next, we will try the zero-copy approach, which uses mapped memory. The memory, again, needs to be pinned on the host, either with cudaHostAlloc or with cudaHostRegister (this time with the cudaHostRegisterMapped flag). Then we acquire a device pointer that refers to the same memory with cudaHostGetDevicePointer, and no explicit copying is needed. The resulting code definitely looks simpler:
// First, pin the memory (or cudaHostAlloc instead)
cudaHostRegister(h_a, …, cudaHostRegisterMapped);
cudaHostRegister(h_b, …, cudaHostRegisterMapped);
cudaHostRegister(h_c, …, cudaHostRegisterMapped);

cudaHostGetDevicePointer(&a, h_a, 0);
cudaHostGetDevicePointer(&b, h_b, 0);
cudaHostGetDevicePointer(&c, h_c, 0);

kernel<<<...>>>(a, b, c);
cudaDeviceSynchronize();

// unpin/release host memory
cudaHostUnregister(h_a);
cudaHostUnregister(h_b);
cudaHostUnregister(h_c);
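For reference, the cudaHostAlloc route mentioned in the comment would look roughly like this – the allocation comes back already pinned and mapped, and is released with cudaFreeHost (pointer names are placeholders):
int *h_a, *a;
cudaHostAlloc(&h_a, size * sizeof(int), cudaHostAllocMapped);
cudaHostGetDevicePointer(&a, h_a, 0);
// ... use a in kernels, h_a on the host ...
cudaFreeHost(h_a);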
The NVIDIA Visual Profiler does not give us meaningful insights into the kernel’s processing-time breakdown, but the kernel takes much longer – 70.1 msec (compared to 2.15 msec in the previous version).
Unified Memory
Unified Memory is not to be confused with the Unified Virtual Addressing (UVA) feature – the latter means that on devices with compute capability 2.0 and later there is no need to call cudaHostGetDevicePointer after cudaHostAlloc, as the same address is valid in the device memory space (you can even invoke cudaHostGetDevicePointer to verify that you get the same pointer back). Note that this is not the case for memory pinned with cudaHostRegister. Otherwise, UVA is an extension of zero-copy memory access.
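A tiny sketch of that sanity check (the pointer names are placeholders):
int *h_p = nullptr, *d_p = nullptr;
cudaHostAlloc(&h_p, size * sizeof(int), cudaHostAllocDefault);
cudaHostGetDevicePointer(&d_p, h_p, 0);
// on a UVA-capable device, d_p == h_p and h_p can be passed to kernels directly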
Unified Memory is a feature that was introduced in CUDA 6, and at first glance it may look very similar to UVA – both the host and the device can use the same memory pointers. The first difference is that Unified Memory does not require non-pageable memory and works with “regular” paged memory. Unified memory is allocated with the cudaMallocManaged API. The memory is mapped to the same virtual address on both the CPU and the GPU, but each page is resident on only one of them at a time. Let’s assume a memory page is currently resident on the CPU. When device code tries to read that memory, a page fault occurs, and the Unified Memory driver unmaps the page from the host, maps it to the GPU and performs the actual copying of the data. You will see this process in the Visual Profiler as Data Migration. We start with a straightforward use of unified memory:
cudaMallocManaged(&a, size);
cudaMallocManaged(&b, size);
cudaMallocManaged(&c, size);

// fill a and b with input data - memcpy

kernel<<<...>>>(a, b, c);
cudaDeviceSynchronize();

// access the memory on CPU – memcpy c to CPU memory
The result leaves room for improvement: the total duration is 85 msec.
By zooming into the timeline, you can see a lot of individual data “migrations”:
Note that the Visual Profiler can show the migrations in two forms – each instance separately, as in the image above, or as fixed-size segments on the timeline (the default mode), where the color of a segment indicates how much of that segment was spent on data migration.
According to the “Unified Memory for CUDA Beginners” blog post by Mark Harris (https://devblogs.nvidia.com/unified-memory-cuda-beginners/, highly recommended reading), we can use cudaMemPrefetchAsync to load the memory onto the GPU before invoking the kernel. I feel that it’s a bit of a cheat – I don’t need Unified Memory to make explicit memory copies. And in the end I could not apply the trick at all – it’s not supported by my GPU (the cudaDevAttrConcurrentManagedAccess attribute is 0).
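For completeness, a sketch of what the prefetch would look like on a GPU that does support it (device 0 is assumed, and a/b/size are the managed buffers and byte count from the code above):
int concurrentManagedAccess = 0;
cudaDeviceGetAttribute(&concurrentManagedAccess, cudaDevAttrConcurrentManagedAccess, 0);
if (concurrentManagedAccess) {
    // migrate the inputs to device 0 before the kernel launch (default stream)
    cudaMemPrefetchAsync(a, size, 0, 0);
    cudaMemPrefetchAsync(b, size, 0, 0);
}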
I also tried modifying the code to use fewer CUDA “threads” and have each thread handle multiple vector entries, hoping that this would play better with the memory-migration mechanism, but it gave very similar results.
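A common way to structure such a kernel is a grid-stride loop; here is a sketch (addKernelStrided and n are hypothetical names, not from the code above):
__global__ void addKernelStrided(int* c, const int* a, const int* b, int n)
{
    // each thread processes elements i, i + stride, i + 2*stride, ...
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        c[i] = a[i] + b[i];
    }
}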
Discussion
Test Setup
I ran the tests on a Quadro M2000M GPU (a laptop GPU), which is based on the Maxwell architecture. I also tried them on a GeForce GTX 1080i (the newer Pascal architecture) and got very similar results, with all of the durations reduced by a certain factor.
The kernel was extremely simple; as the computational part gets more complicated, the relative performance loss from inefficient memory copies shrinks, and convenience becomes a more significant factor in choosing the right approach.
The same goes for system complexity – as we start splitting the work across multiple GPU devices, we will benefit more from not having to take care of the memory transfers ourselves. I’ve seen many benchmarks that show very similar performance for Unified Memory and hand-crafted memory copies, so my tests may not be representative of many use cases.
More on Unified Memory
In my little test the GPU accessed every cell of the input/output vectors. But consider an algorithm that needs only sparse/selective data access, where we don’t know in advance which parts of the vectors will be requested – in the “classical” approach we would still need to copy all the data to the device’s global memory, while the Unified Memory mechanism would copy only the relevant bits.
Another useful implication of Unified Memory is that kernels can operate on data that is larger than the device memory; you can imagine the global device memory acting as a “cache” for the whole CPU memory. It’s true that a “cache miss” in this setup is costly, but there may be no better alternative.
Unified Memory can be expected to perform very badly when multiple devices access the same memory page in parallel. The worst case I can imagine is two integers residing in the same memory page, with the CPU reading one of them in a loop while the GPU accesses the other. It may look as if each processor uses a separate variable, but in fact this will cause endless page faults on both sides as the page is migrated back and forth.
There are more aspects of Unified Memory not covered here at all: functions like cudaMemAdvise, which help the driver select the best strategy, and algorithms that automatically predict which pages will be requested next.
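As a flavor of what those hints look like, a sketch of a few advice calls on the managed buffers from earlier (device 0 is assumed, and the usefulness of each hint depends on the GPU and the access pattern):
// the inputs are mostly read by the GPU – allow read-duplication
cudaMemAdvise(a, size, cudaMemAdviseSetReadMostly, 0);
cudaMemAdvise(b, size, cudaMemAdviseSetReadMostly, 0);
// keep the result page-resident on the CPU, but let device 0 access it
cudaMemAdvise(c, size, cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);
cudaMemAdvise(c, size, cudaMemAdviseSetAccessedBy, 0);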
Results
Manual memory copying: gives the best performance, but requires extra work to make things run in parallel, and different techniques may be needed depending on the features of your device, e.g. whether it supports concurrent copying to and from GPU memory (two DMA engines). Asynchronous copy operations require pinned memory.
Zero copy: the convenience comes with a price, both in performance and in the growth of the non-pageable memory size. Advantage: existing memory, which may even come from external code, can be pinned and used as zero-copy.
Unified memory: even more convenient than zero-copy, especially on complex multi-GPU systems, but it requires a special memory allocation – you can’t “upgrade” pre-allocated memory to become unified.
Considering all the test limitations discussed, I believe certain conclusions can still be drawn – in many cases the optimal solution requires explicit memory management, and it is an integral part of the overall algorithm design. As C++ developers, we are not surprised!