CUDA Lambdas

Lambda closures are an integral part of modern C++, in CUDA code they can be used in different levels. At the very basic, they can be used inside the device code:

__global__ void setValueInnerLambda(cudacpp::DeviceVector<int> c, int val) {
   auto idx = threadIdx.x;
   auto op = [=](int& i) { i = val; };
   op(c[idx]);
}

Starting with CUDA 7.5 we can create __device__ lambda and use it as a parameter to __global__ kernel (NOTE: this and later examples require to add a special command-line compilation flag: –expt-extended-lambda):

template<typename T>
__global__ void applyKernel(cudacpp::DeviceVector<int> c, T op) {
   auto idx = threadIdx.x;
   op(c[idx]);
}

void testLambda(int val) {
   // ...
   auto op = [=] __device__ (int& v) { v = val; };
   applyKernel<<<1, size>>>(dev_c, op);
   // ...
}

The lambda succeeded to capture the val argument by value. In this example we used int& as parameter type, starting with CUDA 9.0 auto parameters are supported in generic lambdas.

CUDA 8.0 adds support for __device__ __host__ lambdas, which can be used both in the device and the host code – this can be useful, for example, when the decision on whether we use GPU or CPU version of the code is done based on certain runtime condition.

Another interesting use case is capturing a copy of this object using [*this] specification (while [=] will only copy the this pointer, but not the members).

struct AddValue {
   int _val;
   // ... init, etc.

   void doApplyKernel(cudacpp::DeviceMemoryT<int>& dev_c) {
      auto f = [*this] __device__ (int& i) { i = _val; };
      applyKernel<<<1, dev_c.sizeElements()>>>(dev_c, f); 
   }
};

The above code creates the lambda function and uses it in the same scope. Instead, we may want to return the lambda from the class and use it externally, like this:

struct AddValue {
   // ... continuing from previous example

   // DOES NOT COMPILE!
   auto getFunc() { 
      return [*this] __device__ (int& i) { 
         i = _val; 
      }; 
   } 
};

Unfortunately, this produces a compilation error:
error : The enclosing parent function (“getFunc”) for an extended __device__ lambda must not have deduced return type

The first attempt could be returning std::function<void(int&)> instead, but we can’t convert __device__ lambda to std::function.

As alternative the CUDA package provides nvstd::function (not surprisingly, via the <nvfunctional> header) that is capable of holding __device__ functions. But here we face another limitation: we can’t initialize nvstd::function in the host code, and pass it to device function.  That brings us to the following possible solution:

struct AddValue {
   // ... continuing from previous example

   // NOTE: 'getFunc' now becomes a __device__ function
   nvstd::function<void(int&)> __device__ getFunc() {
      return [*this] __device__ (int& i) {
         i = _val;
      };
   }
};

template<typename T>
__global__ void applyKernelGetFunc(cudacpp::DeviceVector<int> c, T op) {
   auto idx = threadIdx.x;
   op.getFunc()(c[idx]);
}

void testLambda(int val) {
   // ...
   AddValue addF{ val };
   applyKernelGetFunc<<<1, size>>>(dev_c, addF);
   // ...
}

This works as expected, but it’s worth noting that the kernel code produced is much longer (in number of binary instructions) compared to “pure” lambda versions [measured with cudobjdump, see Virtual CUDA post for usage details]. Hence use such constructs with caution, the overhead may be too costly.

All the code examples can be found at: https://github.com/mgopshtein/cudacpp/blob/master/examples/lambda.cu
In same repository you can find Visual Studio project files (VS2015, CUDA 9.1).

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s