The kernel is written in a variant of C++ and is what runs on the GPU. Loops here become implicit: the loop induction variable is retrieved from threadIdx.

Note

Syntactically, you will probably need to look up how to do everything.

The things permitted here will vary depending on the release of the kernel compiler (NVCC).

The compiled kernel is turned into PTX instructions (Assembly) for the GPU.

Warning

When you change machines, remember to recompile with nvcc!

On ecetesla2 you may need to run nvcc -compiler-bindir /usr/bin/gcc-6 -ptx <CUDA FILE>

The kernel has none of the guarantees that we get in Rust.

What belongs in the kernel

  • Loops, as long as the iterations are not sequentially dependent
  • Problems of at most three dimensions; for anything higher-dimensional you need to add some loops
  • You can try to flatten the dimensions of your problem: a rectangular array is really just a linear array. This might make things faster
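The flattening idea can be sketched on the host side in plain Rust (the function name is illustrative): element (row, col) of a row-major rectangular array lives at index row * cols + col in the flat array.

```rust
// Illustrative sketch: flattening a 2-D index into a 1-D one.
// A `rows x cols` rectangular array stored row-major in a flat Vec.
fn flat_index(row: usize, col: usize, cols: usize) -> usize {
    row * cols + col
}

fn main() {
    let (rows, cols) = (3, 4);
    let mut data = vec![0u32; rows * cols]; // flat storage
    for r in 0..rows {
        for c in 0..cols {
            data[flat_index(r, c, cols)] = (r * 10 + c) as u32;
        }
    }
    // Element at row 2, column 3 sits at index 2*4 + 3 = 11.
    assert_eq!(flat_index(2, 3, cols), 11);
    assert_eq!(data[11], 23);
    println!("flattened OK");
}
```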

Note that every branch that any thread in a warp takes will be executed by the whole warp (there is no branch prediction), and this can be slow. Make sure your if/else branches are not redundant, as both sides will execute. A loop will cause the workgroup to wait for the maximum number of iterations taken by any work item. The compiler will try to unroll loops if possible.

Atomics

Race conditions can still occur, because threads share memory. This means that you need to use atomic functions if multiple threads want to write to the same location. The usual operations (add, exchange, compare-and-swap, and so on) apply here.
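The kernel-side atomic functions (atomicAdd and friends) are CUDA-specific, but the idea is the same as host-side atomics. A Rust sketch of why they're needed (the function name is illustrative): many threads doing a read-modify-write on one location only produce the right answer if the update is atomic.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;

// Many threads incrementing one shared location. Without an atomic
// (or a lock), the read-modify-write would be a race condition.
fn parallel_count(n_threads: u32, per_thread: u32) -> u32 {
    let counter = Arc::new(AtomicU32::new(0));
    let handles: Vec<_> = (0..n_threads)
        .map(|_| {
            let c = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..per_thread {
                    c.fetch_add(1, Ordering::Relaxed); // analogous to CUDA's atomicAdd
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    counter.load(Ordering::Relaxed)
}

fn main() {
    // Every increment is counted exactly once.
    assert_eq!(parallel_count(8, 1000), 8000);
    println!("count OK");
}
```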


Host Code

This launches our CUDA kernel to do work. We want to minimize the amount of unsafe code that we write. In ECE459, we write our host code in Rust using rustacuda.

  • Start the API using rustacuda::init(CudaFlags::empty())?;
  • Then we need to get a device (the graphics card, or whatever does the work) Device::get_device(0)?;
  • Now we need to get a context, this is basically a process. Context::create_and_push(ContextFlags::MAP_HOST | ContextFlags::SCHED_AUTO, device)?; we often don’t actively use this directly, but we need to be in possession of a context
  • Then we read in the compiled PTX file CString::new(include_str!("PATH"))?;
  • Then we load this into a module Module::load_from_string(&ptx)?;
  • Then we create a stream to which we assign work: Stream::new(StreamFlags::NON_BLOCKING, None)?;

At this point the generic setup is done. We now need to transfer data between main memory and the device, using device buffers to send data to the kernel. We need device buffers for both input and output.

We launch the kernel in an unsafe block: launch!(module.FUNCTION<<<BLOCKS, THREADS_PER_BLOCK, 0, stream>>>(...)) (the third parameter is the amount of dynamic shared memory), passing the device buffers as device pointers. Because the kernel launch is asynchronous, we must then make sure all work is done before we continue. When the synchronization is done, we copy the results out of the device buffer back into main memory.

Example…

#[macro_use]
extern crate rustacuda;
 
use rustacuda::prelude::*;
use std::error::Error;
use std::ffi::CString;
 
fn main() -> Result<(), Box<dyn Error>> {
    // Set up the context, load the module, and create a stream to run kernels in.
    rustacuda::init(CudaFlags::empty())?;
    let device = Device::get_device(0)?;
    let _ctx = Context::create_and_push(ContextFlags::MAP_HOST | ContextFlags::SCHED_AUTO, device)?;
 
    let ptx = CString::new(include_str!("../resources/add.ptx"))?;
    let module = Module::load_from_string(&ptx)?;
    let stream = Stream::new(StreamFlags::DEFAULT, None)?;
 
    // Create buffers for data
    let mut in_x = DeviceBuffer::from_slice(&[1.0f32; 10])?;
    let mut in_y = DeviceBuffer::from_slice(&[2.0f32; 10])?;
    let mut out_1 = DeviceBuffer::from_slice(&[0.0f32; 10])?;
 
    // This kernel adds each element in 'in_x' and 'in_y' and writes the result into 'out'.
    unsafe {
        // Launch the kernel with one block of one thread, no dynamic shared memory on 'stream'.
        let result = launch!(module.sum<<<1, 1, 0, stream>>>(
            in_x.as_device_ptr(),
            in_y.as_device_ptr(),
            out_1.as_device_ptr(),
            out_1.len()
        ));
        result?;
    }
 
    // Kernel launches are asynchronous, so we wait for the kernels to finish executing.
    stream.synchronize()?;
 
    // Copy the results back to host memory
    let mut out_host = [0.0f32; 10];
    out_1.copy_to(&mut out_host[0..10])?;
 
    for x in out_host.iter() {
        assert_eq!(3.0 as u32, *x as u32);
    }
 
    println!("Launched kernel successfully.");
    Ok(())
}

Kernel…

extern "C" __constant__ int my_constant = 314;
 
extern "C" __global__ void sum(const float* x, const float* y, float* out, int count) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < count; i += blockDim.x * gridDim.x) {
        out[i] = x[i] + y[i];
    }
}

CUDA Slower than Parallel CPU

If you notice that using CUDA is slower than parallel CPU-bound execution, figure out where the time is being spent; if it's not on overhead, then we're not taking advantage of the available hardware. The number of threads per block should always be a multiple of 32, with a maximum of 512; numbers like 256 and 128 are good ones to start with. Then you need to adjust the grid size: divide the number of data points by the threads per block to give the number of blocks. If the division isn't exact, round up and have the kernel guard against out-of-range indices. The only things you should need to change are inside the angle brackets of the CUDA function invocation.
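The grid-size calculation above can be sketched in Rust (the function name is illustrative); the rounding-up trick is the standard ceiling division (points + threads_per_block - 1) / threads_per_block.

```rust
// Sketch: choose a grid size (number of blocks) for `points` data points.
fn blocks_for(points: u32, threads_per_block: u32) -> u32 {
    // Round up so every point is covered even when the division isn't
    // exact; the kernel then guards against out-of-range indices.
    (points + threads_per_block - 1) / threads_per_block
}

fn main() {
    let threads_per_block = 256; // a multiple of 32
    assert_eq!(blocks_for(10_000, threads_per_block), 40); // 10000/256 = 39.06, rounded up
    assert_eq!(blocks_for(1024, threads_per_block), 4);    // divides evenly
    println!("grid sizes OK");
}
```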

Also, if your indexing strategy relies on blockIdx.x alone, it only works when each work item gets its own block; otherwise you need to change code in the kernel. The correct calculation for the position we are looking at is threadIdx.x + blockIdx.x * blockDim.x, which accounts for both the size of the block and the thread within the block we are using.
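The offset calculation can be checked on the host side in Rust (parameter names mirror CUDA's built-ins: block_idx for blockIdx.x, block_dim for blockDim.x, thread_idx for threadIdx.x).

```rust
// Host-side sketch of the kernel's global index calculation.
fn global_index(block_idx: u32, block_dim: u32, thread_idx: u32) -> u32 {
    thread_idx + block_idx * block_dim
}

fn main() {
    // With 256 threads per block, thread 3 of block 2 handles element 2*256 + 3 = 515.
    assert_eq!(global_index(2, 256, 3), 515);
    // Block 0 maps straight onto the first block_dim elements.
    assert_eq!(global_index(0, 256, 7), 7);
    println!("indexing OK");
}
```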

Also note that NVIDIA GeForce gaming GPUs don't natively support FP64 (64-bit floating-point types). Stay away from 64-bit floats unless you are using data-center GPUs.