- heterogeneous: CPU + GPU
CUDA C
- extends ANSI C with minimal new syntax
- Terminology: CPU = host, GPU = device
- CUDA C source can be mixture of host & device code
- device code functions: kernels
- grid of threads: many threads are launched to execute a kernel
- don’t be afraid of launching many threads
- e.g. one thread per (output) tensor element
- CPU & GPU code runs concurrently (overlapped)
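- A minimal sketch of this overlap (the fillIndices kernel, sizes, and block count below are illustrative placeholders, not from these notes):

#include <cstdio>
#include <cuda_runtime.h>

// illustrative kernel: each thread writes its global index
__global__ void fillIndices(int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = i;
}

int main() {
    int n = 1 << 20;
    int* out_d;
    cudaMalloc((void**)&out_d, n * sizeof(int));

    // the launch returns immediately; the GPU works while the CPU continues
    fillIndices<<<(n + 255) / 256, 256>>>(out_d, n);

    printf("CPU keeps running while the kernel executes\n");  // overlapped host work

    cudaDeviceSynchronize();  // block until the GPU has finished
    cudaFree(out_d);
    return 0;
}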
Example: Vector addition
- Naive GPU vector addition (host-side sketch at the end of this section):
- Allocate device memory for vectors
- Transfer inputs host → device
- Launch kernel and perform additions
- Transfer results device → host
- Free device memory
- Normally, we keep data on the GPU as long as possible so that many kernel launches can be scheduled asynchronously.
- For vector addition, we can calculate the array index of the thread:
int i = blockIdx.x * blockDim.x + threadIdx.x;
- select the block (blockIdx.x * blockDim.x) + position within the block (threadIdx.x)
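- e.g. with blockDim.x = 256: thread 5 of block 2 gets i = 2 * 256 + 5 = 517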
CUDA code
- General strategy: replace loop by grid of threads
- To keep in mind:
- Data sizes might not be perfectly divisible by block sizes: always check bounds
- Prevent threads of the boundary block from reading/writing outside allocated memory
// compute vector sum C = A + B
// each thread performs one pairwise addition
__global__
void vecAddKernel(float* A, float* B, float* C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {  // bounds check: the last block may extend past the end of the arrays
        C[i] = A[i] + B[i];
    }
}
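A host-side sketch of the five naive steps listed above (error checking omitted; the wrapper name vecAdd and the block size 256 are illustrative choices, not fixed by these notes):

#include <cuda_runtime.h>

void vecAdd(float* A_h, float* B_h, float* C_h, int n) {
    int size = n * sizeof(float);
    float *A_d, *B_d, *C_d;

    // 1. allocate device memory for the three vectors
    cudaMalloc((void**)&A_d, size);
    cudaMalloc((void**)&B_d, size);
    cudaMalloc((void**)&C_d, size);

    // 2. transfer inputs host → device
    cudaMemcpy(A_d, A_h, size, cudaMemcpyHostToDevice);
    cudaMemcpy(B_d, B_h, size, cudaMemcpyHostToDevice);

    // 3. launch enough blocks to cover all n elements (last block is bounds-checked in the kernel)
    vecAddKernel<<<(n + 255) / 256, 256>>>(A_d, B_d, C_d, n);

    // 4. transfer results device → host (implicitly waits for the kernel on the default stream)
    cudaMemcpy(C_h, C_d, size, cudaMemcpyDeviceToHost);

    // 5. free device memory
    cudaFree(A_d);
    cudaFree(B_d);
    cudaFree(C_d);
}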