Shared Memory & Reduction
The reduction — computing a single value from a large array (sum, max, min, …) — is one of the most instructive GPU patterns. It teaches the three GPU-specific concepts you'll use throughout your Hybridizer work:
- Shared memory — fast per-block storage
- Thread synchronization — __syncthreads
- Atomic operations — safe cross-block communication
Full sample:
1.Simple/Reduction
The Problem
Sum 32 million integers. On CPU:
int sum = 0;
for (int i = 0; i < N; i++)
    sum += a[i];
Simple, but sequential. On GPU, we need to divide and conquer.
The Strategy
Step 1: Each thread sums its portion of the array
Step 2: Threads in the same block combine their results (shared memory)
Step 3: Each block writes its result to global memory (atomic)
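The same three-step idea can be sketched on CPU with plain .NET, using Parallel.For with thread-local partial sums (a hedged illustration using standard System.Threading.Tasks APIs, not Hybridizer):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class PartialSumDemo
{
    static void Main()
    {
        int N = 1 << 20;
        int[] a = new int[N];
        for (int i = 0; i < N; i++) a[i] = 1;

        int sum = 0;
        // Each worker accumulates a private partial sum (like step 1),
        // then the partials are combined with an atomic add (like step 3).
        Parallel.For(0, N,
            () => 0,                                  // thread-local accumulator
            (i, state, local) => local + a[i],        // private accumulation
            local => Interlocked.Add(ref sum, local)  // atomic combine
        );

        Console.WriteLine(sum); // 1048576
    }
}
```

This mirrors steps 1 and 3 of the strategy; step 2 (shared memory) has no direct CPU analogue, which is why the GPU kernel below is more involved.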
The Kernel
[EntryPoint]
public static void ReduceAdd(int N, [In] int[] a, [Out] int[] result)
{
    // 1️⃣ Allocate shared memory for this block
    var cache = new SharedMemoryAllocator<int>().allocate(blockDim.x);

    int tid = threadIdx.x + blockDim.x * blockIdx.x;
    int cacheIndex = threadIdx.x;

    // 2️⃣ Each thread accumulates its portion
    int tmp = 0;
    while (tid < N)
    {
        tmp += a[tid];
        tid += blockDim.x * gridDim.x; // grid-stride
    }

    // Store in shared memory
    cache[cacheIndex] = tmp;

    // 3️⃣ Synchronize — wait for ALL threads in this block
    CUDAIntrinsics.__syncthreads();

    // 4️⃣ Tree reduction within shared memory
    int i = blockDim.x / 2;
    while (i != 0)
    {
        if (cacheIndex < i)
        {
            cache[cacheIndex] += cache[cacheIndex + i];
        }
        CUDAIntrinsics.__syncthreads();
        i >>= 1;
    }

    // 5️⃣ Thread 0 writes this block's result
    if (cacheIndex == 0)
    {
        Interlocked.Add(ref result[0], cache[0]);
    }
}
Let's understand each part.
Part 1: Shared Memory
var cache = new SharedMemoryAllocator<int>().allocate(blockDim.x);
What is shared memory?
- A small (48-96 KB), very fast memory shared by all threads in a block
- ~100× faster than global memory
- Think of it as a programmer-controlled L1 cache
| Memory Type | Speed | Scope | Size |
|---|---|---|---|
| Registers | Fastest | One thread | ~255 per thread |
| Shared | Very fast | One block | 48-96 KB |
| Global | Slow | All threads | GBs |
On CPU (.NET), SharedMemoryAllocator<T> falls back to a regular array.
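Conceptually, the CPU fallback could look something like the following — a hypothetical sketch, not Hybridizer's actual source:

```csharp
// Hypothetical sketch of the CPU-side behavior — NOT Hybridizer's real implementation.
// On CPU there is no per-block scratchpad, so "shared memory" is just a managed array.
public class SharedMemoryAllocatorSketch<T>
{
    public T[] allocate(int count)
    {
        return new T[count]; // on GPU this maps to a __shared__ buffer instead
    }
}
```

The point is that the same kernel source runs on both targets: the indexing logic is identical, only the backing storage changes.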
Part 2: Grid-Stride Accumulation
int tmp = 0;
while (tid < N)
{
    tmp += a[tid];
    tid += blockDim.x * gridDim.x;
}
cache[cacheIndex] = tmp;
Each thread walks the array with a stride equal to the total thread count. With 256 threads × 128 blocks = 32K threads:
- Thread 0 sums elements 0, 32768, 65536, ...
- Thread 1 sums elements 1, 32769, 65537, ...
Each thread's result goes into shared memory.
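To make the stride pattern concrete, this small standalone snippet (hypothetical helper, no GPU required) prints the global indices one thread visits:

```csharp
using System;
using System.Collections.Generic;

class GridStrideDemo
{
    // Returns the global indices a given thread visits with a grid-stride loop.
    static List<int> IndicesFor(int threadIdx, int blockIdx, int blockDim, int gridDim, int N)
    {
        var indices = new List<int>();
        int tid = threadIdx + blockDim * blockIdx;
        while (tid < N)
        {
            indices.Add(tid);
            tid += blockDim * gridDim; // stride = total thread count
        }
        return indices;
    }

    static void Main()
    {
        // 256 threads/block × 128 blocks = 32768 threads, as in the text.
        var idx = IndicesFor(threadIdx: 0, blockIdx: 0, blockDim: 256, gridDim: 128, N: 100_000);
        Console.WriteLine(string.Join(", ", idx)); // 0, 32768, 65536, 98304
    }
}
```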
Part 3: __syncthreads
CUDAIntrinsics.__syncthreads();
This is a barrier. It means: "all 256 threads in this block must reach this line before any can continue."
Without it, Thread 0 might start reading cache[128] before Thread 128 has written to it → data race.
Every thread in the block must call __syncthreads(). Putting it inside an if block where some threads skip it causes a deadlock.
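For example, this pattern (an illustrative kernel fragment, not runnable on its own) hangs on GPU, because threads with cacheIndex >= 128 never reach the barrier the other threads are waiting at:

```csharp
// BROKEN — do not do this.
if (cacheIndex < 128)
{
    cache[cacheIndex] += cache[cacheIndex + 128];
    CUDAIntrinsics.__syncthreads(); // threads 128-255 never get here → deadlock
}

// CORRECT — the barrier is outside the conditional, so every thread reaches it:
if (cacheIndex < 128)
{
    cache[cacheIndex] += cache[cacheIndex + 128];
}
CUDAIntrinsics.__syncthreads();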
Part 4: Tree Reduction
int i = blockDim.x / 2; // i = 128
while (i != 0)
{
    if (cacheIndex < i)
        cache[cacheIndex] += cache[cacheIndex + i];
    CUDAIntrinsics.__syncthreads();
    i >>= 1; // i = 64, 32, 16, 8, 4, 2, 1
}
This combines 256 values into 1 in just 8 steps (log₂ 256):
Step 1: 256 values → 128 (threads 0-127 each add their partner)
Step 2: 128 values → 64
Step 3: 64 → 32
Step 4: 32 → 16
Step 5: 16 → 8
Step 6: 8 → 4
Step 7: 4 → 2
Step 8: 2 → 1 ← final result in cache[0]
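The tree reduction is easy to verify on CPU with an ordinary array standing in for shared memory (a plain single-threaded simulation, so no barriers are needed):

```csharp
using System;
using System.Linq;

class TreeReductionDemo
{
    static void Main()
    {
        const int blockDim = 256;
        int[] cache = new int[blockDim];
        var rng = new Random(42);
        for (int t = 0; t < blockDim; t++) cache[t] = rng.Next(100);
        int expected = cache.Sum();

        // Same loop as the kernel, but iterating over "threads" sequentially.
        int i = blockDim / 2;
        while (i != 0)
        {
            for (int cacheIndex = 0; cacheIndex < i; cacheIndex++)
                cache[cacheIndex] += cache[cacheIndex + i];
            // (on GPU, __syncthreads() would go here)
            i >>= 1;
        }

        Console.WriteLine(cache[0] == expected); // True
    }
}
```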
Part 5: Atomic Add
if (cacheIndex == 0)
    Interlocked.Add(ref result[0], cache[0]);
Each block produces one partial sum. Interlocked.Add safely accumulates them:
- Maps to atomicAdd on GPU
- Thread-safe (hardware-guaranteed)
- Standard .NET API (System.Threading.Interlocked)
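On the CPU side, the same Interlocked.Add call behaves identically. This minimal .NET demonstration (plain threads, no Hybridizer) shows why the atomic matters:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class AtomicDemo
{
    static void Main()
    {
        int unsafeSum = 0, safeSum = 0;

        Parallel.For(0, 1_000_000, _ =>
        {
            unsafeSum += 1;                  // data race: lost updates likely
            Interlocked.Add(ref safeSum, 1); // atomic: always correct
        });

        Console.WriteLine(safeSum); // 1000000
        Console.WriteLine(unsafeSum <= 1_000_000); // True — and typically well below 1000000
    }
}
```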
Launching the Kernel
Shared memory size must be declared explicitly:
const int BLOCK_DIM = 256;
cuda.GetDeviceProperties(out cudaDeviceProp prop, 0);

dynamic wrapper = HybRunner.Cuda()
    .SetDistrib(
        16 * prop.multiProcessorCount, 1, // grid: many blocks
        BLOCK_DIM, 1, 1,                  // block: 256 threads
        BLOCK_DIM * sizeof(int)           // shared memory: 1 KB
    );

int[] result = new int[1];
wrapper.ReduceAdd(N, a, result);
cuda.DeviceSynchronize();
int[] result = new int[1];
wrapper.ReduceAdd(N, a, result);
cuda.DeviceSynchronize();
The last parameter (BLOCK_DIM * sizeof(int)) tells CUDA how much shared memory to allocate per block.
Verification
int expected = a.Aggregate((x, y) => x + y);
Console.WriteLine($"GPU: {result[0]}");
Console.WriteLine($"Expected: {expected}");
Recap
| Concept | Hybridizer API | CUDA Equivalent | Purpose |
|---|---|---|---|
| Shared memory | SharedMemoryAllocator<T> | __shared__ | Fast per-block storage |
| Barrier sync | CUDAIntrinsics.__syncthreads() | __syncthreads() | Wait for all threads |
| Atomic ops | Interlocked.Add | atomicAdd | Safe cross-block writes |
What's Next
Congratulations — you've completed the tutorial series! You now understand:
- ✅ Installation and setup
- ✅ Writing kernels with [EntryPoint]
- ✅ Debugging, [In]/[Out], profiling
- ✅ Porting CPU code to GPU
- ✅ 2D image processing
- ✅ Shared memory and cooperative patterns
Where to go from here:
- Examples — Browse 9 complete working samples
- Generics & Performance — Write reusable, zero-cost GPU code
- Optimization Guide — Squeeze every drop of performance
- Black-Scholes — A real finance application