Lambda Reduction

Sample source: 6.Advanced/LambdaReduction

This example implements a reduction using lambdas/delegates instead of generics. It demonstrates how Hybridizer handles C# functional programming constructs — and the performance tradeoffs involved.

Inner Reduction with Func

[Kernel]
public static void InnerReduce(
    [Out] float[] result, [In] float[] input, int N,
    float neutral, Func<float, float, float> reductor)
{
    var cache = new SharedMemoryAllocator<float>().allocate(blockDim.x);
    int tid = threadIdx.x + blockDim.x * blockIdx.x;
    int cacheIndex = threadIdx.x;

    float tmp = neutral;
    while (tid < N)
    {
        tmp = reductor(tmp, input[tid]);
        tid += blockDim.x * gridDim.x;
    }

    cache[cacheIndex] = tmp;
    CUDAIntrinsics.__syncthreads();

    // Tree reduction in shared memory; the halving loop assumes blockDim.x is a power of two.
    int i = blockDim.x / 2;
    while (i != 0)
    {
        if (cacheIndex < i)
            cache[cacheIndex] = reductor(cache[cacheIndex], cache[cacheIndex + i]);
        CUDAIntrinsics.__syncthreads();
        i >>= 1;
    }

    if (cacheIndex == 0)
        AtomicExpr.apply(ref result[0], cache[0], reductor);
}
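
To make the two phases of the kernel easier to follow, here is a minimal CPU-only sketch that models a single block sequentially. It is not part of the sample: `BlockReduceSketch` and its `Reduce` method are hypothetical names, the "threads" are simulated by ordinary loops, and the final atomic merge across blocks is left out because only one block is modeled.

```csharp
using System;

public static class BlockReduceSketch
{
    // Sequential CPU model of the kernel for one block of `blockSize` threads.
    // Assumption: blockSize is a power of two, as the halving loop requires.
    public static float Reduce(float[] input, int blockSize, float neutral,
                               Func<float, float, float> reductor)
    {
        var cache = new float[blockSize];

        // Phase 1: each simulated thread folds a strided slice of the input.
        for (int t = 0; t < blockSize; ++t)
        {
            float tmp = neutral;
            for (int i = t; i < input.Length; i += blockSize)
                tmp = reductor(tmp, input[i]);
            cache[t] = tmp;
        }

        // Phase 2: tree reduction; each pass halves the number of active slots.
        for (int half = blockSize / 2; half != 0; half >>= 1)
            for (int t = 0; t < half; ++t)
                cache[t] = reductor(cache[t], cache[t + half]);

        return cache[0];
    }
}
```

Calling `BlockReduceSketch.Reduce(data, 4, 0.0f, (x, y) => x + y)` sums `data`, and passing `float.MinValue` with `Math.Max` gives the maximum, mirroring the neutral elements used by the entry points.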

Entry Points with Lambdas

[EntryPoint]
public static void ReduceAdd(float[] result, float[] input, int N)
{
    InnerReduce(result, input, N, 0.0f, (x, y) => x + y);
}

[EntryPoint]
public static void ReduceMax(float[] result, float[] input, int N)
{
    InnerReduce(result, input, N, float.MinValue, (x, y) => Math.Max(x, y));
}

This is elegant and concise, but every call to reductor goes through a delegate, and that has a performance cost.

Performance Comparison

Approach	Bandwidth	% of Peak	Code Complexity
Plain code	328 GB/s	92%	High (copy-paste)
Generics	328 GB/s	92%	Medium
Lambda (optimized)	255 GB/s	72%	Low
Virtual functions	154 GB/s	43%	Medium
Lambda (naïve)	59 GB/s	17%	Low
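
The table does not state the peak bandwidth or how bandwidth was measured. As a rough sketch, assuming the usual convention that a reduction reads each input float exactly once, the figures can be related as follows. The peak value is inferred here from the first row (328 / 0.92, about 356.5 GB/s) and is an assumption, not a number from the sample; `BandwidthSketch` is a hypothetical helper.

```csharp
using System;

public static class BandwidthSketch
{
    // Effective bandwidth in GB/s for a kernel that reads `elementCount` floats once.
    public static double EffectiveGBps(long elementCount, double seconds)
    {
        double bytes = elementCount * (double)sizeof(float);
        return bytes / seconds / 1e9;
    }

    // Share of peak bandwidth, as a percentage.
    public static double PercentOfPeak(double measuredGBps, double peakGBps)
        => 100.0 * measuredGBps / peakGBps;
}
```

With the inferred peak, `BandwidthSketch.PercentOfPeak(255, 328 / 0.92)` rounds to 72, matching the "Lambda (optimized)" row, and the other rows round to 43 and 17 in the same way.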

Warning

Lambda/delegate calls generally cannot be inlined on the GPU, because the call target is a function pointer that is not known at compile time. Every reduction step therefore pays for an indirect call.

Optimization: Cache the Lambda

A critical optimization is to copy the delegate into a local variable before using it:

[Kernel]
public void Reduce(int N, float[] a, float[] result)
{
    // localReductor is a member of the enclosing class (declaration not shown).
    // Copy it into a local once; this is the key optimization!
    Func<float, float, float> f = localReductor;

    var cache = new SharedMemoryAllocator<float>().allocate(blockDim.x);
    // ... use f instead of localReductor
}

This allows nvcc to optimize the indirect call. Without this trick, performance drops from 255 GB/s to 59 GB/s.
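
The shape of this pattern can be shown in plain C#. The following is a CPU-only sketch with hypothetical names (`CachedReductorSketch`, its constructor and its `neutral` field); it illustrates where the local copy goes and makes no claim about reproducing the GPU bandwidth figures.

```csharp
using System;

public class CachedReductorSketch
{
    // Hypothetical field holding the delegate; it plays the role of localReductor.
    private readonly Func<float, float, float> localReductor;
    private readonly float neutral;

    public CachedReductorSketch(Func<float, float, float> reductor, float neutral)
    {
        localReductor = reductor;
        this.neutral = neutral;
    }

    public float Reduce(float[] a)
    {
        // Read the field once; the loop below only touches the local copy.
        Func<float, float, float> f = localReductor;

        float acc = neutral;
        for (int i = 0; i < a.Length; ++i)
            acc = f(acc, a[i]);
        return acc;
    }
}
```

For example, `new CachedReductorSketch((x, y) => x + y, 0.0f).Reduce(data)` sums `data` while dereferencing the field only once per call.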

When to Use What

Need	Recommended Approach	Performance (% of peak)
Maximum performance	Generics (`[HybridTemplateConcept]`)	~92%
Good performance + concise code	Lambda (with local variable trick)	~72%
Expressiveness	Virtual functions	~43%
Quick prototyping	Lambda (naïve)	~17%
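
For contrast with the delegate version, the generic approach fixes the operation at compile time through a type parameter. The sketch below is plain C# that runs on the CPU; it deliberately omits the Hybridizer attributes, and `IReductor`, `AddReductor`, `MaxReductor` and `GenericReduceSketch` are hypothetical names rather than the ones used in the Generic Reduction sample.

```csharp
using System;

public interface IReductor
{
    float Neutral { get; }
    float Apply(float x, float y);
}

public struct AddReductor : IReductor
{
    public float Neutral => 0.0f;
    public float Apply(float x, float y) => x + y;
}

public struct MaxReductor : IReductor
{
    public float Neutral => float.MinValue;
    public float Apply(float x, float y) => Math.Max(x, y);
}

public static class GenericReduceSketch
{
    // The operation is chosen by the type argument, so the call target is
    // known statically instead of being read from a delegate at run time.
    public static float Reduce<TReductor>(float[] input)
        where TReductor : struct, IReductor
    {
        var r = default(TReductor);
        float acc = r.Neutral;
        foreach (float v in input)
            acc = r.Apply(acc, v);
        return acc;
    }
}
```

`GenericReduceSketch.Reduce<AddReductor>(data)` and `GenericReduceSketch.Reduce<MaxReductor>(data)` then correspond to the ReduceAdd and ReduceMax entry points, at the cost of one small type per operation.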

Next Steps

Generic Reduction — Full-performance alternative
Generics, Virtuals, Delegates — Detailed guide
Reduction — Simpler non-generic version
