Optimize Kernels

This guide provides practical techniques for optimizing Hybridizer-generated kernels for maximum performance.

Optimization Workflow

1. Launch Configuration

Block Size Selection

// Pick one block size; the alternatives below are shown commented out
int blockSize = 256; // Good default for most kernels

// Compute-heavy kernels: fewer threads leave more registers per thread
// int blockSize = 128;

// Memory-heavy kernels: more threads in flight to hide latency
// int blockSize = 512;
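Because the best block size depends on each kernel's register and memory behavior, it is often simplest to measure a few candidates directly. The host-side sketch below is illustrative only: it assumes the `wrapper` object used throughout this guide and a hypothetical `MyKernel` entry point.

int[] candidates = { 128, 256, 512 };
foreach (int bs in candidates)
{
    int gs = (N + bs - 1) / bs;
    wrapper.SetDistrib(gs, bs);
    wrapper.MyKernel(data, N); // warm-up launch

    var sw = System.Diagnostics.Stopwatch.StartNew();
    for (int rep = 0; rep < 10; rep++)
        wrapper.MyKernel(data, N);
    cuda.DeviceSynchronize(); // make sure the GPU finished before stopping the clock
    sw.Stop();

    Console.WriteLine($"blockSize={bs}: {sw.Elapsed.TotalMilliseconds / 10.0:F3} ms/launch");
}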

Grid Size Calculation

cudaDeviceProp prop;
cuda.GetDeviceProperties(out prop, 0);

// Ensure enough blocks to saturate the GPU;
// ~8 resident blocks per SM is a common heuristic
int minBlocks = prop.multiProcessorCount * 8;
int gridSize = Math.Max(
(N + blockSize - 1) / blockSize,
minBlocks
);

wrapper.SetDistrib(gridSize, blockSize);
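Putting both pieces together, a minimal end-to-end launch might look like the following sketch. The `VectorScale` kernel and the data setup are hypothetical; the grid-stride loop makes the kernel correct for any grid size, so the launch configuration can be tuned freely.

[EntryPoint]
public static void VectorScale(float[] data, float alpha, int N)
{
    // Grid-stride loop: each thread handles multiple elements if needed
    for (int i = threadIdx.x + blockIdx.x * blockDim.x;
         i < N;
         i += blockDim.x * gridDim.x)
    {
        data[i] *= alpha;
    }
}

// Host side
int N = 1 << 20;
int blockSize = 256;
cudaDeviceProp prop;
cuda.GetDeviceProperties(out prop, 0);
int gridSize = Math.Max((N + blockSize - 1) / blockSize,
                        prop.multiProcessorCount * 8);

wrapper.SetDistrib(gridSize, blockSize);
wrapper.VectorScale(data, 2.0f, N);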

2. Memory Access Optimization

Ensure Coalesced Access

Good (coalesced):

// Consecutive threads access consecutive addresses
int i = threadIdx.x + blockIdx.x * blockDim.x;
data[i] = value; // ✅ Coalesced

Bad (strided):

// Consecutive threads access addresses 'stride' elements apart
int i = (threadIdx.x + blockIdx.x * blockDim.x) * stride;
data[i] = value; // ❌ Not coalesced
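The same rule extends to 2D data: arrange indexing so that threadIdx.x walks the contiguous (fastest-varying) dimension. The sketch below assumes a hypothetical row-major array of size width × height.

int x = threadIdx.x + blockIdx.x * blockDim.x; // column
int y = threadIdx.y + blockIdx.y * blockDim.y; // row

// Row-major layout: x varies fastest, so neighboring threads
// write neighboring addresses
int idx = y * width + x;
data[idx] = value; // ✅ Coalesced

// Indexing as 'x * height + y' instead would put neighboring threads
// 'height' elements apart: not coalesced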

Structure of Arrays (SoA)

// Bad: Array of Structures (AoS)
public struct Particle { public float x, y, z; }
Particle[] particles; // ❌ Interleaved fields force strided access per component

// Good: Structure of Arrays (SoA)
float[] x, y, z; // ✅ Coalesced access per component
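Inside a kernel the difference looks like this; the two `Scale` kernels below are hypothetical, written only to contrast the two access patterns.

[EntryPoint]
public static void ScaleAoS(Particle[] particles, float s, int N)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < N)
        particles[i].x *= s; // Neighboring threads touch addresses 12 bytes apart
}

[EntryPoint]
public static void ScaleSoA(float[] x, float s, int N)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < N)
        x[i] *= s; // Neighboring threads touch consecutive floats: coalesced
}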

Use Shared Memory

[EntryPoint]
public static void WithSharedMemory(float[] input, float[] output, int N)
{
    // Declare shared memory
    HybridArray<float> shared = new HybridArray<float>(blockDim.x, HybridArrayFlags.Shared);

    int tid = threadIdx.x;
    int i = threadIdx.x + blockIdx.x * blockDim.x;

    // Load to shared memory
    if (i < N)
        shared[tid] = input[i];

    CUDAIntrinsics.SyncThreads();

    // Work with shared memory (fast access)
    // ...

    // Write result
    if (i < N)
        output[i] = shared[tid];
}
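As a concrete use of this pattern, the sketch below computes one partial sum per block, reusing the shared-memory declaration shown above. The `BlockSum` name is hypothetical, and the tree reduction assumes blockDim.x is a power of two.

[EntryPoint]
public static void BlockSum(float[] input, float[] blockSums, int N)
{
    HybridArray<float> shared = new HybridArray<float>(blockDim.x, HybridArrayFlags.Shared);

    int tid = threadIdx.x;
    int i = threadIdx.x + blockIdx.x * blockDim.x;

    // Load one element per thread (0 for threads past the end of the data)
    shared[tid] = (i < N) ? input[i] : 0.0f;
    CUDAIntrinsics.SyncThreads();

    // Tree reduction in shared memory
    for (int s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if (tid < s)
            shared[tid] += shared[tid + s];
        CUDAIntrinsics.SyncThreads();
    }

    // Thread 0 writes this block's partial sum
    if (tid == 0)
        blockSums[blockIdx.x] = shared[0];
}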

3. Reduce Divergence

Avoid Divergent Branches

Bad (divergent within warp):

if (threadIdx.x % 2 == 0)
    DoSomething(); // Odd threads in the same warp sit idle here
else
    DoOther();     // Even threads sit idle here

Better (uniform per warp):

if (blockIdx.x % 2 == 0)
    DoSomething(); // All threads in the block take the same path
else
    DoOther();
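When both branches are cheap, another option (not specific to Hybridizer; the names below are illustrative) is to replace the branch with a select, so every thread executes the same instruction stream:

// Branchy: threads within a warp may diverge on the comparison
if (data[i] > threshold)
    result[i] = data[i] * 2.0f;
else
    result[i] = data[i] * 0.5f;

// Branchless: the conditional expression typically compiles to a
// predicated select rather than a divergent branch
float factor = (data[i] > threshold) ? 2.0f : 0.5f;
result[i] = data[i] * factor;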

Loop Unrolling

// Manual unroll for a small, known trip count
[Kernel]
public static float Sum4(float[] data, int start)
{
    return data[start] + data[start + 1] + data[start + 2] + data[start + 3];
}
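For longer loops the same idea applies with an unroll factor; the sketch below (hypothetical `SumUnrolled` helper, factor 4) also uses independent accumulators to expose instruction-level parallelism.

[Kernel]
public static float SumUnrolled(float[] data, int start, int count)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    int i = start;

    // Main loop: 4 elements per iteration, 4 independent accumulators
    for (; i + 3 < start + count; i += 4)
    {
        s0 += data[i];
        s1 += data[i + 1];
        s2 += data[i + 2];
        s3 += data[i + 3];
    }

    // Remainder loop for counts not divisible by 4
    float sum = s0 + s1 + s2 + s3;
    for (; i < start + count; i++)
        sum += data[i];

    return sum;
}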

4. Use Fast Math

When precision permits:

// Standard precision (double-precision transcendentals are slow on most GPUs)
double precise = Math.Sin(x) * Math.Exp(y);

// Fast single-precision approximations (GPU intrinsics)
float fast = FastMath.sinf(x) * FastMath.expf(y);

5. Minimize Synchronization

// Bad: sync after every step
for (int step = 0; step < N; step++)
{
    shared[tid] = compute();      // compute() is a placeholder
    CUDAIntrinsics.SyncThreads(); // ❌ One barrier per step
}

// Better: batch work between syncs, when the algorithm's dependencies
// allow several steps without cross-thread visibility
for (int step = 0; step < N; step += 4)
{
    shared[tid] = compute4Steps(); // 4 steps of work per barrier
    CUDAIntrinsics.SyncThreads();  // ✅ 4x fewer barriers
}

Optimization Checklist

| Category | Check                | Impact |
|----------|----------------------|--------|
| Launch   | Block size 128-512   | Medium |
| Launch   | Grid saturates SMs   | High   |
| Memory   | Coalesced access     | High   |
| Memory   | Shared mem for reuse | Medium |
| Compute  | Minimize divergence  | Medium |
| Compute  | Fast math where ok   | Low    |
| Sync     | Minimize barriers    | Medium |

Profiling Commands

# Check occupancy
ncu --metrics sm__warps_active.avg.pct_of_peak_sustained_active ./app

# Check memory efficiency
ncu --metrics l1tex__t_bytes_pipe_lsu_mem_global_op_ld.sum.per_second ./app

# Full analysis
ncu --set full -o profile ./app

Next Steps