Invoke Generated Code

This chapter explains how to load and invoke Hybridizer-generated kernels from your host application.

Using HybRunner

The HybRunner class is the primary interface for invoking generated code:

using Hybridizer.Runtime.CUDAImports;

// Create a runner for the CUDA flavor
HybRunner runner = HybRunner.Cuda();

// Configure launch parameters
runner.SetDistrib(gridSize, blockSize);

// Wrap an instance of the type that declares the kernel, then invoke it
dynamic wrapper = runner.Wrap(new MyProgram());
wrapper.MyKernel(param1, param2, param3);

Launch Configuration

Grid and Block Dimensions

CUDA kernels require a launch configuration specifying how many blocks and threads to use:

// Get device properties for optimal configuration
cudaDeviceProp prop;
cuda.GetDeviceProperties(out prop, 0);

// Common patterns
int blockSize = 256; // Threads per block
int gridSize = prop.multiProcessorCount * 16; // Blocks

wrapper.SetDistrib(gridSize, blockSize);
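When the kernel maps one thread per element, another common approach (standard CUDA arithmetic, not specific to Hybridizer) is to derive the grid size from the problem size with a ceiling division. `LaunchConfig` below is a hypothetical helper, not part of the Hybridizer API:

```csharp
using System;

static class LaunchConfig
{
    // Smallest gridSize such that gridSize * blockSize >= n,
    // so every element is covered by at least one thread.
    public static int GridSizeFor(int n, int blockSize)
    {
        if (n <= 0 || blockSize <= 0)
            throw new ArgumentOutOfRangeException();
        return (n + blockSize - 1) / blockSize;
    }
}
```

For example, 1,000,000 elements at 256 threads per block yields 3,907 blocks; the kernel body should still guard against the few extra threads in the last block.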

Configuration Guidelines

| Workload Size | Recommended Configuration |
| --- | --- |
| Small (< 10K) | 32 blocks × 128 threads |
| Medium (10K–1M) | (SM count × 8) blocks × 256 threads |
| Large (> 1M) | (SM count × 16) blocks × 256 threads |
> **Tip:** Use enough threads to saturate the GPU. A good rule of thumb is at least 10,000 threads in flight on a modern GPU.

Stream Management

Streams allow overlapping computation and data transfer:

// Create a stream
cudaStream_t stream;
cuda.StreamCreate(out stream);

// Associate stream with runner
wrapper.SetStream(stream);

// Kernel runs asynchronously
wrapper.MyKernel(a, b, c);

// Do other work on CPU...

// Wait for completion
cuda.StreamSynchronize(stream);

// Cleanup
cuda.StreamDestroy(stream);

Multiple Streams

cudaStream_t[] streams = new cudaStream_t[4];
for (int i = 0; i < 4; i++)
{
    cuda.StreamCreate(out streams[i]);
}

// Launch kernels on different streams
for (int i = 0; i < 4; i++)
{
    wrapper.SetStream(streams[i]);
    wrapper.ProcessChunk(data, i * chunkSize, chunkSize);
}

// Wait for all
for (int i = 0; i < 4; i++)
{
    cuda.StreamSynchronize(streams[i]);
}
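The loop above assumes the data divides evenly into four chunks. When it does not, per-stream offsets and lengths can be computed with a small helper (`Chunking` is illustrative, not part of the Hybridizer API):

```csharp
using System;

static class Chunking
{
    // Split n items into 'parts' contiguous (offset, length) ranges,
    // spreading any remainder over the first ranges.
    public static (int offset, int length)[] Split(int n, int parts)
    {
        var ranges = new (int, int)[parts];
        int baseLen = n / parts, rem = n % parts, offset = 0;
        for (int i = 0; i < parts; i++)
        {
            int len = baseLen + (i < rem ? 1 : 0);
            ranges[i] = (offset, len);
            offset += len;
        }
        return ranges;
    }
}
```

Each `(offset, length)` pair can then be passed to one stream's kernel launch in place of the fixed `i * chunkSize, chunkSize` arguments.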

Error Handling

Always check for errors after kernel invocation:

// Launch kernel
wrapper.MyKernel(a, b, n);

// Check for launch errors
cudaError_t launchError = cuda.GetLastError();
if (launchError != cudaError_t.cudaSuccess)
{
    Console.WriteLine($"Launch error: {cuda.GetErrorString(launchError)}");
}

// Synchronize and check for execution errors
cudaError_t syncError = cuda.DeviceSynchronize();
if (syncError != cudaError_t.cudaSuccess)
{
    Console.WriteLine($"Execution error: {cuda.GetErrorString(syncError)}");
}
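Repeating this pair of checks after every launch gets verbose. One option is to centralize the pattern in a small helper that throws instead of printing; `CudaCheck` is a sketch, not part of the Hybridizer runtime:

```csharp
using System;
using Hybridizer.Runtime.CUDAImports;

static class CudaCheck
{
    // Throw a descriptive exception if a CUDA call did not succeed.
    // Uses the enum name of the status code in the message.
    public static void Throw(cudaError_t status, string context)
    {
        if (status != cudaError_t.cudaSuccess)
            throw new InvalidOperationException($"{context}: {status}");
    }
}

// Usage after a launch:
// wrapper.MyKernel(a, b, n);
// CudaCheck.Throw(cuda.GetLastError(), "MyKernel launch");
// CudaCheck.Throw(cuda.DeviceSynchronize(), "MyKernel execution");
```

Throwing keeps a failed launch from silently corrupting later results, at the cost of requiring a `try`/`catch` wherever recovery is possible.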

Common Errors

| Error | Cause | Solution |
| --- | --- | --- |
| `cudaErrorInvalidConfiguration` | Bad grid/block size | Check dimensions |
| `cudaErrorLaunchOutOfResources` | Too many threads | Reduce block size |
| `cudaErrorIllegalAddress` | Out-of-bounds access | Check array indices |

Synchronization Points

When to Synchronize

| Scenario | Synchronization |
| --- | --- |
| Reading results back on the host | Required |
| Between dependent kernels | Usually automatic (same-stream launches execute in order) |
| Timing measurements | Required |
| Error checking | Recommended |
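The "timing measurements" row matters because a kernel launch returns before the GPU finishes; without a synchronize, a host-side timer measures only launch overhead. A sketch of wall-clock timing, assuming the `cuda` bindings shown earlier (`KernelTimer` itself is hypothetical):

```csharp
using System;
using System.Diagnostics;
using Hybridizer.Runtime.CUDAImports;

static class KernelTimer
{
    // Measure wall-clock time of a kernel launch. The synchronize is what
    // makes the measurement meaningful: the launch call returns immediately.
    public static double TimeMs(Action launch)
    {
        var sw = Stopwatch.StartNew();
        launch();                     // e.g. () => wrapper.MyKernel(a, b, n)
        cuda.DeviceSynchronize();     // wait for the GPU to finish
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds;
    }
}
```

For fine-grained, per-stream timing, CUDA events are the more precise tool; the approach above is adequate for whole-kernel measurements.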

OMP and AVX Invocation

For non-CUDA backends, the invocation is similar:

// OMP backend: distribute across CPU cores
HybRunner ompRunner = HybRunner.OMP();
ompRunner.SetDistrib(Environment.ProcessorCount, 1);
dynamic ompWrapper = ompRunner.Wrap(new MyProgram());
ompWrapper.MyKernel(a, b, c);

// AVX backend
dynamic avxWrapper = HybRunner.AVX().Wrap(new MyProgram());
avxWrapper.MyKernel(a, b, c);
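Because the flavors share the same invocation shape, the backend can be chosen at runtime, for example from configuration. A minimal sketch, assuming the `HybRunner` factory methods shown above (`CreateRunner` is a hypothetical helper):

```csharp
using System;
using Hybridizer.Runtime.CUDAImports;

static class Runners
{
    // Map a configuration string to a HybRunner flavor.
    public static HybRunner CreateRunner(string flavor)
    {
        switch (flavor)
        {
            case "cuda": return HybRunner.Cuda();
            case "omp":  return HybRunner.OMP();
            case "avx":  return HybRunner.AVX();
            default:
                throw new ArgumentException($"Unknown flavor: {flavor}");
        }
    }
}
```

This keeps the rest of the host code backend-agnostic: only the one call site that creates the runner knows which flavor is in use.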

Next Steps