FAQ & Troubleshooting

Build Errors

"CUDA toolkit not found"

Cause: nvcc is not on your PATH.

Fix:

Verify CUDA is installed: check C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.x\
Add the bin directory to your system PATH
Restart Visual Studio

```
nvcc --version   # Should print CUDA version
```

"Hybridizer satellite DLL not found"

Cause: The build step that generates the CUDA DLL didn't run.

Fix:

Ensure the Hybridizer NuGet packages are installed
Do a full Rebuild (not just Build)
Check bin/Debug/ or bin/Release/ for *_CUDA.dll

"Unsupported IL pattern"

Cause: Your kernel code uses a C# feature not supported on GPU.

Fix: Check the Known Limitations. Common culprits:

foreach → use for loop
new MyClass() → use structs or pass from host
string operations → remove from kernel
Exception handling → remove try/catch
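
As an illustration of the rewrites above, the sketch below shows a kernel-style method that sticks to an indexed for loop and a struct. The names Particle, KernelStyle, and Advance are hypothetical, and the Hybridizer attributes are deliberately left out so the snippet compiles as plain C#.

```csharp
using System;

// Hypothetical value type, used instead of a class allocated inside the kernel.
public struct Particle
{
    public float Position;
    public float Velocity;
}

public static class KernelStyle
{
    // Instead of: foreach (var p in particles) { ... }
    // Indexed for loop, no allocation, no strings, no try/catch.
    public static void Advance(Particle[] particles, int n, float dt)
    {
        for (int i = 0; i < n; ++i)
        {
            // Array elements of a struct type can be updated in place.
            particles[i].Position += particles[i].Velocity * dt;
        }
    }
}
```

The array is created and filled on the host, which matches the "pass from host" advice: the method itself never calls new.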

Runtime Errors

"No CUDA-capable device detected"

```
nvidia-smi    # Check if GPU is detected
```

If GPU appears: Update drivers from nvidia.com/drivers
If GPU doesn't appear: Check hardware (PCIe seating, power connector)

"Out of memory" on GPU

Cause: Arrays too large for GPU memory.

Fix:

Check GPU memory with nvidia-smi
Reduce array size or process in chunks
Use [In]/[Out] to reduce concurrent allocations
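
Processing in chunks comes down to walking the array in fixed-size ranges so that only one range needs to be on the GPU at a time. A minimal sketch of the range arithmetic follows. Chunker, Split, and the chunk size are hypothetical, and the per-chunk kernel call is left out because its shape depends on your kernel.

```csharp
using System;
using System.Collections.Generic;

public static class Chunker
{
    // Yields (offset, count) ranges that cover [0, total) in order.
    // The last range is shorter when total is not a multiple of chunkSize.
    public static IEnumerable<(int Offset, int Count)> Split(int total, int chunkSize)
    {
        if (total < 0) throw new ArgumentOutOfRangeException(nameof(total));
        if (chunkSize <= 0) throw new ArgumentOutOfRangeException(nameof(chunkSize));

        int offset = 0;
        while (offset < total)
        {
            int count = Math.Min(chunkSize, total - offset);
            yield return (offset, count);
            offset += count; // never exceeds total, so it cannot overflow
        }
    }
}
```

On the host you would iterate with `foreach (var (offset, count) in Chunker.Split(data.Length, chunkSize))`, copy that slice into a buffer of at most chunkSize elements, and run the kernel on the buffer. Pick chunkSize from the free memory that nvidia-smi reports.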

"Invalid device function" or "Launch failed"

Cause: Compiled for wrong GPU architecture.

Fix: Ensure your CUDA toolkit is recent enough for your GPU's compute capability:

RTX 30xx → compute capability 8.6 → CUDA 11.1 or later
RTX 40xx → compute capability 8.9 → CUDA 11.8 or later

Wrong Results

GPU result differs from CPU

Checklist:

Did you call cuda.DeviceSynchronize()? Without it, results may not have been copied back yet.
Check the [In]/[Out] attributes: a wrong copy direction yields wrong data.
Floating-point precision: the GPU may execute operations in a different order, so results can differ slightly. Use a tolerance-based comparison:
```
if (Math.Abs(gpu - cpu) > 1e-5f) // Not: if (gpu != cpu)
```
Race condition? Test with the OMP backend (HybRunner.OMP()). If OMP works but CUDA doesn't, you likely have a parallelization bug.
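
The tolerance check above extends naturally to whole arrays. The sketch below is one way to do it, assuming float results. ResultCheck, FirstMismatch, and the default tolerances are hypothetical choices. The relative term is there because a fixed absolute tolerance such as 1e-5f is too strict for large values.

```csharp
using System;

public static class ResultCheck
{
    // Returns the index of the first element where gpu and cpu differ by more
    // than absTol + relTol * |cpu|, or -1 if all elements agree.
    public static int FirstMismatch(float[] gpu, float[] cpu,
                                    float absTol = 1e-5f, float relTol = 1e-5f)
    {
        if (gpu.Length != cpu.Length)
            throw new ArgumentException("Arrays must have the same length.");

        for (int i = 0; i < cpu.Length; ++i)
        {
            float diff = Math.Abs(gpu[i] - cpu[i]);
            float limit = absTol + relTol * Math.Abs(cpu[i]);
            // The negated comparison also flags NaN, which fails every <= test.
            if (!(diff <= limit)) return i;
        }
        return -1;
    }
}
```

Knowing the index of the first mismatch is often more useful than a pass/fail flag: a mismatch at a block boundary, for example, points at indexing rather than precision.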

First kernel call returns zeros

The first call may include CUDA context initialization. Try:

```
wrapper.MyKernel(args);           // Warmup
cuda.DeviceSynchronize();
// Now the real call
wrapper.MyKernel(args);
cuda.DeviceSynchronize();
```

Performance

GPU is slower than CPU

This is normal for:

Small arrays (< 10K elements): Transfer overhead dominates
First call: CUDA context initialization takes ~200ms
Already memory-bound code: If the CPU version already runs at peak memory bandwidth, moving it to the GPU may not help
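
To see why transfer overhead dominates for small arrays, it helps to put rough numbers on it. The sketch below is a back-of-the-envelope model, not a measurement. TransferModel and EstimateMs are hypothetical names, and the fixed per-call overhead and the bandwidth are inputs you would replace with values measured on your own system.

```csharp
public static class TransferModel
{
    // Simple linear model: time = fixed overhead + bytes / bandwidth.
    // overheadMs and bandwidthGBps are assumptions supplied by the caller.
    public static double EstimateMs(long bytes, double overheadMs, double bandwidthGBps)
    {
        double seconds = bytes / (bandwidthGBps * 1e9);
        return overheadMs + seconds * 1000.0;
    }
}
```

For example, 10,000 floats are 40,000 bytes. At an assumed 10 GB/s the copy itself takes about 0.004 ms, so with any fixed overhead of a similar or larger size the array is too small for the bandwidth to matter.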

Optimization checklist:

Use [In]/[Out] attributes
Increase problem size (>100K elements)
Use [IntrinsicFunction] for math operations
Profile with nsys profile or ncu

How to measure kernel time accurately

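```
// Requires: using System; using System.Diagnostics;

// Warmup: absorbs one-time costs such as CUDA context initialization
wrapper.MyKernel(args);
cuda.DeviceSynchronize();

// Measure
var sw = Stopwatch.StartNew();
wrapper.MyKernel(args);
cuda.DeviceSynchronize();    // Include sync in timing!
sw.Stop();

// Elapsed.TotalMilliseconds keeps sub-millisecond resolution;
// ElapsedMilliseconds truncates to whole milliseconds.
Console.WriteLine($"Kernel: {sw.Elapsed.TotalMilliseconds:F3} ms");
```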
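
A single timed call can be noisy. One common refinement, sketched below, is to repeat the measurement and report the median. KernelTimer and MedianMs are hypothetical names, and the Action passed in is assumed to perform both the kernel call and the synchronization.

```csharp
using System;
using System.Diagnostics;

public static class KernelTimer
{
    // Runs `work` once as warmup, then `runs` timed repetitions,
    // and returns the median wall-clock time in milliseconds.
    public static double MedianMs(Action work, int runs = 10)
    {
        if (work == null) throw new ArgumentNullException(nameof(work));
        if (runs <= 0) throw new ArgumentOutOfRangeException(nameof(runs));

        work(); // warmup, not timed

        var samples = new double[runs];
        for (int i = 0; i < runs; ++i)
        {
            var sw = Stopwatch.StartNew();
            work();
            sw.Stop();
            samples[i] = sw.Elapsed.TotalMilliseconds;
        }

        Array.Sort(samples);
        int mid = runs / 2;
        // Odd count: middle sample. Even count: mean of the two middle samples.
        return runs % 2 == 1 ? samples[mid] : 0.5 * (samples[mid - 1] + samples[mid]);
    }
}
```

With the names used on this page, a call would look like `KernelTimer.MedianMs(() => { wrapper.MyKernel(args); cuda.DeviceSynchronize(); })`. The median is used instead of the mean so that one slow outlier does not skew the result.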

Frequently Asked Questions

Can I use Hybridizer without a GPU?

Yes! Use the OMP backend: HybRunner.OMP(). It runs generated code on CPU with OpenMP threads.

What C# features are supported?

Most of C# works: classes (as host types), structs, arrays, generics, interfaces, static methods. See Known Limitations for what's not supported in kernel code.

How does Hybridizer compare to writing CUDA C++ directly?

Hybridizer typically achieves 90-98% of hand-written CUDA performance. The main advantages are:

Single-source C# (no separate .cu files)
Automatic data marshalling
Same code runs on CPU and GPU
Debug with standard .NET tools

Can I mix Hybridizer with existing CUDA code?

Yes. Use [IntrinsicFunction] and [IntrinsicInclude] to call existing CUDA functions from Hybridizer kernels.

Does Hybridizer support multi-GPU?

Use cuda.SetDevice(deviceId) before creating HybRunner. Each HybRunner instance targets one GPU.
