Understanding the Result
You've run your first kernel. Now let's understand what happened, how to debug issues, and how to optimize data transfers.
What Hybridizer Generated
When you build your project, Hybridizer:
- Reads your .NET assembly (`.dll`)
- Finds methods marked with `[EntryPoint]`
- Generates CUDA C++ source code
- Compiles it with `nvcc` into a native library
- Links it at runtime via `HybRunner`
You can inspect the generated code in your project's build output directory:
```
bin/Debug/
├── YourProject.dll       ← Your .NET code
├── YourProject_CUDA.cu   ← Generated CUDA source
├── YourProject_CUDA.dll  ← Compiled GPU binary
└── ...
```
Reading the generated .cu file is the best way to understand what Hybridizer does. Open it — you'll see your C# translated to standard CUDA C++.
Note: the generated .cu source file is only available in the Enterprise edition.
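To make this concrete, here is a rough sketch of what the translation could look like for a `VectorAdd` entry point. This is schematic CUDA C++ for illustration only — the names, signatures, and marshalling code Hybridizer actually emits differ:

```cuda
// A C# [EntryPoint] method becomes a __global__ CUDA kernel.
// Illustrative sketch, not actual Hybridizer output.
__global__ void VectorAdd(float* a, float* b, float* result, int N)
{
    // A grid-stride loop replaces the C# per-thread index logic,
    // so any grid size covers all N elements.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < N;
         i += blockDim.x * gridDim.x)
    {
        result[i] = a[i] + b[i];
    }
}
```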
Error Checking
GPU errors are silent by default. Always check for them:
```csharp
// After any GPU call
wrapper.VectorAdd(a, b, result, N);

// Check for launch errors
cudaError_t err = cuda.GetLastError();
if (err != cudaError_t.cudaSuccess)
{
    Console.Error.WriteLine($"Kernel launch error: {cuda.GetErrorString(err)}");
    return;
}

// Wait for completion and check
err = cuda.DeviceSynchronize();
if (err != cudaError_t.cudaSuccess)
{
    Console.Error.WriteLine($"Kernel execution error: {cuda.GetErrorString(err)}");
    return;
}
```
Or use the shorthand:
```csharp
wrapper.VectorAdd(a, b, result, N);
cuda.ERROR_CHECK(cuda.DeviceSynchronize());
```
DeviceSynchronize — Why It Matters
GPU kernels run asynchronously. When `wrapper.VectorAdd(...)` returns, the GPU might still be working:
```csharp
wrapper.VectorAdd(a, b, result, N);
// ⚠️ result[] might not be ready yet!
Console.WriteLine(result[0]); // Could print 0 (old value)

cuda.DeviceSynchronize();
// ✅ Now result[] is guaranteed to be read back
Console.WriteLine(result[0]); // Correct value
```
Always call `cuda.DeviceSynchronize()` before reading results. Forgetting this is the #1 cause of "wrong results" bugs.
Optimize Transfers with [In] / [Out]
By default, Hybridizer copies every array both ways (host↔device). That's wasteful:
```csharp
// Without attributes: a, b, result all copied both ways = 6 transfers
wrapper.VectorAdd(a, b, result, N);
```
Use marshalling attributes to specify direction:
```csharp
using System.Runtime.InteropServices;

[EntryPoint]
public static void VectorAdd(
    [In]  float[] a,      // Host → Device only (read-only)
    [In]  float[] b,      // Host → Device only (read-only)
    [Out] float[] result, // Device → Host only (write-only)
    int N)
{
    // ...
}
```
| Attribute | Transfer | Use When |
|---|---|---|
| (none) | ↔ Both ways | Array is read AND written |
| `[In]` | → Host to Device | Array is read-only on GPU |
| `[Out]` | ← Device to Host | Array is write-only on GPU |
Impact: with three 1M-float arrays (4 MB each), the default copies 6 × 4 MB = 24 MB over PCIe per call; with `[In]`/`[Out]` only 3 × 4 MB = 12 MB moves — half the transfer cost.
Debug with OMP Backend
Don't have a GPU? Or want to debug with breakpoints? Use the OMP backend:
```csharp
// Instead of GPU...
dynamic wrapper = HybRunner.Cuda();

// Use OpenMP (runs on CPU, same code path)
dynamic wrapper = HybRunner.OMP();
```
This runs the same generated code on CPU with OpenMP threads. Useful for:
- Machines without GPU
- Setting breakpoints in the generated code
- Verifying numerical correctness
If results are correct with OMP but wrong with CUDA, the bug is likely a parallelization issue (race condition, missing sync).
First Profiling
Quick Timing
```csharp
var sw = System.Diagnostics.Stopwatch.StartNew();
wrapper.VectorAdd(a, b, result, N);
cuda.DeviceSynchronize();
sw.Stop();
Console.WriteLine($"GPU time: {sw.ElapsedMilliseconds} ms");
```
The first call is always slow (CUDA context initialization). Measure the second call for accurate timing.
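A minimal warm-up pattern looks like this (a sketch, assuming the same `wrapper` and arrays as above):

```csharp
// Warm-up: the first call pays CUDA context + marshalling setup cost
wrapper.VectorAdd(a, b, result, N);
cuda.DeviceSynchronize();

// Timed run: measures steady-state kernel + transfer time
var sw = System.Diagnostics.Stopwatch.StartNew();
wrapper.VectorAdd(a, b, result, N);
cuda.DeviceSynchronize();
sw.Stop();
Console.WriteLine($"GPU time (warm): {sw.Elapsed.TotalMilliseconds:F2} ms");
```

Averaging several timed runs further reduces noise from the OS scheduler and PCIe traffic.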
NVIDIA Nsight
For detailed profiling, use NVIDIA's tools:
```shell
# Command-line profiling
ncu --set full YourProject.exe

# System-wide timeline
nsys profile YourProject.exe
```
Recap: The Debugging Checklist
When something goes wrong:
- ✅ Check `cuda.ERROR_CHECK(cuda.DeviceSynchronize())`
- ✅ Compare GPU result with CPU reference
- ✅ Try OMP backend — if it works, the issue is parallelization
- ✅ Inspect the generated `.cu` file
- ✅ Make sure `DeviceSynchronize()` is called before reading results
- ✅ Check `[In]`/`[Out]` attributes match your usage
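For the CPU-reference step, a tolerance-based comparison is more robust than exact equality, since GPU floating-point results can differ in the last bits. A sketch, assuming `a`, `b`, `result`, and `N` from the earlier examples:

```csharp
// CPU reference: the same computation, sequential
var expected = new float[N];
for (int i = 0; i < N; i++)
    expected[i] = a[i] + b[i];

// Compare element-wise with a small tolerance
bool ok = true;
for (int i = 0; i < N; i++)
{
    if (Math.Abs(result[i] - expected[i]) > 1e-5f)
    {
        Console.Error.WriteLine($"Mismatch at {i}: GPU={result[i]}, CPU={expected[i]}");
        ok = false;
        break;
    }
}
Console.WriteLine(ok ? "Results match" : "Results differ");
```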
Next
Now let's learn the real skill — transforming existing CPU code for GPU: From CPU to GPU →