Understanding the Result
You've run your first kernel. Now let's understand what happened, how to debug issues, and how to optimize data transfers.
What Hybridizer Generated
When you build your project, Hybridizer:
- Reads your .NET assembly (`.dll`)
- Finds methods marked with `[EntryPoint]`
- Generates CUDA C++ source code
- Compiles it with `nvcc` into a native library
- Links it at runtime via `HybRunner`
You can inspect the generated code in your project's build output directory:
```
bin/Debug/
├── YourProject.dll       ← Your .NET code
├── YourProject_CUDA.cu   ← Generated CUDA source
├── YourProject_CUDA.dll  ← Compiled GPU binary
└── ...
```
Reading the generated .cu file is the best way to understand what Hybridizer does. Open it — you'll see your C# translated to standard CUDA C++.
Note: the generated .cu source file is only available in the Enterprise edition.
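To make this concrete, here is a rough sketch of what the translation could look like for a `VectorAdd` entry point. This is schematic CUDA C++ for illustration only — the names, signatures, and marshalling code Hybridizer actually emits differ:

```cuda
// A C# [EntryPoint] method becomes a __global__ CUDA kernel.
// Illustrative sketch, not actual Hybridizer output.
__global__ void VectorAdd(float* a, float* b, float* result, int N)
{
    // A grid-stride loop replaces the C# per-thread index logic,
    // so any grid size covers all N elements.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < N;
         i += blockDim.x * gridDim.x)
    {
        result[i] = a[i] + b[i];
    }
}
```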
Error Checking
GPU errors are silent by default. Always check for them:
```csharp
// After any GPU call
wrapper.VectorAdd(a, b, result, N);

// Check for launch errors
cudaError_t err = cuda.GetLastError();
if (err != cudaError_t.cudaSuccess)
{
    Console.Error.WriteLine($"Kernel launch error: {cuda.GetErrorString(err)}");
    return;
}

// Wait for completion and check
err = cuda.DeviceSynchronize();
if (err != cudaError_t.cudaSuccess)
{
    Console.Error.WriteLine($"Kernel execution error: {cuda.GetErrorString(err)}");
    return;
}
```
Or use the shorthand:
```csharp
wrapper.VectorAdd(a, b, result, N);
cuda.ERROR_CHECK(cuda.DeviceSynchronize());
```
DeviceSynchronize — Why It Matters
GPU kernels run asynchronously. When `wrapper.VectorAdd(...)` returns, the GPU might still be working:
```csharp
wrapper.VectorAdd(a, b, result, N);
// ⚠️ result[] might not be ready yet!
Console.WriteLine(result[0]); // Could print 0 (old value)

cuda.DeviceSynchronize();
// ✅ Now result[] is guaranteed to be read back
Console.WriteLine(result[0]); // Correct value
```
Always call `cuda.DeviceSynchronize()` before reading results. Forgetting this is the #1 cause of "wrong results" bugs.
Optimize Transfers with [In] / [Out]
By default, Hybridizer copies every array both ways (host↔device). That's wasteful:
```csharp
// Without attributes: a, b, result all copied both ways = 6 transfers
wrapper.VectorAdd(a, b, result, N);
```
Use marshalling attributes to specify direction:
```csharp
using System.Runtime.InteropServices;

[EntryPoint]
public static void VectorAdd(
    [In]  float[] a,      // Host → Device only (read-only)
    [In]  float[] b,      // Host → Device only (read-only)
    [Out] float[] result, // Device → Host only (write-only)
    int N)
{
    // ...
}
```
| Attribute | Transfer | Use When |
|---|---|---|
| (none) | ↔ Both ways | Array is read AND written |
| `[In]` | → Host to Device | Array is read-only on GPU |
| `[Out]` | ← Device to Host | Array is write-only on GPU |
Impact: with three 1M-float arrays (4 MB each), the default copies 6 × 4 MB = 24 MB over PCIe per call; with `[In]`/`[Out]` only 3 × 4 MB = 12 MB moves — half the transfer cost.
Debug with OMP Backend
Don't have a GPU? Or want to debug with breakpoints? Use the OMP backend:
```csharp
// Instead of GPU...
dynamic wrapper = HybRunner.Cuda();

// Use OpenMP (runs on CPU, same code path)
dynamic wrapper = HybRunner.OMP();
```
This runs the same generated code on CPU with OpenMP threads. Useful for:
- Machines without GPU
- Setting breakpoints in the generated code
- Verifying numerical correctness
If results are correct with OMP but wrong with CUDA, the bug is likely a parallelization issue (race condition, missing sync).
First Profiling
Quick Timing
```csharp
var sw = System.Diagnostics.Stopwatch.StartNew();
wrapper.VectorAdd(a, b, result, N);
cuda.DeviceSynchronize();
sw.Stop();
Console.WriteLine($"GPU time: {sw.ElapsedMilliseconds} ms");
```
The first call is always slow (CUDA context initialization). Measure the second call for accurate timing.
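A minimal warm-up pattern looks like this (a sketch, assuming the same `wrapper` and arrays as above):

```csharp
// Warm-up: the first call pays CUDA context + marshalling setup cost
wrapper.VectorAdd(a, b, result, N);
cuda.DeviceSynchronize();

// Timed run: measures steady-state kernel + transfer time
var sw = System.Diagnostics.Stopwatch.StartNew();
wrapper.VectorAdd(a, b, result, N);
cuda.DeviceSynchronize();
sw.Stop();
Console.WriteLine($"GPU time (warm): {sw.Elapsed.TotalMilliseconds:F2} ms");
```

Averaging several timed runs further reduces noise from the OS scheduler and PCIe traffic.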
NVIDIA Nsight
For detailed profiling, use NVIDIA's tools:
```shell
# Command-line profiling
ncu --set full YourProject.exe

# System-wide timeline
nsys profile YourProject.exe
```
Recap: The Debugging Checklist
When something goes wrong:
- ✅ Check `cuda.ERROR_CHECK(cuda.DeviceSynchronize())`
- ✅ Compare GPU result with CPU reference
- ✅ Try OMP backend — if it works, the issue is parallelization
- ✅ Inspect the generated `.cu` file
- ✅ Make sure `DeviceSynchronize()` is called before reading results
- ✅ Check `[In]`/`[Out]` attributes match your usage
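For the CPU-reference step, a tolerance-based comparison is more robust than exact equality, since GPU floating-point results can differ in the last bits. A sketch, assuming `a`, `b`, `result`, and `N` from the earlier examples:

```csharp
// CPU reference: the same computation, sequential
var expected = new float[N];
for (int i = 0; i < N; i++)
    expected[i] = a[i] + b[i];

// Compare element-wise with a small tolerance
bool ok = true;
for (int i = 0; i < N; i++)
{
    if (Math.Abs(result[i] - expected[i]) > 1e-5f)
    {
        Console.Error.WriteLine($"Mismatch at {i}: GPU={result[i]}, CPU={expected[i]}");
        ok = false;
        break;
    }
}
Console.WriteLine(ok ? "Results match" : "Results differ");
```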
Next
Now let's learn the real skill — transforming existing CPU code for GPU: From CPU to GPU →