Your First Kernel

In this tutorial, you'll write a program that adds two arrays together — first on the CPU, then on the GPU. You'll see the exact same C# code run on both.

Start with Plain C#

Here's how every C# developer would add two arrays:

static void VectorAdd(float[] a, float[] b, float[] result, int N)
{
    for (int i = 0; i < N; i++)
    {
        result[i] = a[i] + b[i];
    }
}

This runs on one CPU core, one element at a time. Let's make it parallel.

Step 1: Add [EntryPoint]

One attribute transforms this into GPU-capable code:

using Hybridizer.Runtime.CUDAImports;

class Program
{
    [EntryPoint]
    public static void VectorAdd(float[] a, float[] b, float[] result, int N)
    {
        for (int i = threadIdx.x + blockDim.x * blockIdx.x;
             i < N;
             i += blockDim.x * gridDim.x)
        {
            result[i] = a[i] + b[i];
        }
    }
}

What changed?

  • Before (CPU): for (int i = 0; i < N; i++), where one thread does everything
  • After (GPU): for (int i = threadIdx.x + blockDim.x * blockIdx.x; i < N; i += blockDim.x * gridDim.x), where thousands of threads share the work

Think of it like this:

  • CPU: One person reads a book from page 1 to page 1000
  • GPU: 1000 people each read one page simultaneously

The Grid-Stride Loop

threadIdx.x + blockDim.x * blockIdx.x

This computes a unique ID for each thread. With 256 threads per block and 32 blocks:

Thread 0 in Block 0 → i = 0
Thread 1 in Block 0 → i = 1
...
Thread 0 in Block 1 → i = 256
Thread 0 in Block 2 → i = 512

The stride blockDim.x * gridDim.x handles arrays larger than the total thread count.
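The index arithmetic is language-agnostic, so you can check it without a GPU. Here is a small Python sketch (not Hybridizer code) that simulates the grid-stride loop with the same 256 threads per block and 32 blocks as above, and confirms that every element is visited exactly once:

```python
# Simulate the grid-stride loop on the CPU. block_dim / grid_dim
# mirror the CUDA built-ins blockDim.x / gridDim.x.
N = 100_000
block_dim = 256   # threads per block
grid_dim = 32     # blocks in the grid

visited = []
for block_idx in range(grid_dim):
    for thread_idx in range(block_dim):
        # Unique starting index for this thread.
        i = thread_idx + block_dim * block_idx
        while i < N:
            visited.append(i)
            i += block_dim * grid_dim  # the grid stride

# Every element 0..N-1 is covered exactly once, with no overlap.
assert sorted(visited) == list(range(N))
print("covered", len(visited), "elements")
```

Because each thread starts at its unique ID and jumps by the total thread count, the threads interleave over the array without ever colliding, no matter how large N is.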

Step 2: Launch It

static void Main()
{
    const int N = 1_000_000;

    // Prepare data
    float[] a = new float[N];
    float[] b = new float[N];
    float[] result = new float[N];

    for (int i = 0; i < N; i++)
    {
        a[i] = i;
        b[i] = i * 0.5f;
    }

    // Create GPU wrapper
    dynamic wrapper = HybRunner.Cuda().SetDistrib(128, 256);

    // Run on GPU!
    wrapper.VectorAdd(a, b, result, N);
    cuda.DeviceSynchronize();

    // Check results
    Console.WriteLine($"result[0] = {result[0]}");           // 0 + 0 = 0
    Console.WriteLine($"result[42] = {result[42]}");         // 42 + 21 = 63
    Console.WriteLine($"result[999999] = {result[999999]}"); // 999999 + 499999.5 = 1499998.5
}

Understanding SetDistrib(128, 256)

HybRunner.Cuda().SetDistrib(128, 256);
//                          │    │
//                          │    └─ 256 threads per block
//                          └────── 128 blocks in the grid

Total threads: 128 × 256 = 32,768 threads working in parallel.

info

For 1 million elements with 32K threads, each thread processes about 30 elements via the grid-stride loop. That's the point of the i += blockDim.x * gridDim.x stride.
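The per-thread workload is simple integer arithmetic, which this Python sketch (not Hybridizer code) spells out for the numbers used in this tutorial:

```python
# How much work does each thread get with SetDistrib(128, 256)
# and 1 million elements?
N = 1_000_000
threads = 128 * 256               # total threads in flight

per_thread_min = N // threads     # elements for the lightest thread
per_thread_max = -(-N // threads) # ceiling division: the heaviest thread

print(threads, per_thread_min, per_thread_max)  # 32768 30 31
```

Each of the 32,768 threads loops 30 or 31 times, which is where the "about 30 elements per thread" figure comes from.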

Step 3: What Happens Behind the Scenes

Hybridizer handles all memory transfers automatically: the arrays are copied to the GPU before the kernel launches, and the results are copied back when it finishes. You just call the method.

Step 4: Verify GPU == CPU

Always compare GPU results with a CPU reference:

// Reference buffers
float[] gpuResult = new float[N];
float[] cpuResult = new float[N];

// GPU
wrapper.VectorAdd(a, b, gpuResult, N);
cuda.DeviceSynchronize();

// CPU (same code, direct call)
VectorAdd(a, b, cpuResult, N);

// Compare
int errors = 0;
for (int i = 0; i < N; i++)
{
    if (Math.Abs(gpuResult[i] - cpuResult[i]) > 1e-5f)
        errors++;
}
Console.WriteLine(errors == 0 ? "✅ GPU matches CPU" : $"❌ {errors} mismatches");

tip

The [EntryPoint] method is valid C# code. You can always call it directly for a CPU reference. On CPU, threadIdx.x = 0, blockIdx.x = 0, blockDim.x = 1, gridDim.x = 1 — so the loop behaves like a regular for(i=0; i<N; i++).
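You can see that degeneration directly in a Python sketch (not Hybridizer code) that plugs the CPU-side values of the built-ins into the grid-stride loop:

```python
# On the CPU the CUDA built-ins collapse to fixed values, so the
# grid-stride loop degenerates to a plain sequential loop.
thread_idx, block_idx = 0, 0   # threadIdx.x, blockIdx.x
block_dim, grid_dim = 1, 1     # blockDim.x, gridDim.x
N = 10

indices = []
i = thread_idx + block_dim * block_idx   # start = 0
while i < N:
    indices.append(i)
    i += block_dim * grid_dim            # stride = 1

# Identical iteration order to for (i = 0; i < N; i++)
assert indices == list(range(N))
```

This is why the same [EntryPoint] method is a valid CPU reference with no code changes.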

Exercise

Modify the kernel to compute result[i] = a[i] * b[i] + 1.0f (fused multiply-add). Verify GPU == CPU.
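To check your answer, you can precompute a few expected values. This Python sketch (not part of the tutorial's C# code) uses the same inputs as Main(), where a[i] = i and b[i] = i * 0.5; note the GPU works in 32-bit floats, so compare large indices with a small tolerance:

```python
# Expected FMA results for the exercise, given a[i] = i, b[i] = i * 0.5:
# result[i] = a[i] * b[i] + 1.0 = 0.5 * i * i + 1.0
def expected(i):
    return i * (i * 0.5) + 1.0

print(expected(0))    # 1.0
print(expected(42))   # 42 * 21 + 1 = 883.0
```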

Next

How do you debug this? What is Hybridizer actually generating? → Understanding the Result