From CPU to GPU
This tutorial takes existing C# code and transforms it for GPU execution, step by step. At each step, we measure the improvement.
Inspired by the Hybridizer "From Zero to Hero" approach.
The Problem: Apply a Function to Every Element
We want to apply f(x) = sin(x) * cos(x) + sqrt(abs(x)) to every element of a large array:
Step 0: Sequential C#
```csharp
static void Compute(float[] input, float[] output, int N)
{
    for (int i = 0; i < N; i++)
    {
        float x = input[i];
        output[i] = (float)(Math.Sin(x) * Math.Cos(x) + Math.Sqrt(Math.Abs(x)));
    }
}
```
This runs on a single core. For 16 million elements, it takes ~800 ms.
Step 1: Parallel.For (CPU Multi-Core)
The easiest win — use all CPU cores:
```csharp
static void Compute(float[] input, float[] output, int N)
{
    Parallel.For(0, N, i =>
    {
        float x = input[i];
        output[i] = (float)(Math.Sin(x) * Math.Cos(x) + Math.Sqrt(Math.Abs(x)));
    });
}
```
Time: ~150 ms on an 8-core CPU. That's a 5× speedup with one line changed.
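These timings depend on your hardware, so treat ~150 ms as indicative. A minimal `Stopwatch` harness (array size, seed, and method names here are illustrative, not from the tutorial's benchmark) lets you reproduce the sequential-vs-parallel comparison yourself:

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

// Sequential baseline from Step 0.
static void ComputeSeq(float[] input, float[] output, int N)
{
    for (int i = 0; i < N; i++)
    {
        float x = input[i];
        output[i] = (float)(Math.Sin(x) * Math.Cos(x) + Math.Sqrt(Math.Abs(x)));
    }
}

// Multi-core version from Step 1.
static void ComputePar(float[] input, float[] output, int N)
{
    Parallel.For(0, N, i =>
    {
        float x = input[i];
        output[i] = (float)(Math.Sin(x) * Math.Cos(x) + Math.Sqrt(Math.Abs(x)));
    });
}

int N = 1 << 22;                       // 4M elements: quick but measurable
var input = new float[N];
var outSeq = new float[N];
var outPar = new float[N];
var rng = new Random(42);
for (int i = 0; i < N; i++) input[i] = (float)(rng.NextDouble() * 10.0);

var sw = Stopwatch.StartNew();
ComputeSeq(input, outSeq, N);
Console.WriteLine($"sequential:   {sw.ElapsedMilliseconds} ms");

sw.Restart();
ComputePar(input, outPar, N);
Console.WriteLine($"Parallel.For: {sw.ElapsedMilliseconds} ms");
```

Comparing the two output arrays is also a cheap correctness check before moving to the GPU.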
Step 2: Add [EntryPoint]
Now, make it GPU-ready:
```csharp
[EntryPoint]
public static void Compute(
    [In]  float[] input,
    [Out] float[] output,
    int N)
{
    for (int i = threadIdx.x + blockDim.x * blockIdx.x;
         i < N;
         i += blockDim.x * gridDim.x)
    {
        float x = input[i];
        output[i] = (float)(Math.Sin(x) * Math.Cos(x) + Math.Sqrt(Math.Abs(x)));
    }
}
```
Changes:
- Added `[EntryPoint]`
- Added `[In]`/`[Out]` for transfer optimization
- Replaced the loop with a grid-stride loop
```csharp
// Launch
cuda.GetDeviceProperties(out cudaDeviceProp prop, 0);
dynamic wrapper = HybRunner.Cuda()
    .SetDistrib(prop.multiProcessorCount * 16, 256);
wrapper.Compute(input, output, N);
cuda.DeviceSynchronize();
```
Time: ~12 ms (including transfers). That's 67× faster than sequential!
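The grid-stride loop is safe for any N: each (block, thread) pair starts at its global index and jumps by the total number of launched threads, so the threads partition the array with no index missed or duplicated. You can convince yourself of this with a plain CPU simulation (the launch shape below is arbitrary, and N is deliberately not a multiple of the thread count):

```csharp
using System;

int N = 5000;                    // more elements than threads, and not a multiple
int blockDim = 256, gridDim = 4; // illustrative launch shape: 1024 threads total
var visits = new int[N];

// Simulate every GPU thread running the grid-stride loop from the kernel.
for (int blockIdx = 0; blockIdx < gridDim; blockIdx++)
    for (int threadIdx = 0; threadIdx < blockDim; threadIdx++)
        for (int i = threadIdx + blockDim * blockIdx;
             i < N;
             i += blockDim * gridDim)
            visits[i]++;

foreach (int v in visits)
    if (v != 1) throw new Exception("an index was missed or visited twice");
Console.WriteLine("every index covered exactly once");
```

This is also why `SetDistrib` can pick a thread count independent of N: the stride adapts automatically.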
Step 3: Use Fast Math Intrinsics
The GPU has special function units (SFU) for math. Map to them:
```csharp
[IntrinsicFunction("sinf")]
public static float Sinf(float x) => (float)Math.Sin(x);

[IntrinsicFunction("cosf")]
public static float Cosf(float x) => (float)Math.Cos(x);

[IntrinsicFunction("sqrtf")]
public static float Sqrtf(float x) => (float)Math.Sqrt(x);

[IntrinsicFunction("fabsf")]
public static float Fabsf(float x) => Math.Abs(x);

[EntryPoint]
public static void Compute([In] float[] input, [Out] float[] output, int N)
{
    for (int i = threadIdx.x + blockDim.x * blockIdx.x;
         i < N;
         i += blockDim.x * gridDim.x)
    {
        float x = input[i];
        output[i] = Sinf(x) * Cosf(x) + Sqrtf(Fabsf(x));
    }
}
```
Each `[IntrinsicFunction]` maps a C# method to a CUDA hardware instruction:
- The body is the CPU fallback (normal `Math` functions)
- On GPU, the fast hardware instruction is used instead
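Because the method bodies are ordinary `Math` calls, the CPU path produces results that match the original double-precision code up to single-precision rounding. A quick self-contained check of that claim (the attribute is omitted here since this sketch only exercises the CPU fallback):

```csharp
using System;

// Same bodies as the [IntrinsicFunction] methods above.
static float Sinf(float x) => (float)Math.Sin(x);
static float Cosf(float x) => (float)Math.Cos(x);
static float Sqrtf(float x) => (float)Math.Sqrt(x);
static float Fabsf(float x) => Math.Abs(x);

foreach (float x in new[] { -3.2f, 0f, 2.5f, 100f })
{
    float viaFallback = Sinf(x) * Cosf(x) + Sqrtf(Fabsf(x));
    float viaDouble   = (float)(Math.Sin(x) * Math.Cos(x) + Math.Sqrt(Math.Abs(x)));
    if (Math.Abs(viaFallback - viaDouble) > 1e-4f)
        throw new Exception($"mismatch at x={x}");
}
Console.WriteLine("CPU fallback matches Math within float rounding");
```

On the GPU side, note that `sinf` and friends trade a little accuracy for speed; for this tutorial's f(x) that trade-off is invisible, but it is worth knowing for numerically sensitive code.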
Time: ~5 ms. Even faster.
Summary Table
| Step | Change | Time | Speedup |
|---|---|---|---|
| 0. Sequential | Baseline | 800 ms | 1× |
| 1. Parallel.For | One line | 150 ms | 5× |
| 2. `[EntryPoint]` + GPU | Grid-stride loop | 12 ms | 67× |
| 3. `[IntrinsicFunction]` | Fast GPU math | 5 ms | 160× |
Rules for Porting
✅ Do
- Remove loop side effects: no `i++` on pointers, no external state mutation in the loop body
- Use `[In]`/`[Out]`: saves transfer time
- Use intrinsics for math: `expf`, `logf`, `sinf`, `sqrtf`
- Match iteration to thread: each thread handles `i`, strides by total threads
❌ Don't
- Don't allocate with `new` in kernels: heap allocation is extremely slow on GPU. Use `StackArray<T>` instead
- Don't use `string`, `List<T>`, `Dictionary`: reference types aren't supported on GPU
- Don't use `Console.WriteLine` in kernels (use `printf` via `[IntrinsicFunction]` if needed for debug)
- Don't forget `DeviceSynchronize`: results aren't ready until sync
StackArray for GPU-local Storage
If you need a temporary array inside a kernel:
```csharp
// ❌ BAD: heap allocation — very slow on GPU
var buffer = new float[64];

// ✅ GOOD: stack allocation — fast, uses registers/cache
var buffer = new StackArray<float>(64);
```
Exercise
Take this CPU code and port it to GPU:
```csharp
static void Normalize(float[] data, int N)
{
    float sum = 0;
    for (int i = 0; i < N; i++)
        sum += data[i];
    for (int i = 0; i < N; i++)
        data[i] /= sum;
}
```
Hint: you'll need two kernels — one for the sum (reduction) and one for the division.
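Not a solution, but the shape of one: you can prototype the two phases on the CPU first, with chunked partial sums standing in for the per-block GPU reduction, then port each phase as its own kernel. All names and the chunk count below are illustrative:

```csharp
using System;
using System.Threading.Tasks;

// Phase 1 stand-in: each "block" (chunk) produces one partial sum,
// mirroring a per-block GPU reduction into a partial-sums array.
static float[] PartialSums(float[] data, int N, int chunks)
{
    var partial = new float[chunks];
    int size = (N + chunks - 1) / chunks;
    Parallel.For(0, chunks, c =>
    {
        float s = 0;
        int end = Math.Min((c + 1) * size, N);
        for (int i = c * size; i < end; i++) s += data[i];
        partial[c] = s;
    });
    return partial;
}

var data = new float[] { 1f, 2f, 3f, 4f };
float sum = 0;
foreach (float p in PartialSums(data, data.Length, 2)) sum += p;

// Phase 2 stand-in: the element-wise divide, a trivial grid-stride kernel.
Parallel.For(0, data.Length, i => data[i] /= sum);
Console.WriteLine(string.Join(" ", data)); // normalized: elements sum to 1
```

On the GPU, the small partial-sums array can be summed on the host (or by a second tiny reduction kernel) before launching the division kernel, and `DeviceSynchronize` is needed between the two phases.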
Next
Let's apply these skills to a visual example: Working with Images →