Skip to main content

Working with Images

Images are naturally 2D — and GPUs are built for 2D parallelism. In this tutorial, you'll apply a Sobel edge detection filter to an image.

Full sample: 2.Imaging/Sobel

What You'll Learn

  • 2D thread distribution: threadIdx.x/y, blockIdx.x/y
  • Stencil operations (reading neighbor pixels)
  • Working with images via ImageSharp
  • 2D SetDistrib configuration

The Sobel Operator

The Sobel filter detects edges by computing gradients in X and Y:

Gradient X:          Gradient Y:
-1 0 +1 -1 -2 -1
-2 0 +2 0 0 0
-1 0 +1 +1 +2 +1

For each pixel, we read its 8 neighbors, apply both kernels, and compute:

edge_strength = √(Gx² + Gy²)

Step 1: Load the Image

using SixLabors.ImageSharp;
using SixLabors.ImageSharp.PixelFormats;

// Add NuGet: SixLabors.ImageSharp
var image = Image.Load<Rgba32>("photo.png");
int width = image.Width;
int height = image.Height;

// Convert to grayscale byte array
byte[] pixels = new byte[width * height];
for (int i = 0; i < height; i++)
for (int j = 0; j < width; j++)
{
var p = image[j, i];
pixels[i * width + j] = (byte)(p.R * 0.2126 + p.G * 0.7152 + p.B * 0.0722);
}

Step 2: The 2D Kernel

[EntryPoint]
public static void ComputeSobel(
[Out] byte[] output,
[In] byte[] input,
int width, int height)
{
// Outer loop: rows (Y dimension)
for (int i = threadIdx.y + blockIdx.y * blockDim.y;
i < height;
i += blockDim.y * gridDim.y)
{
// Inner loop: columns (X dimension)
for (int j = threadIdx.x + blockIdx.x * blockDim.x;
j < width;
j += blockDim.x * gridDim.x)
{
// Skip border pixels (no neighbors)
if (i == 0 || j == 0 || i == height - 1 || j == width - 1)
{
output[i * width + j] = 0;
continue;
}

int pos = i * width + j;

// Read 8 neighbors
byte tl = input[pos - width - 1]; // top-left
byte t = input[pos - width]; // top
byte tr = input[pos - width + 1]; // top-right
byte l = input[pos - 1]; // left
byte r = input[pos + 1]; // right
byte bl = input[pos + width - 1]; // bottom-left
byte b = input[pos + width]; // bottom
byte br = input[pos + width + 1]; // bottom-right

// Apply Sobel kernels
int gx = tl + 2*l + bl - tr - 2*r - br;
int gy = tl + 2*t + tr - bl - 2*b - br;

// Magnitude
int mag = (int)Math.Sqrt(gx*gx + gy*gy);
output[pos] = (byte)Math.Min(mag, 255);
}
}
}

Key Difference: 2D Loops

Instead of one loop over i, we have two nested loops — one for rows (threadIdx.y) and one for columns (threadIdx.x).

Thread (threadIdx.x=3, threadIdx.y=5) in Block (blockIdx.x=1, blockIdx.y=2):
j = 3 + 1 * 16 = 19 (column)
i = 5 + 2 * 16 = 37 (row)
→ processes pixel at position (row=37, col=19)

Step 3: Launch with 2D Grid

// 2D grid and 2D blocks
dynamic wrapper = HybRunner.Cuda()
.SetDistrib(32, 32, 16, 16, 1, 0);
// ├──────┤ ├──────┤
// gridDim blockDim
// (blocks) (threads/block)

byte[] result = new byte[width * height];
wrapper.ComputeSobel(result, pixels, width, height);
cuda.DeviceSynchronize();
ParameterValueMeaning
Grid X3232 blocks horizontally
Grid Y3232 blocks vertically
Block X1616 threads per block (columns)
Block Y1616 threads per block (rows)

Total: 32 × 32 × 16 × 16 = 262,144 threads covering the image.

Step 4: Save the Result

var resultImage = new Image<Rgba32>(width, height);
for (int i = 0; i < height; i++)
for (int j = 0; j < width; j++)
{
byte v = result[i * width + j];
resultImage[j, i] = new Rgba32(v, v, v, 255);
}

resultImage.Save("edges.png");
Console.WriteLine("Saved edges.png");

Why GPUs Excel at Image Processing

AspectCPUGPU
Threads8-16262,144
Each pixelProcessed sequentiallyProcessed in parallel
Neighbor accessCache-friendlyCoalesced (consecutive threads read consecutive bytes)

Memory access pattern for the inner loop (j mapped to threadIdx.x):

Thread 0 reads pixel[i*W + 0]
Thread 1 reads pixel[i*W + 1]
Thread 2 reads pixel[i*W + 2]
...

These are consecutive memory addresses — the GPU loads them in a single transaction. This is called coalesced access and is critical for performance.

Going Further: Shared Memory

For even better performance, load the pixel neighborhood into shared memory first. See:

Exercise

Modify the kernel to apply a blur filter instead of edge detection. A simple 3×3 box blur averages all 9 pixels (center + 8 neighbors):

output[pos] = (tl + t + tr + l + center + r + bl + b + br) / 9;

Next

Let's tackle the most important GPU pattern: Shared Memory & Reduction →