Constant Memory

Sample source: 5.CUDA_runtime/ConstantMemory

This example shows how to place read-only data in CUDA constant memory — a cached, broadcast-optimized memory space ideal for coefficients and lookup tables.

GPU Memory Hierarchy

| Memory    | Scope  | Speed             | Size            | Use case     |
|-----------|--------|-------------------|-----------------|--------------|
| Registers | Thread | Fastest           | ~255 per thread | Variables    |
| Shared    | Block  | Very fast         | 48–96 KB        | Shared work  |
| Constant  | Device | Cached, broadcast | 64 KB           | Coefficients |
| Global    | Device | Slow              | GBs             | Arrays       |
tip

Constant memory is broadcast to all threads in a warp simultaneously. When all 32 threads read the same address, it's as fast as a register access.

Declaring Constant Memory

```csharp
[HybridConstant(Location = ConstantLocation.ConstantMemory)]
public static float[] data = [-2.0f, -1.0f, 0.0f, 1.0f, 2.0f];
```

This array is placed in CUDA __constant__ memory at compile time.
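On the CUDA side, the attribute corresponds to a `__constant__` declaration along these lines — a sketch only; the symbol name the Hybridizer actually generates is an implementation detail:

```cuda
// Sketch of what the generated CUDA declaration amounts to.
__constant__ float data[5] = { -2.0f, -1.0f, 0.0f, 1.0f, 2.0f };

// Constant memory is read-only from device code. To change the values
// between launches, the host overwrites the symbol:
//   float host[5] = { /* new coefficients */ };
//   cudaMemcpyToSymbol(data, host, sizeof(host));
```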

Stencil Kernel Using Constants

```csharp
[EntryPoint]
public static void Run([Out] float[] output, [In] float[] input, int N)
{
    // Grid-stride loop over interior points (skip the 2-wide halo)
    for (int k = 2 + threadIdx.x + blockDim.x * blockIdx.x;
         k < N - 2;
         k += blockDim.x * gridDim.x)
    {
        float tmp = 0;
        for (int p = -2; p <= 2; ++p)
        {
            tmp += data[p + 2] * input[k + p];
        }
        output[k] = tmp;
    }
}
```

All threads read the same data[p + 2] values — perfect for constant memory broadcast.
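For reference, a hand-written CUDA kernel equivalent to what the Hybridizer generates might look like the following sketch (identifiers are illustrative):

```cuda
#include <cuda_runtime.h>

// Coefficients live in constant memory; the name is illustrative.
__constant__ float coeff[5] = { -2.0f, -1.0f, 0.0f, 1.0f, 2.0f };

// Grid-stride 5-point stencil: all threads in a warp read the same
// coeff[p + 2] at the same time, so one constant-cache fetch serves
// the entire warp.
__global__ void run(float* output, const float* input, int N)
{
    for (int k = 2 + threadIdx.x + blockDim.x * blockIdx.x;
         k < N - 2;
         k += blockDim.x * gridDim.x)
    {
        float tmp = 0.0f;
        for (int p = -2; p <= 2; ++p)
            tmp += coeff[p + 2] * input[k + p];
        output[k] = tmp;
    }
}
```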

Launch

```csharp
HybRunner runner = SatelliteLoader.Load();
dynamic wrapped = runner.Wrap(new Program());
wrapped.Run(output, input, N);
```

No special configuration needed — the [HybridConstant] attribute handles it.
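For comparison, the plumbing the Hybridizer hides corresponds roughly to this plain-CUDA host sketch (the `run` kernel stands for a CUDA port of the `Run` entry point above, and the launch configuration is illustrative):

```cuda
#include <cuda_runtime.h>

// Illustrative declaration of a CUDA port of the C# Run kernel;
// the Hybridizer generates, compiles, and launches this for you.
__global__ void run(float* output, const float* input, int N);

// Plain-CUDA equivalent of wrapped.Run(output, input, N):
void launch(float* output, const float* input, int N)
{
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in,  N * sizeof(float));
    cudaMalloc(&d_out, N * sizeof(float));
    cudaMemcpy(d_in, input, N * sizeof(float), cudaMemcpyHostToDevice);

    run<<<128, 256>>>(d_out, d_in, N);  // grid-stride kernel covers any N

    cudaMemcpy(output, d_out, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
}
```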

When to Use Constant Memory

| ✅ Good for           | ❌ Bad for                    |
|-----------------------|-------------------------------|
| Stencil coefficients  | Large lookup tables (> 64 KB) |
| Physical constants    | Per-thread unique data        |
| Small lookup tables   | Frequently updated data       |
| Filter kernels        | Sparse access patterns        |

Next Steps