Data Marshalling
Data marshalling describes how data is transferred between the host (CPU) and the device (GPU or accelerator). Understanding marshalling is crucial for achieving good performance.
Default Marshalling Rules
Arrays
Arrays are the most common data type in kernel code:
| Array Type | Default Behavior |
|---|---|
| float[], double[] | Copied to device, copied back |
| int[], long[] | Copied to device, copied back |
| Struct arrays | Copied element-by-element |
[EntryPoint]
public static void Process(float[] input, float[] output, int n)
{
    // input and output are handled automatically:
    // 1. Copied to the device before the kernel launches
    // 2. Read and written by the kernel on the device
    // 3. Copied back to the host after the kernel completes
}
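On the host side, calling the generated wrapper with plain managed arrays triggers this round trip. Assuming the wrapper exposes the entry point under the same name (mirroring the wrapper.Kernel calls used later on this page), usage might look like:

int n = 1 << 20;
float[] input = new float[n];
float[] output = new float[n];

// Plain managed arrays: copied to the device, processed, and copied back automatically
wrapper.Process(input, output, n);

// output now holds the results computed on the device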
Structs
Structs are marshalled inline (by value):
public struct Vector3
{
    public float X, Y, Z;
}

[EntryPoint]
public static void Transform(Vector3[] points, int n)
{
    // Vector3[] is marshalled as one contiguous block of memory
}
Only blittable structs are supported. Structs containing references (strings, objects) cannot be marshalled.
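For illustration, a struct containing only value-type fields is blittable and can cross the host/device boundary, while a struct holding a reference cannot (both types below are hypothetical examples):

// Blittable: only unmanaged value-type fields, so it can be marshalled
public struct Particle
{
    public float X, Y, Z;
    public float Mass;
}

// Not blittable: contains a reference-type field, so it cannot be marshalled
public struct LabeledPoint
{
    public float X, Y;
    public string Label;   // reference to a managed object
}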
Memory Transfer Diagram
In the default (pageable) path, data moves from the host array to a staging buffer and then to device memory; with pinned memory, transfers go directly from the host buffer to the device via DMA.
Pinned vs Copied Memory
Pageable Memory (Default)
Standard .NET arrays use pageable memory:
- Data is copied to a staging buffer
- Then copied to device memory
- Safe but slower
Pinned Memory
For better performance, use pinned (page-locked) memory:
// Allocate pinned (page-locked) host memory
IntPtr pinnedPtr = cuda.MallocHost(size);
float[] pinnedArray = /* wrap pinnedPtr as a managed array or span */;

// Use with the kernel: transfers go directly via DMA, with no staging copy
wrapper.Kernel(pinnedArray, output, n);

// Free when done
cuda.FreeHost(pinnedPtr);
Benefits of pinned memory:
- Faster transfers: Direct DMA access
- Async copies possible: Overlap compute and transfer
- Lower latency: No intermediate staging
Async Copies and Streams
For maximum performance, overlap data transfer with computation:
// Create streams (stream2 can host a second, independent batch; see the sketch below)
cudaStream_t stream1, stream2;
cuda.StreamCreate(out stream1);
cuda.StreamCreate(out stream2);

// Queue the upload, kernel launch, and download asynchronously on stream1
cuda.MemcpyAsync(d_a, h_a, size, cudaMemcpyHostToDevice, stream1);
wrapper.SetStream(stream1).Kernel(d_a, d_b, n);
cuda.MemcpyAsync(h_b, d_b, size, cudaMemcpyDeviceToHost, stream1);

// Synchronize when the results are needed on the host
cuda.StreamSynchronize(stream1);
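With two streams, the transfers of one batch can overlap the kernel execution of the other. A minimal sketch, assuming the device buffers (d_a1, d_b1, d_a2, d_b2) and pinned host buffers (h_a1, h_b1, h_a2, h_b2) are already allocated as in the snippets above:

// Batch 1 runs on stream1, batch 2 on stream2; the copies of one batch
// can overlap the kernel of the other
cuda.MemcpyAsync(d_a1, h_a1, size, cudaMemcpyHostToDevice, stream1);
cuda.MemcpyAsync(d_a2, h_a2, size, cudaMemcpyHostToDevice, stream2);

wrapper.SetStream(stream1).Kernel(d_a1, d_b1, n);
wrapper.SetStream(stream2).Kernel(d_a2, d_b2, n);

cuda.MemcpyAsync(h_b1, d_b1, size, cudaMemcpyDeviceToHost, stream1);
cuda.MemcpyAsync(h_b2, d_b2, size, cudaMemcpyDeviceToHost, stream2);

// Wait for both batches before reading the results on the host
cuda.StreamSynchronize(stream1);
cuda.StreamSynchronize(stream2);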
Ownership and Lifetimes
| Scenario | Ownership |
|---|---|
| Kernel parameter | Runtime manages transfer |
| Manual allocation | User manages lifetime |
| Pinned memory | User must free |
Do not modify host arrays while a kernel is running asynchronously. This can cause undefined behavior.
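As an illustrative sketch, reusing the stream calls from the previous section:

// Hazardous: the async copy on stream1 may still be reading h_a
cuda.MemcpyAsync(d_a, h_a, size, cudaMemcpyHostToDevice, stream1);
h_a[0] = 42.0f;   // undefined behavior: the transfer may still be in flight

// Safe: synchronize the stream before touching the host buffer again
cuda.MemcpyAsync(d_a, h_a, size, cudaMemcpyHostToDevice, stream1);
cuda.StreamSynchronize(stream1);
h_a[0] = 42.0f;   // the transfer has completed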
Advanced: Custom Marshalling
For complex scenarios, you can implement custom marshalling:
[StructLayout(LayoutKind.Sequential)]
public struct CustomData
{
    public IntPtr DataPointer;   // raw pointer to the underlying buffer
    public int Length;           // number of elements in the buffer
    // Custom marshalling logic
}
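As a hypothetical sketch of how such a struct might be filled, reusing the pinned allocation calls shown earlier (the field semantics here are assumptions):

// Allocate a pinned host buffer and describe it with CustomData
IntPtr buffer = cuda.MallocHost(size);

var data = new CustomData
{
    DataPointer = buffer,   // raw pointer to the pinned buffer
    Length = n              // element count
};

// ... pass 'data' to code that understands this layout ...

cuda.FreeHost(buffer);   // the user owns this allocation (see the ownership table above)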
Performance Tips
| Tip | Impact |
|---|---|
| Minimize transfers | High |
| Use pinned memory | Medium-High |
| Batch small transfers | Medium |
| Use async copies | Medium |
| Keep data on device | High |
The best transfer is no transfer at all. Keep data on the device between kernel calls when possible.
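For instance, when kernels are chained, intermediate results can stay in device buffers and only the final result crosses back to the host. A sketch built from the calls used above, assuming device buffers d_a, d_b, d_c and a hypothetical second kernel:

// Upload the input once
cuda.MemcpyAsync(d_a, h_a, size, cudaMemcpyHostToDevice, stream1);

// Chain kernels on device-resident buffers; no host round trip in between
wrapper.SetStream(stream1).Kernel(d_a, d_b, n);
wrapper.SetStream(stream1).Kernel2(d_b, d_c, n);   // hypothetical second kernel

// Download only the final result
cuda.MemcpyAsync(h_c, d_c, size, cudaMemcpyDeviceToHost, stream1);
cuda.StreamSynchronize(stream1);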
Next Steps
- Invoke Generated Code — Launch configuration
- Memory & Profiling — CUDA memory optimization
- Manage Memory — Best practices