
GPU Memory & Profiling

Understanding GPU memory hierarchy and profiling is essential for writing high-performance code. This chapter covers memory transfers, profiling tools, and optimization techniques.

CPU-GPU Memory Model

[Figure: CPU-GPU memory]

Data usually originates on the hard drive or the network, and it has to be moved to the GPU for processing. Even when the GPU has direct access to system memory (using pinned or managed memory), the data still travels over PCIe or NVLink.
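As a point of reference, a minimal sketch of the two options just mentioned: cudaMallocHost allocates pinned (page-locked) host memory, and cudaMallocManaged allocates managed memory that migrates between host and device on demand.

#include <cuda_runtime.h>

int main()
{
    const int N = 1024 * 1024 * 32;

    // Pinned (page-locked) host memory: faster transfers over PCIe/NVLink,
    // and a prerequisite for asynchronous copies
    double *h_pinned;
    cudaMallocHost(&h_pinned, N * sizeof(double));

    // Managed (unified) memory: a single pointer valid on host and device;
    // pages migrate on demand
    double *m;
    cudaMallocManaged(&m, N * sizeof(double));

    cudaFreeHost(h_pinned);
    cudaFree(m);
}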

Basic Memory Operations

#include <cuda_runtime.h>

// Grid-stride kernel: add v to every element of a
__global__ void add(double *a, double v, int n)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        a[i] += v;
}

int main()
{
    const int N = 1024 * 1024 * 32;
    double *h_a = new double[N];
    double *d_a;

    // Initialize input on the host
    for (int i = 0; i < N; ++i) h_a[i] = 0.0;

    // Allocate memory on device
    cudaMalloc(&d_a, N * sizeof(double));

    // Move memory from host to device
    cudaMemcpy(d_a, h_a, N * sizeof(double), cudaMemcpyHostToDevice);

    // Invoke kernel on device (the launch is asynchronous)
    add<<<32, 256>>>(d_a, 1.0, N);

    // Copy memory back to host; this call blocks until the kernel has finished,
    // so no explicit cudaDeviceSynchronize() is needed here
    cudaMemcpy(h_a, d_a, N * sizeof(double), cudaMemcpyDeviceToHost);

    // Use result...
    cudaFree(d_a);
    delete[] h_a;
}
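None of the runtime calls above are checked for errors. In practice it helps to wrap them; a minimal sketch of such a check (the CHECK macro name is our own, not part of CUDA):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a readable message if a CUDA call fails
#define CHECK(call)                                               \
    do {                                                          \
        cudaError_t err = (call);                                 \
        if (err != cudaSuccess) {                                 \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,    \
                    cudaGetErrorString(err));                     \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// Usage:
//   CHECK(cudaMalloc(&d_a, N * sizeof(double)));
//   add<<<32, 256>>>(d_a, 1.0, N);
//   CHECK(cudaGetLastError()); // kernel launches report errors only this way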

Hybridizer Memory Management

With Hybridizer, memory management is largely automatic:

Automatic Marshalling

// hostArray is marshalled to the device automatically before the call
wrapper.MyKernel(hostArray, n);
// ...and the results are copied back to hostArray when the call returns

Profiling Tools

NVIDIA provides several profiling tools:

Tool            | Description            | Best For
----------------|------------------------|-----------------------
Nsight Systems  | System-wide timeline   | Overall performance
Nsight Compute  | Kernel-level analysis  | Detailed optimization
nvprof (legacy) | Command-line profiler  | Quick checks

Timeline View

The timeline view shows execution flow:

[Figure: Simple timeline]

Key observations:

  • First cudaMalloc takes time (CUDA context initialization)
  • Memory copies (H2D, D2H) often dominate execution time
  • Kernel execution (small brown bar) can be very fast
Info: Moving memory is expensive. Minimize transfers as much as possible.

Bandwidth Analysis

The memory tab shows effective bandwidth:

[Figure: Poor bandwidth]

In this example, the kernel achieves 275.63 GB/s (78% of peak on a GTX 1080 Ti).
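The profiler's figure can be cross-checked programmatically. A minimal sketch, reusing the add kernel and the d_a buffer from above, that times the kernel with CUDA events and divides the bytes moved by the elapsed time:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
add<<<32, 256>>>(d_a, 1.0, N);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);

// The kernel reads and writes each element once: 2 * N * sizeof(double) bytes
double gbytes = 2.0 * N * sizeof(double) / 1e9;
printf("Effective bandwidth: %.2f GB/s\n", gbytes / (ms / 1e3));

cudaEventDestroy(start);
cudaEventDestroy(stop);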

Issue Efficiency

The issue efficiency view reveals optimization opportunities:

[Figure: Issue efficiency]

Observations:

  • Only 8-16 warps active per SM (limit is 64)
  • 85% of time stalled on memory dependency
  • Global memory latency is ~400 cycles

Optimizing Occupancy

By increasing the number of blocks, we can saturate the GPU:

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
int smCount = prop.multiProcessorCount;

// Launch 8 blocks per SM with 256 threads each
// 8 × 256 = 2048 threads = 64 warps per SM
add<<<8 * smCount, 256>>>(d_a, 1.0, N);

Result with optimal configuration:

[Figure: Good bandwidth]

  • SMs saturated with enough warps
  • Achieved bandwidth: 328.39 GB/s (93% of peak)
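The runtime can predict this configuration without a profiler. A minimal sketch using cudaOccupancyMaxActiveBlocksPerMultiprocessor to ask how many resident blocks of the add kernel fit on one SM:

int blockSize = 256;
int blocksPerSM = 0;

// How many 256-thread blocks of add can be resident on one SM?
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, add, blockSize, 0);

int activeWarps = blocksPerSM * blockSize / 32;
printf("Resident blocks per SM: %d (%d of 64 warps)\n", blocksPerSM, activeWarps);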

Best Practices

Practice                               | Impact
---------------------------------------|-------
Minimize host-device transfers         | High
Use pinned memory for faster transfers | Medium
Overlap compute and transfer (streams) | Medium
Ensure high occupancy                  | Medium
Optimize memory access patterns        | High
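To illustrate the overlap row above, a minimal sketch that splits the work across two streams so copies and kernels from different chunks can overlap. It assumes h_a was allocated with cudaMallocHost (pinned), without which the asynchronous copies fall back to synchronous behavior; the two-chunk split is arbitrary:

const int CHUNK = N / 2;
cudaStream_t streams[2];
for (int s = 0; s < 2; ++s) cudaStreamCreate(&streams[s]);

for (int s = 0; s < 2; ++s) {
    size_t off = (size_t)s * CHUNK;
    cudaMemcpyAsync(d_a + off, h_a + off, CHUNK * sizeof(double),
                    cudaMemcpyHostToDevice, streams[s]);
    add<<<32, 256, 0, streams[s]>>>(d_a + off, 1.0, CHUNK);
    cudaMemcpyAsync(h_a + off, d_a + off, CHUNK * sizeof(double),
                    cudaMemcpyDeviceToHost, streams[s]);
}
cudaDeviceSynchronize();

for (int s = 0; s < 2; ++s) cudaStreamDestroy(streams[s]);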

Memory Hierarchy

Memory Type   | Scope  | Speed     | Size
--------------|--------|-----------|----------
Registers     | Thread | Fastest   | Limited
Shared Memory | Block  | Very Fast | 48-96 KB
L1 Cache      | SM     | Fast      | 48-128 KB
L2 Cache      | Device | Medium    | 4-40 MB
Global Memory | Device | Slow      | 8-80 GB
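As an illustration of the block-scoped level, a minimal sketch (a hypothetical staged kernel, assuming 256-thread blocks) that loads data into shared memory before using it, the building block behind tiling and reductions:

__global__ void staged(const double *in, double *out, int n)
{
    __shared__ double tile[256];          // one tile per block, very fast

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tile[threadIdx.x] = in[i]; // global -> shared
    __syncthreads();                      // the whole block sees the tile

    // Subsequent reads hit fast shared memory instead of global memory
    if (i < n) out[i] = tile[threadIdx.x] * 2.0;
}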

Next Steps