Performance Metrics

Understanding performance metrics is essential for optimizing Hybridizer-generated kernels. This chapter covers key metrics and how to use them for tuning.

Key Metrics Overview

Metric	Description	Target
Occupancy	Ratio of active warps to maximum warps per SM	> 50%
SM Utilization	Fraction of time the SMs are actively executing	> 80%
Memory Bandwidth	Achieved data transfer rate	> 70% of peak
Compute Throughput	Operations per second	Depends on workload

Occupancy

Occupancy = Active Warps / Maximum Warps per SM

Factors limiting occupancy:

Factor	Impact
Registers per thread	More registers → fewer threads
Shared memory per block	More shared mem → fewer blocks
Block size	Suboptimal size → wasted slots

Checking Occupancy

// Get device properties for device 0
cudaDeviceProp prop;
cuda.GetDeviceProperties(out prop, 0);

// Upper bound on occupancy from the thread limit alone.
// Register and shared memory usage can lower the real value.
int maxThreadsPerSM = prop.maxThreadsPerMultiProcessor;
int threadsPerBlock = 256;
int blocksPerSM = maxThreadsPerSM / threadsPerBlock; // integer division rounds down

double occupancy = blocksPerSM * threadsPerBlock * 100.0 / maxThreadsPerSM;
Console.WriteLine($"Theoretical occupancy: {occupancy}%");
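
The thread-count bound can be extended to cover the register and shared memory limits from the table of limiting factors. The following is a minimal sketch, not a Hybridizer API: the class name, method name, and all parameters are hypothetical, every hardware limit is supplied by the caller, and allocation granularity (which real hardware applies to registers and shared memory) is ignored, so the result is still an estimate.

```csharp
using System;

public static class OccupancyEstimate
{
    // Returns theoretical occupancy in [0, 1]: resident threads / max threads per SM.
    // Assumption: all limits are passed in by the caller; nothing is queried from a device.
    public static double Theoretical(
        int threadsPerBlock, int regsPerThread, int sharedMemPerBlock,
        int maxThreadsPerSM, int maxBlocksPerSM, int regsPerSM, int sharedMemPerSM)
    {
        if (threadsPerBlock <= 0 || maxThreadsPerSM <= 0)
            throw new ArgumentOutOfRangeException(nameof(threadsPerBlock));

        // Blocks per SM allowed by each resource, taken independently.
        int byThreads = maxThreadsPerSM / threadsPerBlock;
        int byRegs = regsPerThread > 0
            ? regsPerSM / (regsPerThread * threadsPerBlock)
            : int.MaxValue;
        int bySharedMem = sharedMemPerBlock > 0
            ? sharedMemPerSM / sharedMemPerBlock
            : int.MaxValue;

        // The tightest limit decides how many blocks are resident.
        int blocks = Math.Min(Math.Min(byThreads, maxBlocksPerSM),
                              Math.Min(byRegs, bySharedMem));
        return (double)(blocks * threadsPerBlock) / maxThreadsPerSM;
    }
}
```

With illustrative limits of 2048 threads, 32 blocks, 65536 registers, and 65536 bytes of shared memory per SM, `OccupancyEstimate.Theoretical(256, 64, 0, 2048, 32, 65536, 65536)` returns 0.5: at 64 registers per thread only 4 blocks of 256 threads fit. Dropping to 32 registers per thread returns 1.0.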

Memory Throughput

Bandwidth Calculation

Effective Bandwidth = (Bytes Read + Bytes Written) / Time
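
As a worked instance of this formula, the sketch below converts byte counts and a measured kernel time into GB/s. The class and method names are hypothetical, 1 GB is taken as 10^9 bytes, and the elapsed time is assumed to come from whatever timing mechanism you already use.

```csharp
using System;

public static class Bandwidth
{
    // Effective bandwidth in GB/s (1 GB = 1e9 bytes).
    // elapsedMs is the measured kernel time in milliseconds.
    public static double EffectiveGBps(long bytesRead, long bytesWritten, double elapsedMs)
    {
        if (elapsedMs <= 0)
            throw new ArgumentOutOfRangeException(nameof(elapsedMs));

        double seconds = elapsedMs * 1e-3;
        return (bytesRead + bytesWritten) / seconds / 1e9;
    }
}
```

For a vector addition over n = 67,108,864 floats, the kernel reads 2 × n × 4 bytes and writes n × 4 bytes. If it runs in 2.0 ms, `Bandwidth.EffectiveGBps(2L * n * 4, 1L * n * 4, 2.0)` gives about 402.65 GB/s.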

Peak Bandwidth by GPU

GPU	Memory	Peak Bandwidth
GTX 1080 Ti	GDDR5X	484 GB/s
RTX 3090	GDDR6X	936 GB/s
A100 80GB (SXM)	HBM2e	2039 GB/s
H100 (SXM)	HBM3	3350 GB/s

Achieved vs Peak

Achieved %	Assessment
> 80%	Excellent (near the practical ceiling for a memory-bound kernel)
50-80%	Good
25-50%	Room for improvement
< 25%	Investigate access patterns
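
The assessment table can be applied mechanically once the effective bandwidth is known. This is a small sketch with hypothetical names. How the exact boundary values (25, 50, 80) are assigned is an assumption, since the table does not say which side they fall on.

```csharp
using System;

public static class BandwidthAssessment
{
    // Achieved bandwidth as a percentage of the GPU's peak bandwidth.
    public static double PercentOfPeak(double achievedGBps, double peakGBps)
    {
        if (peakGBps <= 0)
            throw new ArgumentOutOfRangeException(nameof(peakGBps));
        return 100.0 * achievedGBps / peakGBps;
    }

    // Maps a percentage to the assessment bands from the table.
    // Assumption: boundary values fall into the lower band above 80
    // and into the higher band at 50 and 25.
    public static string Assess(double percentOfPeak)
    {
        if (percentOfPeak > 80.0) return "Excellent";
        if (percentOfPeak >= 50.0) return "Good";
        if (percentOfPeak >= 25.0) return "Room for improvement";
        return "Investigate access patterns";
    }
}
```

For example, 402.65 GB/s on an RTX 3090 (936 GB/s peak) is about 43% of peak, so `Assess` returns "Room for improvement".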

Roofline Model

The roofline model helps identify whether your kernel is:

Compute bound: Limited by arithmetic operations
Memory bound: Limited by data transfer

Arithmetic Intensity = FLOPs / Bytes Transferred
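
To make the formula concrete, the sketch below computes arithmetic intensity, the ridge point (peak compute throughput divided by peak bandwidth, where the two roofs meet), and the attainable throughput under the roofline model. All names are hypothetical, and the peak figures in the usage note are illustrative rather than those of a real GPU.

```csharp
using System;

public static class Roofline
{
    // FLOPs performed per byte moved to or from memory.
    public static double ArithmeticIntensity(double flops, double bytesTransferred)
    {
        if (bytesTransferred <= 0)
            throw new ArgumentOutOfRangeException(nameof(bytesTransferred));
        return flops / bytesTransferred;
    }

    // Intensity (FLOP/byte) at which a kernel stops being memory bound.
    public static double RidgePoint(double peakGflops, double peakGBps)
    {
        if (peakGBps <= 0)
            throw new ArgumentOutOfRangeException(nameof(peakGBps));
        return peakGflops / peakGBps;
    }

    // Upper bound on throughput: the lower of the compute roof
    // and the memory roof (intensity x bandwidth).
    public static double AttainableGflops(double intensity, double peakGflops, double peakGBps)
    {
        return Math.Min(peakGflops, intensity * peakGBps);
    }
}
```

A single-precision SAXPY (y = a·x + y) does 2 FLOPs per element while reading 8 bytes and writing 4, so its intensity is 2 / 12 ≈ 0.167 FLOP/byte. On a hypothetical device with 10,000 GFLOP/s and 500 GB/s, the ridge point is 20 FLOP/byte and the kernel is capped at roughly 83 GFLOP/s, firmly memory bound.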

Practical Guidelines

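Arithmetic Intensity	Bound	Optimization Focus
< 5 FLOP/byte	Memory	Coalescing, caching
5-20 FLOP/byte	Transitional	Both
> 20 FLOP/byte	Compute	Instruction-level optimization

These thresholds are rough guides; the exact crossover depends on the GPU's ratio of peak compute throughput to peak memory bandwidth.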

SM Utilization

SM utilization indicates how often SMs are actively executing:

Utilization	Interpretation
> 90%	Optimal
70-90%	Good
< 70%	Insufficient parallelism

Common Causes of Low Utilization

Not enough blocks: Increase grid size
Synchronization barriers: Reduce sync points
Memory stalls: Improve access patterns
Branch divergence: Reorganize conditionals
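
For the first cause, a quick estimate shows whether a grid is large enough to fill the GPU. The sketch below uses a deliberately simplified model in which blocks run in successive "waves" of `smCount * blocksPerSM` resident blocks; real scheduling is not lockstep, and all names are hypothetical.

```csharp
using System;

public static class WaveEstimate
{
    // Number of waves needed to run the whole grid when at most
    // smCount * blocksPerSM blocks can be resident at once.
    public static int Waves(int gridSize, int smCount, int blocksPerSM)
    {
        if (gridSize <= 0 || smCount <= 0 || blocksPerSM <= 0)
            throw new ArgumentOutOfRangeException(nameof(gridSize));

        int resident = smCount * blocksPerSM;
        return (gridSize + resident - 1) / resident; // ceiling division
    }

    // Fraction of available block slots actually used across all waves.
    // Values well below 1.0 suggest the grid is too small or leaves a large tail.
    public static double SlotUsage(int gridSize, int smCount, int blocksPerSM)
    {
        int resident = smCount * blocksPerSM;
        int waves = Waves(gridSize, smCount, blocksPerSM);
        return (double)gridSize / ((double)waves * resident);
    }
}
```

With 80 SMs and 8 resident blocks per SM (640 slots), a grid of 100 blocks uses 0.15625 of the slots in its single wave, while a grid of 700 blocks needs 2 waves and uses 0.546875 of the slots because the second wave is mostly empty.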

Tuning Checklist

1. Launch Configuration

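// 256 threads per block is a reasonable default; tune per kernel
int blockSize = 256;
int gridSize = (N + blockSize - 1) / blockSize; // ceiling division

// Launch at least 8 blocks per SM to keep the GPU busy.
// Surplus blocks are harmless only if the kernel bounds-checks its index.
gridSize = Math.Max(gridSize, prop.multiProcessorCount * 8);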

wrapper.SetDistrib(gridSize, blockSize);
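
The grid-size arithmetic above can be isolated into a helper that is easy to check without a GPU. This is a sketch with hypothetical names; it mirrors the snippet's logic (ceiling division, then a floor of 8 blocks per SM) and assumes the kernel guards its index against `n`, because both steps can launch more threads than there are elements.

```csharp
using System;

public static class LaunchConfig
{
    // Grid size covering n elements, raised to at least
    // smCount * minBlocksPerSM blocks.
    public static int GridSize(long n, int blockSize, int smCount, int minBlocksPerSM = 8)
    {
        if (n <= 0 || blockSize <= 0 || smCount <= 0 || minBlocksPerSM <= 0)
            throw new ArgumentOutOfRangeException(nameof(n));

        long needed = (n + blockSize - 1) / blockSize;  // ceiling division
        long floor = (long)smCount * minBlocksPerSM;    // enough blocks to occupy every SM
        return checked((int)Math.Max(needed, floor));
    }
}
```

For an illustrative GPU with 80 SMs, `LaunchConfig.GridSize(1000000, 256, 80)` returns 3907, since the element count alone already exceeds the floor, while `LaunchConfig.GridSize(10000, 256, 80)` returns 640 even though 40 blocks would cover the data.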

2. Memory Access

Coalesce global memory accesses
Minimize global memory transactions
Use shared memory for data reuse
Align data to 128-byte boundaries

3. Compute

Minimize register usage if occupancy-limited
Use fast math when accuracy permits
Avoid warp divergence
Unroll small loops

4. Data Transfer

Minimize host-device transfers
Use pinned memory
Overlap compute and transfer with streams
Keep data on GPU between kernels

Profiling Commands

# Using Nsight Compute (modern)
ncu --set full ./my_program

# Using nvprof (legacy; not supported on GPUs with compute capability 8.0 or higher)
nvprof --metrics achieved_occupancy,gld_efficiency ./my_program

# Key nvprof metrics to check
# - achieved_occupancy
# - sm_efficiency
# - gld_throughput / gst_throughput
# - dram_utilization

Next Steps

Memory & Profiling — Detailed profiling examples
Optimize Kernels — Practical optimization guide
CUDA Functions — Writing efficient device code
