Measuring Roofline Quantities on NVIDIA GPUs
Roofline quantities for a GPU kernel can be measured with the NVProf tool, which was described here.
To plot roofline data, we need the kernel's arithmetic intensity (FLOPs per byte) and its achieved FLOP/s. Together these require three quantities:
- Number of floating point operations
- Data volume moved to and from DRAM or cache
- The runtime in seconds
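As a minimal sketch of how these quantities combine, the Python snippet below computes one roofline point. The function name `roofline_point` and the input values are illustrative assumptions, not part of NVProf or of any measurement in this page.

```python
def roofline_point(flops, bytes_moved, seconds):
    """Return (arithmetic intensity in FLOPs/byte, performance in FLOP/s)."""
    if bytes_moved <= 0 or seconds <= 0:
        raise ValueError("bytes_moved and seconds must be positive")
    arithmetic_intensity = flops / bytes_moved  # x-axis of the roofline plot
    flops_per_second = flops / seconds          # y-axis of the roofline plot
    return arithmetic_intensity, flops_per_second


# Made-up inputs chosen for easy arithmetic, not measured values.
ai, perf = roofline_point(flops=2.0e9, bytes_moved=1.0e9, seconds=0.5)
print(f"AI = {ai:.2f} FLOPs/byte, performance = {perf / 1e9:.2f} GFLOP/s")
```

The three arguments correspond to the three quantities listed above: floating point operations, data volume, and runtime.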
These can be collected with NVProf using the following steps:
1. Use NVProf to collect the time spent
To collect the time spent in the kernel you are interested in, run a command like the following:
command:
nvprof --print-gpu-trace ./build/bin/hpgmg-fv 6 8
output:
Time(%)  Time      Calls  Avg       Min       Max       Name
51.96%   2.52256s  1764   1.4300ms  1.4099ms  1.4479ms  void smooth_kernel<int=7, int=16, int=4, int=16>(level_type, int, int, double, double, int, double*, double*)
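The times in this output carry a unit suffix, so they must be converted to seconds before they can be used in a roofline calculation. The sketch below is one way to do that in Python. The helper `to_seconds` is hypothetical, and the set of suffixes it accepts (`ns`, `us`, `ms`, `s`) is an assumption about what NVProf prints.

```python
import re

# Assumed unit suffixes; extend this table if your output uses others.
_UNIT_TO_SECONDS = {"ns": 1e-9, "us": 1e-6, "ms": 1e-3, "s": 1.0}


def to_seconds(text):
    """Convert a time string such as '1.4300ms' to seconds."""
    match = re.fullmatch(r"([0-9]*\.?[0-9]+)(ns|us|ms|s)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized time value: {text!r}")
    value, unit = match.groups()
    return float(value) * _UNIT_TO_SECONDS[unit]


# Values copied from the sample output above.
avg_time = to_seconds("1.4300ms")    # average time per kernel invocation
total_time = to_seconds("2.52256s")  # total time over all invocations
print(f"avg = {avg_time:.6f} s, total = {total_time:.5f} s")
```

As a sanity check, the total time divided by the average time should be close to the number in the Calls column (1764 in the sample).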
2. Use the NVProf metric summary mode
In this mode you can specify the target kernel and collect information such as:
- Floating point ops
- DRAM R/W transactions
- DRAM R/W throughput
An example NVProf command to execute is:
nvprof --kernels "smooth_kernel" --metrics flop_count_dp --metrics dram_read_throughput --metrics dram_write_throughput --metrics dram_read_transactions --metrics dram_write_transactions ./build/bin/hpgmg-fv 6 8
This will produce output like the following for each kernel:
Invocations  Metric Name              Metric Description                           Min         Max         Avg
Kernel: void smooth_kernel<int=7, int=32, int=4, int=16>(level_type, int, int, double, double, int, double*, double*)
1764         flop_count_dp            Floating Point Operations(Double Precision)  240648192   240648192   240648192
1764         dram_read_throughput     Device Memory Read Throughput                299.98GB/s  307.48GB/s  303.72GB/s
1764         dram_write_throughput    Device Memory Write Throughput               40.102GB/s  41.099GB/s  40.578GB/s
1764         dram_read_transactions   Device Memory Read Transactions              4537918     4599890     4567973
1764         dram_write_transactions  Device Memory Write Transactions             606387      611691      610299
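If you save this output to a file, the metric rows can be pulled out with a few lines of Python. The following is a rough sketch that assumes each metric row starts with the invocation count, followed by the metric name, and ends with the Min, Max and Avg columns, as in the sample above. The function `parse_metric_rows` is a hypothetical helper, not part of NVProf.

```python
# Metric rows copied from the sample output above.
SAMPLE_ROWS = """\
1764 flop_count_dp Floating Point Operations(Double Precision) 240648192 240648192 240648192
1764 dram_read_throughput Device Memory Read Throughput 299.98GB/s 307.48GB/s 303.72GB/s
1764 dram_write_throughput Device Memory Write Throughput 40.102GB/s 41.099GB/s 40.578GB/s
1764 dram_read_transactions Device Memory Read Transactions 4537918 4599890 4567973
1764 dram_write_transactions Device Memory Write Transactions 606387 611691 610299
"""


def parse_metric_rows(text):
    """Map metric name -> invocations plus raw min/max/avg strings."""
    metrics = {}
    for line in text.splitlines():
        tokens = line.split()
        # Skip the header and "Kernel:" lines, which do not start with a count.
        if len(tokens) < 5 or not tokens[0].isdigit():
            continue
        minimum, maximum, average = tokens[-3:]
        metrics[tokens[1]] = {
            "invocations": int(tokens[0]),
            "min": minimum,
            "max": maximum,
            "avg": average,
        }
    return metrics


metrics = parse_metric_rows(SAMPLE_ROWS)
dram_transactions = (int(metrics["dram_read_transactions"]["avg"])
                     + int(metrics["dram_write_transactions"]["avg"]))
print(dram_transactions)  # average DRAM read + write transactions per invocation
```

Values are kept as strings because the throughput columns carry a unit suffix such as `GB/s`; only the unitless counts are converted to integers here.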
You may instead replace the DRAM metrics with L2 metrics to compute a cache-based roofline. For example, replace dram_write_throughput with l2_write_throughput. You can find other available metrics here.
To compute Arithmetic Intensity you can then use the following equivalent methods:
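Method I:
FP / ((DR + DW) * 32), where 32 bytes is the size of one transaction
Method II:
FP / ((TR + TW) * T), where T is the time taken by the kernel (from step 1)
where,
FP = double-precision floating point operations
DR/DW = DRAM read/write transactions
TR/TW = DRAM read/write throughput
Because FP and the transaction counts are reported per kernel invocation, T should also be the time of a single invocation (the Avg column in step 1), not the total time over all calls.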
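The two methods can be sketched in Python as follows. The function names are illustrative. Method I is evaluated with the Avg values from the sample metric output above. Method II is shown with made-up inputs only, because it needs throughput in bytes per second and a kernel time measured for the same kernel configuration. How to convert the `GB/s` figures to bytes per second is left to the caller and should be checked against the profiler documentation.

```python
TRANSACTION_BYTES = 32  # size of one transaction, as stated in the text


def ai_from_transactions(fp, dr, dw, transaction_bytes=TRANSACTION_BYTES):
    """Method I: FLOPs divided by bytes moved, from transaction counts."""
    return fp / ((dr + dw) * transaction_bytes)


def ai_from_throughput(fp, tr, tw, seconds):
    """Method II: FLOPs divided by bytes moved, from throughput (bytes/s) and time (s)."""
    return fp / ((tr + tw) * seconds)


# Method I with the Avg column of the sample metric output.
ai_1 = ai_from_transactions(fp=240648192, dr=4567973, dw=610299)
print(f"Method I: {ai_1:.3f} FLOPs/byte")

# Method II with made-up inputs (not measurements): 4e11 bytes/s for 1 ms.
ai_2 = ai_from_throughput(fp=1.0e9, tr=3.0e11, tw=1.0e11, seconds=1.0e-3)
print(f"Method II (illustrative inputs): {ai_2:.3f} FLOPs/byte")
```

Both functions divide the FLOP count by an estimate of the bytes moved. They differ only in whether that estimate comes from counting transactions or from multiplying throughput by time.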