Performance Analysis Tools¶
Evaluating application performance portability across diverse computing architectures often requires the aid of performance analysis tools. Such tools provide detailed information and statistics characterizing an application's usage of the architecture, and can guide the developer as she optimizes bottlenecks to achieve higher performance.
Each ASCR facility is equipped with a wide range of tools for measuring application performance. The applications running at the three facilities exhibit a broad range of demands from computer architectures - some are limited by memory bandwidth, others by latency, and others still by the CPU itself. The performance measurement tools available at the ASCR facilities can measure in detail how an application uses each of these resources. They include, but are not limited to, the list provided below. The description of each tool is copied from its official documentation.
- Allinea MAP: Allinea MAP is the profiler for parallel, multithreaded or single threaded C, C++, Fortran and F90 codes. It provides in depth analysis and bottleneck pinpointing to the source line.
- Cray Performance Measurement and Analysis Tools: The Cray Performance Measurement and Analysis Tools (or CrayPat) are a suite of utilities that enable the user to capture and analyze performance data generated during the execution of a program on a Cray system. The information collected and analysis produced by use of these tools can help the user to find answers to two fundamental programming questions: How fast is my program running? and How can I make it run faster?
- HPCToolkit: HPCToolkit is an integrated suite of tools for measurement and analysis of program performance on computers ranging from multicore desktop systems to the nation's largest supercomputers. By using statistical sampling of timers and hardware performance counters, HPCToolkit collects accurate measurements of a program's work, resource consumption, and inefficiency and attributes them to the full calling context in which they occur. HPCToolkit works with multilingual, fully optimized applications that are statically or dynamically linked.
- Intel Advisor: Intel Advisor is used early in the process of adding vectorization into your code, or while converting parts of a serial program to a parallel (multithreaded) program. It helps you explore and locate areas in which the optimizations might provide significant benefit. It also helps you predict the costs and benefits of adding vectorization or parallelism to those parts of your program, allowing you to experiment.
- Intel VTune Amplifier: Intel VTune Amplifier is a performance analysis tool targeted for users developing serial and multithreaded applications.
- nvprof: nvprof enables the collection of a timeline of CUDA-related activities on both CPU and GPU, including kernel execution, memory transfers, memory set and CUDA API calls and events or metrics for CUDA kernels.
- Tuning and Analysis Utilities (TAU): TAU Performance System is a portable profiling and tracing toolkit for performance analysis of parallel programs written in Fortran, C, C++, UPC, Java, Python.
Using Tools on ASCR Facility Systems¶
Below are brief instructions and links to documentation or presentations on using some of the performance analysis tools on the current systems.
NERSC's documentation on MAP explains the software environment setup, how to run MAP on Cori using the GUI client or command-line mode. It also discusses looking at profiling results in the GUI.
NERSC's documentation on CrayPAT explains how to set up and use CrayPAT on Cori. It includes hot to use the Cray Apprentice2 GUI to visualize performance data and Cray Reveal for loopmark and source code analysis.
NERSC's documentation on Advisor explains how to use it on Cori, including how to launch jobs and how to use the GUI to view results.
Intel VTune Amplifier¶
NERSC's documentation on
explains how to use VTune Amplifier XE on Cori, including module setup,
-dynamic, and compiling with
-g. It also has example job
scripts for collecting different kinds of profiling data and a section on
using the VTUne GUI. ```
explains the basic setup and gives example
aprun syntax for running under
Allinea MAP. You use this syntax in the
aprun command in your Cobalt job
A Cray presentation given at ALCF describes how to load appropriate modules and build your code to use CrayPAT, both the "lite" and full versions. You then run your code using the normal Cobalt job script. Note that you must first load the module to select the Cray programming environment (and compile/recompile your code with that environment). The default is the Intel environment, so if you have not changed it here is a command to switch to the Cray environment:
module swap PrgEnv-intel PrgEnv-cray
Mark Krentel's presentation
has quick start information for using HPCToolkit on Theta. In the
command in your job script, you insert
hpcstruct before your executable
Intel VTune Amplifier¶
ALCF's documentation has basic steps to use VTune on Theta, with an example run script. It also explains how to selectively profile a subset of all MPI ranks.
Tuning and Analysis Utilities (TAU)¶
For all TAU usage modes, you should first load the TAU module:
module load tau
The Hands-On section of Sameer Shende's
illustrates using TAU on Theta via a Cobalt interactive session (
-I). You may also run a normal batch job, inserting the
before your executable program name in the
aprun command in your Cobalt
batch script. To use TAU without recompiling your code, you must have linked
it as a dynamic executable (link using
-dynamic), and you should have
compiled and linked with
-g. To use with compiler and/or explicit
source-code instrumentation, you should compile using the TAU compiler
wrappers as explained in the presentation.
OLCF's documentation on
CrayPAT includes a
10-step usage guide for basic analysis of your program on Titan using
CrayPAT. It explains using the
pat_report tool for text reports and
Apprentice2 for GUI analysis. There are more details on the OLCF CrayPAT
OLCF's documentation on accelerator performance tools explains how to set up your environment and run using the NVPROF profiler to gather performance data from the GPUs.