# Comparison of Systems
Below we compare the Cori, Theta, and Titan systems in depth, including their hardware, software environments, and job submission processes, to help Office of Science users work across multiple facilities.
## Hardware In-Depth
System | Cori | Theta | Titan |
---|---|---|---|
Facility | NERSC | ALCF | OLCF |
Model | Cray XC40 | Cray XC40 | Cray XK7 |
Processor | Intel Xeon Phi 7250 ("Knights Landing") | Intel Xeon Phi 7230 ("Knights Landing") | AMD Opteron 6274 ("Interlagos") |
Processor Cores | 68 | 64 | 16 CPU cores (K20X GPU: 2688 SP / 896 DP CUDA cores) |
Processor Base Frequency | 1.4 GHz | 1.3 GHz | 2.2 GHz |
Processor Max Frequency | 1.6 GHz | 1.5 GHz | 3.1 GHz (disabled) |
On-Device Memory | 16 GB MCDRAM | 16 GB MCDRAM | (6 GB GDDR5 on K20X GPU) |
Processor DRAM | 96 GB DDR4 | 192 GB DDR4 | 32 GB DDR3 |
Accelerator | (none) | (none) | NVIDIA Tesla K20X ("Kepler") |
Nodes | 9,688 | 3,624 | 18,688 |
Perf. Per Node | 2.6 TF | 2.6 TF | 1.4 TF |
Node local storage | (none) | 128 GB SSD | (none) |
External Burst Buffer | 1.8 PB | (none) | (none) |
Parallel File System | 30 PB Lustre | 10 PB Lustre | 28 PB Lustre |
Interconnect | Cray Aries | Cray Aries | Cray Gemini |
Topology | Dragonfly | Dragonfly | 3D torus |
Peak Perf | 30 PF | 10 PF | 27 PF |
## Software Environment
System | Cori | Theta | Titan |
---|---|---|---|
Software environment management | modules | modules | modules |
Batch Job Scheduler | Slurm | Cobalt | PBS |
**Compilers** | | | |
Intel | (default) `module load PrgEnv-intel` | (default) `module load PrgEnv-intel` | `module load PrgEnv-intel` |
Cray | `module load PrgEnv-cray` | `module load PrgEnv-cray` | `module load PrgEnv-cray` |
GNU | `module load PrgEnv-gnu` | `module load PrgEnv-gnu` | `module load PrgEnv-gnu` |
PGI | n/a | n/a | (default) `module load PrgEnv-pgi` |
CLANG | n/a | `module load PrgEnv-llvm` | n/a |
**Interpreters** | | | |
R | gcc + MKL: `module load R`<br>Cray: `module load cray-R` | `module load cray-R` | `module load r` |
Python 2 | Anaconda + Intel MKL: `module load python/2.7-anaconda` | Cray: `module load cray-python`<br>Intel: `module load intelpython26` | `module load python_anaconda` |
Python 3 | Anaconda + Intel MKL: `module load python/3.5-anaconda` | Intel: `module load intelpython35` | `module load python_anaconda3` |
**Libraries** | | | |
FFT | FFTW: `module load fftw`<br>Cray FFTW: `module load cray-fftw`<br>Intel MKL: automatic with Intel compilers | FFTW: `module load fftw`<br>Cray FFTW: `module load cray-fftw`<br>Intel MKL: automatic with Intel compilers | FFTW: `module load fftw`<br>Cray FFTW: `module load cray-fftw` |
Cray LibSci | (default) `module load cray-libsci` | `module load cray-libsci` | `module load cray-libsci` |
Intel MKL | automatic with Intel compilers | automatic with Intel compilers | automatic with Intel compilers |
Trilinos | `module load cray-trilinos` | `module load cray-trilinos` | `module load cray-trilinos` |
PETSc | `module load cray-petsc` | `module load cray-petsc` | `module load cray-petsc` |
SHMEM | `module load cray-shmem` | `module load cray-shmem` | `module load cray-shmem` |
memkind | `module load cray-memkind` | `module load cray-memkind` | n/a |
**I/O Libraries** | | | |
HDF5 | `module load cray-hdf5` | `module load cray-hdf5` | `module load cray-hdf5` |
NetCDF | `module load cray-netcdf` | `module load cray-netcdf` | `module load cray-netcdf` |
Parallel NetCDF | `module load cray-parallel-netcdf` | `module load cray-parallel-netcdf` | `module load cray-parallel-netcdf` |
**Performance Tools and APIs** | | | |
Intel VTune Amplifier | `module load vtune` | `source /opt/intel/vtune_amplifier_xe/amplxe-vars.sh` | n/a |
CrayPAT | `module load perftools-base && module load perftools` | `module load perftools` | `module load perftools` |
PAPI | `module load papi` | `module load papi` | `module load papi` |
Darshan | (default) `module load darshan` | `module load darshan` | `module load darshan` |
**Other Packages and Frameworks** | | | |
Shifter | (part of base system) | `module load shifter` | n/a |
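
All three systems manage their software through environment modules, as summarized in the table above. As a minimal sketch (module names as listed above; the loaded defaults differ per system), switching compiler suites and loading libraries looks like this:

```bash
# Switch from the Intel to the GNU programming environment
# (Cray systems keep exactly one PrgEnv-* module active at a time):
module swap PrgEnv-intel PrgEnv-gnu

# Libraries loaded afterwards link against the active compiler suite:
module load cray-fftw cray-hdf5

# Inspect what is currently loaded:
module list
```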
## Compiler Wrappers
Use these wrappers to cross-compile your source code for the compute nodes of each system; they automatically bring in the appropriate headers and libraries for MPI and the currently loaded programming environment. An example invocation is shown after the table below.
System | Cori | Theta | Titan |
---|---|---|---|
C++ | `CC` | `CC` | `CC` |
C | `cc` | `cc` | `cc` |
Fortran | `ftn` | `ftn` | `ftn` |
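
For example, the following commands (source file names are placeholders) build MPI codes on any of the three systems; the wrappers add the MPI headers and libraries and target the compute-node architecture for the programming environment that is currently loaded:

```bash
cc  -o hello_c    hello.c     # C
CC  -o hello_cxx  hello.cpp   # C++
ftn -o hello_f90  hello.f90   # Fortran
```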
## Job Submission

### Theta

#### Job Script
```bash
#!/bin/bash
#COBALT -t 30
#COBALT --attrs mcdram=cache:numa=quad
#COBALT -A <yourALCFProjectName>

echo "Starting Cobalt job script"
export n_nodes=$COBALT_JOBSIZE
export n_mpi_ranks_per_node=32
export n_mpi_ranks=$(($n_nodes * $n_mpi_ranks_per_node))
export n_openmp_threads_per_rank=4
export n_hyperthreads_per_core=2
export n_hyperthreads_skipped_between_ranks=4

aprun -n $n_mpi_ranks -N $n_mpi_ranks_per_node \
  --env OMP_NUM_THREADS=$n_openmp_threads_per_rank -cc depth \
  -d $n_hyperthreads_skipped_between_ranks \
  -j $n_hyperthreads_per_core \
  <executable> <executable args>
```
The `#COBALT -t 30` line requests a 30-minute walltime. In general, `#COBALT` lines are equivalent to passing the corresponding arguments to `qsub` on the command line.
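
For example, the directives in the script above could instead be given on the `qsub` command line (the project name is a placeholder):

```bash
qsub -n 512 -t 30 -A <yourALCFProjectName> \
     --attrs mcdram=cache:numa=quad ./theta_script.sh
```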
#### Job Submit Command

```bash
qsub -n 512 ./theta_script.sh
```

The `-n 512` argument requests 512 nodes.
### Titan

#### Job Script
```bash
#!/bin/bash
#PBS -A <yourOLCFProjectName>
#PBS -N test
#PBS -j oe

export n_nodes=$JOBSIZE
export n_mpi_ranks_per_node=8
export n_mpi_ranks=$(($n_nodes * $n_mpi_ranks_per_node))

cd $MEMBERWORK/<yourOLCFProjectName>
date

export OMP_NUM_THREADS=2
aprun -n $n_mpi_ranks -N $n_mpi_ranks_per_node \
  -d 2 <executable> <executable args>
```
#### Job Submit Command

```bash
qsub -l nodes=512 ./titan_script.sh
```

The `-l nodes=512` argument requests 512 nodes (this can also be specified in the batch script).
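
A fuller submission command may also request a walltime explicitly; the sketch below uses an illustrative 30-minute limit, and the same resources can equivalently be set with `#PBS -l` lines in the script:

```bash
qsub -A <yourOLCFProjectName> -l nodes=512,walltime=00:30:00 ./titan_script.sh
```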
### Cori

NERSC provides a job script generator on the MyNERSC website that builds job scripts automatically from a specified runtime configuration. In the example script below, a code runs on 512 Xeon Phi nodes with MCDRAM configured in "flat" mode, using 4 MPI processes per node, 34 OpenMP threads per MPI process, and 2 hyper-threads per physical core:

#### Job Script
```bash
#!/bin/bash
#SBATCH -N 512
#SBATCH -C knl,quad,flat
#SBATCH -p debug
#SBATCH -J myapp_run1
#SBATCH --mail-user=<your_email_address>
#SBATCH --mail-type=ALL
#SBATCH -t 00:30:00

# OpenMP settings:
export OMP_NUM_THREADS=34
export OMP_PLACES=threads
export OMP_PROC_BIND=spread

# run the application:
srun -n 2048 -c 68 --cpu_bind=cores numactl -p 1 myapp.x
```
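
The script is submitted with Slurm's `sbatch` command (the script file name below is a placeholder):

```bash
sbatch ./cori_script.sh   # submit the job
squeue -u $USER           # check its state in the queue
```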