Comparison of Systems

Below we compare the Cori, Theta, and Titan systems in depth, covering hardware, software environment, and the job submission process, to help Office of Science users make use of multiple resources.

Hardware In-Depth

System                   | Cori                                     | Theta                                    | Titan
Facility                 | NERSC                                    | ALCF                                     | OLCF
Model                    | Cray XC40                                | Cray XC40                                | Cray XK7
Processor                | Intel Xeon Phi 7250 ("Knights Landing")  | Intel Xeon Phi 7230 ("Knights Landing")  | AMD Opteron 6274 ("Interlagos")
Processor Cores          | 68                                       | 64                                       | 16 CPU cores (2688 SP / 896 DP CUDA cores on K20X GPU)
Processor Base Frequency | 1.4 GHz                                  | 1.3 GHz                                  | 2.2 GHz
Processor Max Frequency  | 1.6 GHz                                  | 1.5 GHz                                  | 3.1 GHz (disabled)
On-Device Memory         | 16 GB MCDRAM                             | 16 GB MCDRAM                             | (6 GB GDDR5 on K20X GPU)
Processor DRAM           | 96 GB DDR4                               | 192 GB DDR4                              | 32 GB DDR3
Accelerator              | (none)                                   | (none)                                   | NVIDIA Tesla K20X ("Kepler")
Nodes                    | 9,688                                    | 3,624                                    | 18,688
Perf. per Node           | 2.6 TF                                   | 2.6 TF                                   | 1.4 TF
Node-Local Storage       | (none)                                   | 128 GB SSD                               | (none)
External Burst Buffer    | 1.8 PB                                   | (none)                                   | (none)
Parallel File System     | 30 PB Lustre                             | 10 PB Lustre                             | 28 PB Lustre
Interconnect             | Cray Aries                               | Cray Aries                               | Cray Gemini
Topology                 | Dragonfly                                | Dragonfly                                | 3D torus
Peak Perf.               | 30 PF                                    | 10 PF                                    | 27 PF

Software Environment

System                          | Cori                                                                                              | Theta                                                                                             | Titan
Software Environment Management | modules                                                                                           | modules                                                                                           | modules
Batch Job Scheduler             | Slurm                                                                                             | Cobalt                                                                                            | PBS

Compilers
Intel                           | (default) module load PrgEnv-intel                                                                | (default) module load PrgEnv-intel                                                                | module load PrgEnv-intel
Cray                            | module load PrgEnv-cray                                                                           | module load PrgEnv-cray                                                                           | module load PrgEnv-cray
GNU                             | module load PrgEnv-gnu                                                                            | module load PrgEnv-gnu                                                                            | module load PrgEnv-gnu
PGI                             | n/a                                                                                               | n/a                                                                                               | (default) module load PrgEnv-pgi
Clang                           | n/a                                                                                               | module load PrgEnv-llvm                                                                           | n/a

Interpreters
R                               | gcc + MKL: module load R; Cray: module load cray-R                                                | module load cray-R                                                                                | module load r
Python 2                        | Anaconda + Intel MKL: module load python/2.7-anaconda                                             | Cray: module load cray-python; Intel: module load intelpython26                                   | module load python_anaconda
Python 3                        | Anaconda + Intel MKL: module load python/3.5-anaconda                                             | Intel: module load intelpython35                                                                  | module load python_anaconda3

Libraries
FFT                             | FFTW: module load fftw; Cray FFTW: module load cray-fftw; Intel MKL: automatic with Intel compilers | FFTW: module load fftw; Cray FFTW: module load cray-fftw; Intel MKL: automatic with Intel compilers | FFTW: module load fftw; Cray FFTW: module load cray-fftw
Cray LibSci                     | (default) module load cray-libsci                                                                 | module load cray-libsci                                                                           | module load cray-libsci
Intel MKL                       | automatic with Intel compilers                                                                    | automatic with Intel compilers                                                                    | automatic with Intel compilers
Trilinos                        | module load cray-trilinos                                                                         | module load cray-trilinos                                                                         | module load cray-trilinos
PETSc                           | module load cray-petsc                                                                            | module load cray-petsc                                                                            | module load cray-petsc
SHMEM                           | module load cray-shmem                                                                            | module load cray-shmem                                                                            | module load cray-shmem
memkind                         | module load cray-memkind                                                                          | module load cray-memkind                                                                          | n/a

I/O Libraries
HDF5                            | module load cray-hdf5                                                                             | module load cray-hdf5                                                                             | module load cray-hdf5
NetCDF                          | module load cray-netcdf                                                                           | module load cray-netcdf                                                                           | module load cray-netcdf
Parallel NetCDF                 | module load cray-parallel-netcdf                                                                  | module load cray-parallel-netcdf                                                                  | module load cray-parallel-netcdf

Performance Tools and APIs
Intel VTune Amplifier           | module load vtune                                                                                 | source /opt/intel/vtune_amplifier_xe/amplxe-vars.sh                                               | n/a
CrayPAT                         | module load perftools-base && module load perftools                                               | module load perftools                                                                             | module load perftools
PAPI                            | module load papi                                                                                  | module load papi                                                                                  | module load papi
Darshan                         | (default) module load darshan                                                                     | module load darshan                                                                               | module load darshan

Other Packages and Frameworks
Shifter                         | (part of base system)                                                                             | module load shifter                                                                               | n/a

Compiler Wrappers

Use these wrappers to properly cross-compile your source code for the compute nodes of the systems, and bring in appropriate headers for MPI, etc.

System  | Cori | Theta | Titan
C++     | CC   | CC    | CC
C       | cc   | cc    | cc
Fortran | ftn  | ftn   | ftn
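For example, on any of the three systems the same wrapper invocations cross-compile code for the compute nodes, with MPI headers and libraries added automatically (the source and output file names here are hypothetical):

```
cc  -o hello_c   hello.c     # C
CC  -o hello_cpp hello.cpp   # C++
ftn -o hello_f   hello.f90   # Fortran
```

The wrappers invoke whichever compiler the currently loaded PrgEnv module selects, so the same command line works after switching programming environments.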

Job Submission

Theta

Job Script

#!/bin/bash
#COBALT -t 30
#COBALT --attrs mcdram=cache:numa=quad
#COBALT -A <yourALCFProjectName>
echo "Starting Cobalt job script"
export n_nodes=$COBALT_JOBSIZE
export n_mpi_ranks_per_node=32
export n_mpi_ranks=$(($n_nodes * $n_mpi_ranks_per_node))
export n_openmp_threads_per_rank=4
export n_hyperthreads_per_core=2
export n_hyperthreads_skipped_between_ranks=4
aprun -n $n_mpi_ranks -N $n_mpi_ranks_per_node \
  --env OMP_NUM_THREADS=$n_openmp_threads_per_rank -cc depth \
  -d $n_hyperthreads_skipped_between_ranks \
  -j $n_hyperthreads_per_core \
  <executable> <executable args>

The #COBALT -t 30 line requests a 30-minute walltime. In general, #COBALT lines are equivalent to the corresponding qsub command-line arguments.
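The rank count in the script is derived from the node count that Cobalt reports. A minimal sketch of the same arithmetic, with COBALT_JOBSIZE hard-coded to the 512 nodes requested at submission (Cobalt sets the real value at run time):

```shell
# Rank arithmetic from the Cobalt script above; COBALT_JOBSIZE is
# hard-coded here only for illustration.
COBALT_JOBSIZE=512
n_nodes=$COBALT_JOBSIZE
n_mpi_ranks_per_node=32
n_mpi_ranks=$(( n_nodes * n_mpi_ranks_per_node ))
echo "$n_mpi_ranks MPI ranks"   # 16384 MPI ranks
```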

Job Submit Command

qsub -n 512 ./theta_script.sh
The -n 512 argument requests 512 nodes.

Titan

Job Script

#!/bin/bash
#PBS -A <yourOLCFProjectName>
#PBS -N test
#PBS -j oe
export n_nodes=$PBS_NUM_NODES
export n_mpi_ranks_per_node=8
export n_mpi_ranks=$(($n_nodes * $n_mpi_ranks_per_node))

cd $MEMBERWORK/<yourOLCFProjectName>
date

export OMP_NUM_THREADS=2

aprun -n $n_mpi_ranks -N $n_mpi_ranks_per_node \
  -d 2  <executable> <executable args>

Job Submit Command

qsub -l nodes=512 ./titan_script.sh
The -l nodes=512 argument requests 512 nodes (this can also be put in the batch script).
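The Titan script's layout can be sanity-checked with the same arithmetic, again hard-coding the 512-node request for illustration: 8 MPI ranks per node, each running 2 OpenMP threads, exactly fills the 16 Opteron cores on a node.

```shell
# Rank/thread layout for the Titan script above (node count
# hard-coded; PBS supplies the real value at run time).
n_nodes=512
n_mpi_ranks_per_node=8
omp_threads_per_rank=2
n_mpi_ranks=$(( n_nodes * n_mpi_ranks_per_node ))
threads_per_node=$(( n_mpi_ranks_per_node * omp_threads_per_rank ))
echo "$n_mpi_ranks ranks, $threads_per_node threads per node"
# 4096 ranks, 16 threads per node (matching the 16 CPU cores)
```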

Cori

NERSC provides a page on the MyNERSC website that generates job scripts automatically from a specified runtime configuration. An example script is shown below, in which a code uses 512 nodes of Xeon Phi with MCDRAM configured in "flat" mode, with 4 MPI processes per node and 34 OpenMP threads per MPI process, using 2 hyper-threads per physical core of Xeon Phi:

Job Script

#!/bin/bash
#SBATCH -N 512
#SBATCH -C knl,quad,flat
#SBATCH -p debug
#SBATCH -J myapp_run1
#SBATCH --mail-user=johndoe@nersc.gov
#SBATCH --mail-type=ALL
#SBATCH -t 00:30:00

#OpenMP settings:
export OMP_NUM_THREADS=34
export OMP_PLACES=threads
export OMP_PROC_BIND=spread


#run the application:
srun -n 2048 -c 68 --cpu_bind=cores numactl -p 1 myapp.x
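
The srun arguments follow from the run configuration described above. A sketch of the arithmetic, with the example's values (512 KNL nodes, 4 ranks per node, 2 hyper-threads per core) hard-coded:

```shell
# How the srun/OpenMP settings above are derived for a KNL node
# with 68 physical cores and 4 hardware threads per core.
nodes=512
ranks_per_node=4
hyperthreads_used_per_core=2
cores_per_node=68
hw_threads_per_core=4

total_ranks=$(( nodes * ranks_per_node ))                                    # srun -n
cpus_per_task=$(( cores_per_node * hw_threads_per_core / ranks_per_node ))   # srun -c
omp_threads=$(( cores_per_node / ranks_per_node * hyperthreads_used_per_core ))  # OMP_NUM_THREADS

echo "srun -n $total_ranks -c $cpus_per_task, OMP_NUM_THREADS=$omp_threads"
# srun -n 2048 -c 68, OMP_NUM_THREADS=34
```

Each rank thus owns 17 physical cores (68 logical CPUs), and 34 OpenMP threads spread across them use 2 of the 4 hardware threads per core.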