Efficiently allocating resources in sbatch scripts

Something that trips me up literally every time I write my own submission scripts is Slurm's method of resource allocation.

If I later find the resources that I used to get to this understanding, I'll come back and put them here.

An example submission script:

This is the job submission file from my custom HOMME-E3SM install on Great Lakes.

```bash
#!/bin/bash
#
# note: sbatch does not expand shell variables inside #SBATCH
# directives, so ${JOBNAME} below is taken literally; pass the name
# on the command line (sbatch --job-name=... job.sh) instead
#SBATCH --job-name=${JOBNAME}
#SBATCH --account=cjablono1
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=36
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1000m
#SBATCH --time=1:00:00
#
# 25 nodes, 30min sufficient for all 5 runs
# 12 nodes, 10min for r400 and r100
#

source /home/owhughes/HOMME/E3SM/setup.sh
export OMP_NUM_THREADS=6
export MV2_ENABLE_AFFINITY=0
NCPU=36
EXEC=../../../test_execs/theta-l-nlev30/theta-l-nlev30

function run {
    local NCPU=$1
    echo "NCPU = $NCPU"
    namelist=namelist-$prefix.nl
    \cp -f $namelist input.nl
    date
    # use the NCPU argument rather than a hardcoded rank count
    mpirun -bind-to=core -np $NCPU $EXEC < input.nl
    date

    ncl plot-baroclinicwave-init.ncl
    ncl plot-lat-lon-TPLSPS.ncl 'var_choice=1'
    ncl plot-lat-lon-TPLSPS.ncl 'var_choice=2'
    ncl plot-lat-lon-TPLSPS.ncl 'var_choice=3'
    \mv -f plot_baroclinicwave_init.pdf ${prefix}_init.pdf
    \mv -f preqx-test16-1latlonT850.pdf ${prefix}_T850.pdf
    \mv -f preqx-test16-1latlonPS.pdf ${prefix}_PS.pdf
    \mv -f preqx-test16-1latlonPRECL.pdf ${prefix}_PRECL.pdf

    \mv -f movies/dcmip2016_test11.nc movies/${prefix}_dcmip2016_test11.nc
}

# cap the coarse r400 run at 384 ranks
prefix=r400    ; run $(($NCPU>384?384:NCPU))

prefix=r100-dry; run $NCPU
prefix=r100-h  ; run $NCPU
prefix=r100    ; run $NCPU

prefix=r50     ; run $NCPU
```

As far as I can tell, this indicates that `--ntasks-per-node` dictates how many tasks `mpirun` can directly spawn on each node. `-bind-to=core` seems to play better with hyperthreaded processors? THIS IS THE IMPORTANT ONE!
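For contrast, here is a minimal sketch of what a "matched" hybrid MPI+OpenMP request might look like, assuming a 36-core node split into 6 ranks of 6 threads each (the executable name is a placeholder; `SLURM_NTASKS` and `SLURM_CPUS_PER_TASK` are variables Slurm exports into the job environment):

```bash
#!/bin/bash
#SBATCH --nodes=1             # one 36-core node
#SBATCH --ntasks-per-node=6   # MPI ranks mpirun may spawn on the node
#SBATCH --cpus-per-task=6     # cores reserved for each rank's threads
#SBATCH --time=1:00:00

# size each rank's OpenMP team to the cores Slurm actually reserved
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# 6 ranks x 6 threads = 36 cores, matching the allocation
mpirun -bind-to=core -np $SLURM_NTASKS ./my_exec < input.nl
```

The idea is that ranks times threads should equal the cores requested; my HOMME script above instead requests 36 single-core tasks and then runs 6 OpenMP threads per rank on top of them, which is presumably why the thread count is invisible to Slurm.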

However, one can increase `$OMP_NUM_THREADS` seemingly ad infinitum without Slurm complaining; apparently it can run more than 36 threads at a time on one node. (Note: I haven't directly checked that it's allocating only one node to my job.)
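A quick way to check that (a sketch using standard Slurm environment variables; the `squeue` job ID is a placeholder):

```bash
# inside the job script: report what Slurm actually allocated
echo "nodelist:  $SLURM_JOB_NODELIST"
echo "num nodes: $SLURM_JOB_NUM_NODES"

# or from a login node while the job is running
squeue -j 12345678 -o "%.10i %.6D %N"   # job ID, node count, nodelist
```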

Even so, this appears to be the correct approach to allocating resources, since all jobs from ne4 up to ne60 run reasonably fast.

Note: when `-bind-to=core` is applied to MPAS, it runs in a much more sensible time frame and complains less. My guess is that by default it tries to bind each rank to a socket, which causes it to hang, or at least to eat up an absurdly wasteful amount of resources.
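If the same thing happens when launching with `srun` instead of `mpirun`, the analogous knob (as far as I know) is `--cpu-bind`; a sketch with a placeholder executable:

```bash
# pin each task to its own cores rather than letting it float over a socket
srun --cpu-bind=cores ./my_exec < input.nl
```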

seff ${JOBID}

Evidently you can run `seff ${JOBID}` on a completed job to see how much of its allocated CPU time and memory it actually used.
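If `seff` isn't available, my understanding is that `sacct` reports similar numbers (the job ID is a placeholder):

```bash
# wall time, peak resident memory, and total CPU time for a finished job
sacct -j 12345678 --format=JobID,Elapsed,MaxRSS,TotalCPU,State
```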
