Univa Grid Engine Pilot

Alongside an expansion of hardware for Apocrita, we are migrating to CentOS 7 and Univa Grid Engine. This brings a number of fixes and improvements, including resource management via cgroups.

Implementing this, along with some refinement of the queue configuration, will change the way jobs are submitted. These changes are currently being tested before nodes are moved from the old cluster to the new one.

Connecting to the new cluster

The new cluster is currently available via an alternate login node, login2. The following command will allow you to log in:

ssh login2.hpc.qmul.ac.uk

New Configuration

Whilst most of the cluster is open access, there are a number of restricted access nodes that have been purchased by specific groups. To reduce idle time on restricted nodes, the new configuration includes an open access queue, with a maximum runtime of 1 hour, across all restricted nodes not configured for infiniband.

Therefore, the owner of a node will wait a maximum of 1 hour to gain use of their node.

Access to gpu nodes is on a per-request basis. We currently have 4 nodes with Tesla K80 cards, featuring 2 GPUs per card.

Cluster nodes

Access      Node    PE             Count
Open                               209
            dn      SMP            150
            sm      SMP            10
            ccn     SMP, Parallel  15
            nxv     Parallel       20
            nxv     SMP            14
Restricted                         48
            nxn     Parallel       32
            nxn     SMP            6
            nxv     SMP            9
            panos   SMP            1
GPU                                4
            nxg     SMP, Parallel  4

Queues

The multiple queues on the old cluster have been condensed into two: the main queue all.q and the short-runtime queue short.q. Access lists and hostgroup-specific settings manage the required differences between groups of hosts.

Do not specify a queue

Specifying a queue with -q is unnecessary and potentially harmful to queuing times; jobs will be queued correctly based on their resource requests.
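
For example, rather than naming a queue, simply describe the resources the job needs and let the scheduler place it (a minimal sketch; job.sh is an illustrative script name):

# avoid: qsub -q short.q job.sh
# instead, request resources only:
qsub -l h_rt=1:0:0 -l h_vmem=1G job.sh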

Parallel Environments

Parallel environments have been simplified to smp and parallel. Slots are now equivalent to cores in all environments, unlike the old cluster's openmpi environment; see the sketch after the two descriptions below.

SMP

  • All slots are guaranteed to be on a single node

Parallel

  • Slots can be across any number of nodes
  • Allocation attempts to fill nodes before moving to another node
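
The environment is chosen with the -pe request in a job script; one or the other of the following would be used (a minimal sketch, with illustrative slot counts):

#$ -pe smp 8         # 8 slots (= 8 cores), all on one node
#$ -pe parallel 32   # 32 slots (= 32 cores), across any number of nodes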

IB Islands

With the addition of more infiniband nodes we now have three separate infiniband islands. When running a parallel job, you will need to request a specific island if you want to run on infiniband-enabled nodes.

Island   Nodes         Qsub
ccn      ccn0-ccn16    -l infiniband=ccn
nxn      nxn0-nxn31    -l infiniband=nxn
nxv      nxv1-nxv16    -l infiniband=nxv
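
The same request can be passed on the qsub command line instead of in the job script, for example (a minimal sketch; job.sh is an illustrative script name):

qsub -pe parallel 96 -l infiniband=ccn job.sh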

Modules

Upgrading to CentOS 7 has necessitated a clean-up of the modules currently built on Apocrita, due to the library version changes. This means we have started with a new clean build, with none of the previous versions available. For a short time during the pilot and initial release, older SL6 versions are available via the use.sl6 module; however, this will be deprecated shortly after the pilot phase.

Using New Modules

module load <modulename>

Using Old Modules

module load use.sl6
module load <modulename>
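
If you are unsure whether a package has been rebuilt yet, the standard module avail command lists the modules currently installed; after loading use.sl6 the listing should also include the older SL6 builds (a minimal sketch):

module avail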

Submitting Jobs

A basic qlogin or qsub can be run directly. With no requests specified, this will result in a single core with 1GB of RAM allocated and a runtime of 1 hour.
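
For example (a minimal sketch; run_code.sh is an illustrative script name):

qlogin              # interactive session with the default 1 core, 1GB, 1 hour
qsub run_code.sh    # batch job with the same default allocation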

Larger jobs can be run by adding the necessary resource requests. For instance, a 4 core job with 2GB of RAM per core and a runtime of 10 hours can be specified as:

#$ -pe smp 4        # 4 Cores
#$ -l h_vmem=2G     # 2GB per core (total 8G)
#$ -l h_rt=10:0:0   # 10 hour runtime

./run_code.sh
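
Assuming these directives are saved in a job script (the name run_job.sh below is illustrative), the job is submitted in the usual way:

qsub run_job.sh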

Slots == Cores

Previously, certain parallel environments treated the slots value differently; this has now been unified into a one-to-one ratio of slots to cores.

Parallel Jobs via Infiniband

Parallel jobs over infiniband require one of the following setups, depending on which infiniband island they are running on.

IB ccn

#$ -pe parallel 96      # 96 cores (48 per ccn node)
#$ -l infiniband=ccn    # Choose infiniband island
#$ -l h_rt=10:0:0       # 10 hour runtime

./run_code.sh

IB nxn

Restricted Access

Infiniband nxn nodes are restricted to certain users.

#$ -pe parallel 32      # 32 cores (16 per nxn node)
#$ -l infiniband=nxn    # Choose infiniband island (ccn nxn nxv)
#$ -l h_rt=10:0:0       # 10 hour runtime

./run_code.sh

IB nxv

#$ -pe parallel 64      # 64 cores (32 per nxv node)
#$ -l infiniband=nxv    # Choose infiniband island (ccn nxn nxv)
#$ -l h_rt=10:0:0       # 10 hour runtime

./run_code.sh

GPU

Restricted Access

Access to gpu nodes is available on request.

To request a gpu, use the -l gpu=<count> option. Note that requests are handled per node, so a request for 64 cores and 2 gpus will span two 32-core nxg nodes and result in 4 gpus in total.

GPU Card Allocation

Ensure you set card allocation

Failure to set card allocation may result in contention with other users' jobs and in your job being killed.

Requesting cards with parallel PE

If you are using the parallel parallel environment, requests will be exclusive; please ensure that you set the slots and gpu requests to fill each node.

Once a job starts, the assigned gpu cards are listed in the SGE_HGR_gpu environment variable as a space-separated list. To ensure correct use of the allocated gpu cards, you need to limit your computation to run only on those cards.

For Cuda, this can be done by exporting the CUDA_VISIBLE_DEVICES environment variable, which should be a comma-separated list:

$ echo $SGE_HGR_gpu
0 1
# Set CUDA_VISIBLE_DEVICES,
# this converts the space separated list into a comma separated list
$ export CUDA_VISIBLE_DEVICES=${SGE_HGR_gpu// /,}

For OpenCL, this can be done via the GPU_DEVICE_ORDINAL environment variable, which should be a comma-separated list:

$ echo $SGE_HGR_gpu
0 1
# Set GPU_DEVICE_ORDINAL,
# this converts the space separated list into a comma separated list
$ export GPU_DEVICE_ORDINAL=${SGE_HGR_gpu// /,}

Request one gpu (Cuda)

#$ -pe smp 16       # 16 cores (32 per nxg node)
#$ -l gpu=1         # request 1 gpu per host (2 per nxg node)
#$ -l h_rt=10:0:0   # 10 hour runtime

export CUDA_VISIBLE_DEVICES=${SGE_HGR_gpu// /,}
./run_code.sh

Request two gpus on the same box (OpenCL)

#$ -pe smp 32       # 32 cores (32 per nxg node)
#$ -l gpu=2         # request 2 gpu per host (2 per nxg node)
#$ -l h_rt=10:0:0   # 10 hour runtime

export GPU_DEVICE_ORDINAL=${SGE_HGR_gpu// /,}
./run_code.sh

Request four gpus across multiple boxes (Cuda)

#$ -pe parallel 64  # 64 cores (32 per nxg node)
#$ -l gpu=2         # request 2 gpu per host (2 per nxg node)
#$ -l h_rt=10:0:0   # 10 hour runtime

export CUDA_VISIBLE_DEVICES=${SGE_HGR_gpu// /,}
./run_code.sh