Here’s what I’ve learnt so far trying to get my code running on ANU’s National Computational Infrastructure. The NCI website provides a lot of information on all of this (see the raijin user guide), but I’ve found it can be a bit opaque. Luckily the help desk staff are very responsive if you ever get stuck (help@nf.nci.org.au).

What is the NCI?

The NCI houses raijin, a 57,000-core supercomputer - the largest in the Southern Hemisphere - and it’s located at the ANU next to South Oval.

Applying for time on raijin

For ANU PhD students the first step is to apply for a startup grant. This is a small allocation of time (1000 core hours) for initial testing: learning how to use the system and getting an idea of how many core hours you will need for your main application.

To apply, go here, fill in the form and get your supervisor to sign it (technically the application comes from them, not you), then email it to the help desk (or just walk it over, it’s not that far!). The application process was a bit confusing at times, but a few emails to the help desk got it done.

Full applications open in November for computing time in the following year. The minimum request on raijin is 20,000 core hours, which is fairly large: equivalent to about 4000 hours (around 20 weeks) on an i7 desktop, or around $5000 worth of time on a commercial service like multyvac.

Logging in

Once you’re approved you’ll receive a username and project code via email and a temporary password via SMS.

You log in via ssh (if you’re on Windows or Mac you will need an ssh client, see the user guide); on Linux just open the terminal and type

ssh -l username raijin.nci.org.au

You’ll get a warning about connecting for the first time; just type yes, then enter your password and you’re in.

###############################################################################
#                            NCI National Facility                            #
#      This service is for authorised clients only. It is a criminal          #
#      offence to:                                                            #
#                - Obtain access to data without permission                   #
#                - Damage, delete, alter or insert data without permission    #
#      Use of this system requires acceptance of the Conditions of Use        #
#      published at http://nci.org.au/conditions/                             #
###############################################################################
|                  Welcome to the NCI National Facility!                      |
|        raijin.nci.org.au - 57472 processor InfiniBand x86_64 cluster        |
|     Assistance: help@nci.org.au     Information: http://nci.org.au          |
===============================================================================
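To save typing the full address every time, you can add an entry to the ~/.ssh/config file on your local machine (the Host alias here is just a convenient name; username is a placeholder):

```
Host raijin
    HostName raijin.nci.org.au
    User username
```

After that, ssh raijin is enough to connect.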

First steps

So now you are in a command-line Linux environment on a single raijin node (each node has 16 cores). You should have a prompt that looks like this

[username@raijin4 ~]$

First, let’s reset our password

[username@raijin4 ~]$ passwd

We begin in our home folder, which is on the filesystem at /home/unigrp/username (where unigrp is the last three digits of your username). Here you can install all your code and any dependencies. Any changes you make will be saved when you log out. Your home folder has a capacity of 2 GB. You can check how much you have used with lquota.

[username@raijin4 ~]$ lquota
--------------------------------------------------------------
          fs  Usage  Quota  Limit   iUsage    iQuota   iLimit
--------------------------------------------------------------
ndh401  home  383MB 2000MB 2000MB     5449         0    88000 
   fr3 short   28kB 72.0GB  144GB        7    164000   328000 
--------------------------------------------------------------
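If you run up against the 2 GB limit, the standard Linux du command will show which folders are using the space. Here’s a sketch using a throwaway directory as a stand-in for your home folder:

```shell
# Create a demo directory with a 1 kB file standing in for $HOME
mkdir -p /tmp/home_demo/Model
head -c 1024 /dev/zero > /tmp/home_demo/Model/data.bin

# Summarise the size of each top-level item, smallest first
du -sh /tmp/home_demo/* | sort -h
```

On raijin you would run du -sh ~/* instead.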

Next we can check our account status

[username@raijin4 ~]$ nci_account

Usage Report: Project=projectcode Compute Period=2014.q4 (01/10/2014-31/12/2014)
========================================================================

Total Grant: 1000.00 SU
Total Used:   0.00 SU
Total Avail: 1000.00 SU
Bonus Used:   0.00 SU

-------------------------------------------------------------------------------------------------------------
System      Queue  Charge         Usage         Usage       SU used   Reserved SU    Pending SU      Total SU
                   Weight     (CPU Hrs)    (Walltime)                   (Running)      (Queued)     Committed
raijin      copyq     1.0          0.00          0.00          0.00          0.00          0.00          0.00
raijin    express     3.0          0.00          0.00          0.00          0.00          0.00          0.00
raijin    hugemem     1.0          0.00          0.00          0.00          0.00          0.00          0.00
raijin     normal     1.0          0.00          0.00          0.00          0.00          0.00          0.00
-------------------------------------------------------------------------------------------------------------
Overall                            0.00          0.00          0.00          0.00          0.00          0.00


Usage Report: Project=projectcode Storage Period=2014.10 (01/10/2014-31/12/2014)
========================================================================
-------------------------------------------------------------------------------------------------
System    StoragePt             Grant       Usage       Avail      iGrant      iUsage      iAvail
-------------------------------------------------------------------------------------------------
dmf       massdata            20.00GB      0.00GB     20.00GB     100.00K       0.00K     100.00K
raijin    short               72.00GB      0.00GB     72.00GB     164.00K       0.01K     163.99K
-------------------------------------------------------------------------------------------------
Total                         92.00GB      0.00GB     92.00GB     264.00K       0.01K     263.99K

You should see your 1000 core hour (SU) allocation. Note that these interactive sessions don’t count towards your quota.

Installing software

raijin has just about all the software you might need already installed (see the list). You just need to make it available to your profile with the module command. To view the list of all available software, type

[username@raijin4 ~]$ module avail

To load Python (with numpy, scipy and matplotlib) you type

[username@raijin4 ~]$ module load python/2.7.3
[username@raijin4 ~]$ module load python/2.7.3-matplotlib

To use Cython I also needed to replace the default Intel C compiler with gcc

[username@raijin4 ~]$ module unload intel-cc
[username@raijin4 ~]$ module load gcc/4.9.0

You can then add these commands to your .profile file to make sure they are executed on login. I used vim to edit these text files

[username@raijin4 ~]$ vim .profile
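After adding the module commands above, the end of my .profile looks something like this (a sketch; your module versions may differ):

```
# Load my usual toolchain on every login
module load python/2.7.3
module load python/2.7.3-matplotlib
module unload intel-cc
module load gcc/4.9.0
```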

Installing your code and dependencies

An easy way to load your code is via GitHub (which is like Dropbox for code). Once you’ve learnt git, and have your code in a GitHub repository like this, you can clone it directly onto your profile.

[username@raijin4 ~]$ git clone git://github.com/nealbob/regrivermod.git ~/Model

Too easy. I also needed to compile my code, which, once I had loaded gcc, worked just like it does on my local machine.

Next I needed to install a number of other Python packages not included on raijin by default (cython, pandas, scikit-learn). First, create a folder to hold them

[username@raijin4 ~]$ mkdir packages

To install Python packages I can use easy_install. You just need to tell easy_install to install locally, for example

[username@raijin4 ~]$ easy_install --install-dir=~/packages pandas

Next you need to add ~/packages to your PYTHONPATH environment variable so Python can find them

[username@raijin4 ~]$ export PYTHONPATH=~/packages:$PYTHONPATH

It’s best to make this change permanent by adding it to your .bashrc file.
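If you want to confirm the variable is being picked up, you can launch a fresh interpreter and check that the directory appears on sys.path. Here’s a self-contained sketch (using /tmp/packages as a stand-in for ~/packages):

```python
import os
import subprocess
import sys

# Stand-in for ~/packages on raijin
pkg_dir = "/tmp/packages"

# Spawn a fresh Python with PYTHONPATH set, and capture its sys.path
env = dict(os.environ, PYTHONPATH=pkg_dir)
out = subprocess.check_output(
    [sys.executable, "-c", "import sys; print(sys.path)"], env=env
).decode()

# PYTHONPATH entries are added to sys.path even before anything is installed
found = pkg_dir in out
print(found)  # True
```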

Data storage

The first place to store large data files is your ‘short’ folder, located at /short/projectcode/username. As we saw from the nci_account output, you get 72 GB on short and 20 GB on massdata.

To transfer data files between raijin and your local computer you can use rsync. For example, to transfer a file from your short folder to your local machine, navigate to the local folder you want to hold the file, then type

rsync -e "ssh -c arcfour" username@r-dm.nci.org.au:/short/projectcode/username/filename .

For longer term storage, the user guide recommends massdata because it’s backed up.

Running jobs

Just to check that everything works, I can run my code interactively

[username@raijin4 ~]$ cd Model
[username@raijin4 Model]$ python test.py

 --- Main parameters --- 

Inflow to capacity: 0.708747484611
Coefficient of variation: 0.7
Proportion of high demand: 0.234622818122
Target water price: 10.0
Transaction cost: 55.0
High user inflow share: 0.469245636245
Land: 4833.896933
High Land: 0.0655588606907

Decentralised storage model with 100 users. 

Solving the planner's problem...
PI Iteration: 1, Error: 100.0, PE Iterations: 68
PI Iteration: 2, Error: 0.0102, PE Iterations: 11
PI Iteration: 3, Error: 0.0014, PE Iterations: 2
PI Iteration: 4, Error: 0.001, PE Iterations: 1
Solve time: 11.7644062042
Running simulation for 500000 periods...
Simulation time: 5.42
Summary stats time 1.61270594597
Data stacking time: 1.63947796822
Storage mean: 698008.184127
Inflow mean: 694593.226731
Withdrawal mean: 520950.115529
Welfare mean: 186576969.954

Hurrah, it works! To run larger jobs across multiple nodes we need to use the PBS job scheduling system. To run the above job using PBS we type

[username@raijin1 Model]$ qsub jobscript

where jobscript is a text file that contains the following

#!/bin/bash
#PBS -P projectcode 
#PBS -q express
#PBS -l walltime=60
#PBS -l mem=500MB
#PBS -l ncpus=8
#PBS -l wd
module load python/2.7.3
module load python/2.7.3-matplotlib
python test.py

All of the lines beginning with #PBS are job scheduling options. -P projectcode just tells the system which project to charge the time against. -q express sets the type of ‘queue’ to join: either express (for testing), normal, or copyq (for data-intensive jobs) - see the user guide. -l wd just tells the system to start the job from the current working directory.

-l walltime=60 is the expected running time of the job in seconds (best to allow slightly longer than you expect); for large jobs you can specify the time in hours:minutes:seconds format. -l ncpus=8 sets the number of CPUs the job requires; if greater than 16 (one node), this needs to be a multiple of 16. -l mem=500MB is the amount of memory required for the job - your job won’t run if you don’t allow enough memory.

Next we need to repeat our module load statements. While all the changes we’ve made to our home folder will be available to the job, anything we’ve added to .profile needs to be repeated inside the jobscript. Finally, we add our job command.
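Putting those options together, a larger production run spanning two nodes might use a jobscript like this (a sketch with hypothetical numbers and a hypothetical main.py; remember ncpus above 16 must be a multiple of 16):

```shell
#!/bin/bash
#PBS -P projectcode
#PBS -q normal
#PBS -l walltime=10:00:00
#PBS -l mem=32GB
#PBS -l ncpus=32
#PBS -l wd
module load python/2.7.3
module load python/2.7.3-matplotlib
python main.py
```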

So after submitting, we get the uninspiring response

jobid.r-man2

where jobid is some number. We can query the status of the job with

[username@raijin1 Model]$ qstat -s jobid
qstat: 7516166.r-man2 Job has finished, use -x or -H to obtain historical job information
[username@raijin1 Model]$ qstat -s jobid -x

r-man2: 
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
jobid.r-man2   username express- jobscript   27544  --   8  500mb 00:01 F 00:00
   Job run at Tue Oct 28 at 08:58 on (r102:jobfs_local=102400kb:mem=512000...

Now if the job worked as planned, we should find that a jobscript.ojobid text file has been created with all of the output (if not, you might find a jobscript.ejobid file with an error message). To view any of these files, use the cat command

[username@raijin1 Model]$ cat jobscript.ojobid

 --- Main parameters --- 

Inflow to capacity: 0.708747484611
Coefficient of variation: 0.7
Proportion of high demand: 0.234622818122
Target water price: 10.0
Transaction cost: 55.0
High user inflow share: 0.469245636245
Land: 4833.896933
High Land: 0.0655588606907

Decentralised storage model with 100 users. 

Solving the planner's problem...
PI Iteration: 1, Error: 100.0, PE Iterations: 68
PI Iteration: 2, Error: 0.0102, PE Iterations: 11
PI Iteration: 3, Error: 0.0014, PE Iterations: 2
PI Iteration: 4, Error: 0.001, PE Iterations: 1
Solve time: 3.24676513672
Running simulation for 500000 periods...
Simulation time: 0.89
Summary stats time 1.6745159626
Data stacking time: 1.70569396019
Storage mean: 697642.617942
Inflow mean: 693752.110128
Withdrawal mean: 520867.804524
Welfare mean: 186632658.338
======================================================================================
			Resource Usage on 2014-10-28 08:59:26.639311:
	JobId:  jobid.r-man2  
	Project: projectcode 
	Exit Status: 0 (Linux Signal 0)
	Service Units: 0.04
	NCPUs Requested: 8				NCPUs Used: 8
							CPU Time Used: 00:00:36
	Memory Requested: 500mb 			Memory Used: 56mb
							Vmem Used: 531mb
	Walltime requested: 00:01:00 			Walltime Used: 00:00:20
	jobfs request: 100mb				jobfs used: 1mb
======================================================================================

Great, so it worked, and we have this resource usage info at the end, which is a good way to work out the CPU and memory requirements of your jobs.
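To budget your allocation, it helps to estimate the SU cost of a job before submitting. A simplified sketch, assuming the charge is CPUs reserved × walltime × the queue charge weight from the nci_account table (NCI’s actual accounting has more detail, e.g. memory-based charging, so treat this as a rough guide):

```python
def estimate_su(ncpus, walltime_hours, queue_weight):
    """Rough SU estimate: CPUs reserved x hours x queue charge weight."""
    return ncpus * walltime_hours * queue_weight

# A hypothetical one-node production run: 16 CPUs for 10 hours on the
# normal queue (charge weight 1.0)
su = estimate_su(ncpus=16, walltime_hours=10, queue_weight=1.0)
print(su)  # 160.0 SU charged against the grant
```

The same job on express (weight 3.0) would cost three times as much, which is why express is best kept for short tests.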