Here’s what I’ve learnt so far trying to get my code running on ANU’s National Computational Infrastructure (NCI). The NCI website provides a lot of information on all of this (see the raijin user guide), but I’ve found it can be a bit opaque. Luckily the help desk staff are very responsive if you ever get stuck (email@example.com).
What is the NCI?
The NCI houses raijin, a 60,000 CPU core supercomputer (the largest in the Southern Hemisphere), and it’s located at the ANU next to South Oval.
Applying for time on raijin
For ANU PhD students the first step is to apply for a startup grant. This is a small allocation of time (1000 core hours) for initial testing: learning how to use the system and getting an idea of how many core hours you will need for your main application.
To apply go here, fill in the form and get your supervisor to sign (technically the application comes from them, not you), then email it to the help desk (or just walk it over, it’s not that far!). The application process was a bit confusing at times, but a few emails to the help desk got it done.
Full applications open in November for computing time the next year. The minimum request on raijin is 20,000 core hours, which is fairly large: equivalent to about 4000 hours (roughly 24 weeks) of continuous running on an i7 desktop, or around $5000 worth on a commercial service like multyvac.
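To make the core-hour figures concrete, here is a rough sketch of the arithmetic. It assumes one core hour is charged for each core held for one hour of walltime, which ignores any queue-specific charge rates, and the function name is made up for illustration.

```shell
# core_hours CORES HOURS: core hours used by a job holding CORES cores for HOURS hours.
# Assumes a flat rate of one core hour per core per hour of walltime.
core_hours() {
    echo $(( $1 * $2 ))
}

core_hours 16 2          # one full 16-core node for 2 hours -> 32

# Whole hours of a full 16-core node covered by the 1000 core hour startup grant.
echo $(( 1000 / 16 ))    # integer division -> 62
```

So the startup grant goes quickly once you start using whole nodes, which is why it is meant for testing only.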
Once you’re approved you’ll receive a username and project code via email and a temporary password via sms.
You log in via ssh (if you’re on Windows or Mac you will need an ssh client, see the user guide). On Linux just open the terminal and type
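A minimal sketch of that login command, assuming the login host is raijin.nci.org.au (check the user guide for the current address) and using abc123 as a placeholder username. It prints the command rather than running it, so you can check the values first.

```shell
# Placeholder username: use the one from your approval email.
NCI_USER="abc123"
# Assumed login host for raijin; confirm it against the user guide.
NCI_HOST="raijin.nci.org.au"

# Assemble and print the login command; paste it into the terminal to connect.
LOGIN_CMD="ssh ${NCI_USER}@${NCI_HOST}"
echo "$LOGIN_CMD"
```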
You’ll get a warning about connecting for the first time. Just type yes, then enter your password and you’re in.
So now you are in a command-line Linux environment on a single raijin node (with 16 cores). You should have a prompt that looks like this
First let’s reset our password
We begin in our home folder, which is on the filesystem at /home/unigrp/username (where unigrp is the last three digits of your username). Here you can install all your code and any dependencies. Any changes you make will be saved when you log out. Your home folder has a capacity of 2 GB. You can check how much you have used with lquota.
Next we can check our account status
You should see your 1000 hour allocation. Note that these interactive sessions don’t count towards your quota.
raijin has just about all the software you might need already installed (see the list). You just need to make it available to your profile with the module command. To view the list of all software type
To load Python (with numpy, scipy and matplotlib) you type
To use Cython I also needed to replace the default Intel C compiler with gcc
You can then add these commands to your .profile file, to make sure they are executed on login. I used vim to edit these text files
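As a sketch of what those lines in .profile might look like: the module names python, intel-cc and gcc below are assumptions, so check the output of module avail on raijin for the real names and versions. This version appends the lines with a heredoc rather than opening vim, and PROFILE_FILE is only an illustrative override.

```shell
# File to append to; defaults to ~/.profile (PROFILE_FILE is an illustrative override).
PROFILE_FILE="${PROFILE_FILE:-$HOME/.profile}"

# Module names are assumptions; run `module avail` on raijin to see what exists.
cat >> "$PROFILE_FILE" <<'EOF'
module load python
module unload intel-cc
module load gcc
EOF
```

Run it once only, since each run appends the same three lines again.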
Installing your code and dependencies
An easy way to load your code is via GitHub (which is like Dropbox for code). Once you’ve learnt GitHub, and have your code in a GitHub repository like this, you can clone it directly onto your profile.
Too easy. I also needed to compile my code, which, once I had loaded gcc, worked just like it does on my local machine.
Next I needed to install a number of other Python packages not included on raijin by default (cython, pandas, scikit-learn). First create a folder to hold them
To install Python packages I can use easy_install. You just need to tell easy_install to install locally, for example
Next you need to add ~/packages to your PYTHONPATH environment variable so Python can find it
It’s best to make this change permanent by adding it to your .bashrc file.
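The package steps above can be sketched as follows, assuming the folder is ~/packages and using pandas purely as an example. The easy_install line is left commented out because it needs network access and easy_install on your PATH.

```shell
# Folder to hold locally installed Python packages.
mkdir -p "$HOME/packages"

# Put it on PYTHONPATH (keeping any existing entries) so Python can import from it.
export PYTHONPATH="$HOME/packages${PYTHONPATH:+:$PYTHONPATH}"

# Install into that folder; pandas is just an example package.
# easy_install --install-dir="$HOME/packages" pandas

# To make the change permanent, add the export line above to ~/.bashrc.
```

The export comes before the install here because easy_install typically complains if the target folder is not already on PYTHONPATH.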
The first place to store large data files is in your ‘short’ folder, located at /short/projectcode/username. As we saw from the nci_account output, you get 72 GB on short and 20 GB on massdata.
To transfer data files between raijin and your local computer you can use rsync. For example, to transfer a file from your short folder to your local machine, navigate to the local folder you want to hold the file then type
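Here is a hedged sketch of such a transfer. The username, project code, file name and host (raijin.nci.org.au) are all placeholders or assumptions, and the command is printed rather than run so it can be checked before use.

```shell
NCI_USER="abc123"                 # placeholder username
PROJECT="ab1"                     # placeholder project code
NCI_HOST="raijin.nci.org.au"      # assumed host; confirm against the user guide
# Hypothetical file in the short folder.
REMOTE_FILE="/short/${PROJECT}/${NCI_USER}/results.csv"

# -a preserves file attributes, -v is verbose, -z compresses data in transit.
# The trailing "." copies the file into the current local folder.
RSYNC_CMD="rsync -avz ${NCI_USER}@${NCI_HOST}:${REMOTE_FILE} ."
echo "$RSYNC_CMD"
```

Swapping the two path arguments sends a local file up to raijin instead.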
For longer term storage, the user guide recommends massdata because it’s backed up.
Just to check that it works I can run my code interactively
Hooray, it works! To run larger jobs across multiple nodes we need to use the PBS job scheduling system. To run the above job using PBS we type
where jobscript is a text file that contains the following
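As a sketch of what such a jobscript might contain, the snippet below writes one using the #PBS options discussed in this post. The project code, the module names and myscript.py are placeholders or assumptions, not the original file.

```shell
# Write a jobscript to the current folder.
# projectcode, the module names and myscript.py are placeholders or assumptions.
cat > jobscript <<'EOF'
#!/bin/bash
#PBS -P projectcode
#PBS -q express
#PBS -l walltime=60
#PBS -l ncpus=8
#PBS -l mem=500Mb
#PBS -l wd

# Repeat the module commands from .profile (names assumed).
module load python
module unload intel-cc
module load gcc

# The job itself (hypothetical script name).
python myscript.py
EOF

# On raijin, submit it with: qsub jobscript
```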
All of the lines beginning with #PBS are job scheduling options. -P projectcode just tells the system which project to charge the time against. -q express sets the type of ‘queue’ to join, either express (for testing), normal, or copyq (for data intensive jobs) - see the user guide. -l wd just tells the system to start the job from the current directory.
-l walltime=60 is the expected running time of the job in seconds (best to allow slightly longer than you expect). For large jobs you can specify the time in hours:minutes:seconds format. -l ncpus=8 sets the number of CPUs the job requires; if this is greater than 16 (one node) it needs to be a multiple of 16. -l mem=500Mb is the amount of memory required for the job. Your job won’t run if you don’t allow enough memory.
Next we need to repeat our module load statements. While all the changes we’ve made to our home folder will be available to the job, anything we’ve added to .profile needs to be repeated. Finally, we add our job command.
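The walltime and ncpus rules described above can be sketched with two small helper functions. Both names are made up for illustration and are not part of PBS.

```shell
# to_walltime SECONDS: format a number of seconds as hours:minutes:seconds.
to_walltime() {
    printf '%02d:%02d:%02d\n' $(( $1 / 3600 )) $(( $1 % 3600 / 60 )) $(( $1 % 60 ))
}

# round_ncpus N: leave requests of up to 16 cores alone,
# otherwise round up to the next multiple of 16 (whole nodes).
round_ncpus() {
    if [ "$1" -gt 16 ]; then
        echo $(( ($1 + 15) / 16 * 16 ))
    else
        echo "$1"
    fi
}

to_walltime 60      # 00:01:00
to_walltime 5400    # 01:30:00
round_ncpus 8       # 8
round_ncpus 20      # 32
```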
So after submitting we get the uninspiring response
where jobid is some number. We can query the status of the job with
Now if the job worked as planned we should find a jobscript.ojobid text file has been created with all of the output (if not you might find a jobscript.ejobid file with an error message). To view any of these files use the cat command
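A small sketch of that check, assuming the output files follow the jobscript.o and jobscript.e naming described above. The function name and the job number are made up for illustration.

```shell
# show_job_output JOBID [SCRIPT]: print whichever PBS output and error files exist for a job.
# SCRIPT defaults to "jobscript", matching the file names described above.
show_job_output() {
    jobid="$1"
    script="${2:-jobscript}"
    for f in "${script}.o${jobid}" "${script}.e${jobid}"; do
        if [ -f "$f" ]; then
            echo "== $f =="
            cat "$f"
        fi
    done
}

# Example with a made-up job number; prints nothing if neither file exists.
show_job_output 1234567
```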
Great, so it worked, and we have this resource usage info at the end, which is a good way to work out the CPU and memory requirements of your jobs.