To run long jobs on the central Linux system, just log into ts-access and run them. When you use ts-access you’re logged into one of our faster machines (our Linux servers), which have more memory and 8x more CPU power than the DPO/EIETL machines, with the same user files as on the usual DPO/EIETL Linux machines. Fair use of these machines depends on community spirit and peer pressure. Mail helpdesk if you think someone’s hogging resources.
Some familiarity with running things from the Unix/MacOS command line is essential. Familiarity with shell scripts is an advantage. You can run the programs as you normally would from the Linux/MacOS command line, but when running long programs note that
- You probably won’t be able to give your program input from the command line, especially if it’s asked for long after the program’s started.
- If you put & at the end of your command, you can run your program “in the background”, freeing up the command line to do other things.
- If you put nohup at the start of your command, the program won’t be killed when you log out.
So once you’re logged into a Linux/Unix/MacOS machine at CUED, typing
ssh ts-access ... nohup my/program &
will start running my/program and continue running it after you log out of ts-access. Output that would normally appear onscreen goes into a file called nohup.out – though note that the output file won’t immediately be updated.
Before you run your program in the background, first check that it starts OK when run normally.
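As a minimal sketch of the nohup and & combination described above, the following sends output to a named log file instead of nohup.out and records the process ID so that the job can be checked later. The file names my_job.log and my_job.pid are hypothetical, and the sh -c command merely stands in for your own program.

```shell
#!/bin/sh
# Hypothetical names: my_job.log and my_job.pid are illustrative only.
# The 'sh -c ...' command stands in for your own long-running program.
nohup sh -c 'echo started; sleep 1; echo finished' > my_job.log 2>&1 &

# $! holds the process ID of the most recent background command.
echo $! > my_job.pid

# Later (even after logging out and back in) you can check on the job:
if kill -0 "$(cat my_job.pid)" 2>/dev/null; then
    echo "job is still running"
else
    echo "job has ended"
fi
```

Redirecting both standard output and standard error (2>&1) keeps error messages in the same log, which can help when working out why a job stopped.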
Matlab
If you run a Matlab job, remember to exit from Matlab at the end of the script or function, because Matlab won’t automatically exit. If, for example, you have a file in your home directory called testme.m containing
disp("hello world!")
exit
you could log into ts-access, type
nohup matlab -r testme &
then log out of ts-access. Soon in your home directory on CUED’s central system you’ll have a file called nohup.out containing the output of your program. Matlab will no longer be running on the ts-access machine.
If your program doesn’t use the parts of Matlab written in Java (which is going to be the case for number-crunching programs that don’t use the Parallel Computing Toolbox) then you can speed things up by using matlab -nodesktop -nojvm instead of matlab.
Preparing your code
Try to write your code so that it saves results periodically, and the program can re-start by loading in those results, carrying on from that stage. In this way you can still make progress even if your programs are interrupted by power-cuts, reboots (which happen some Tuesday evenings at CUED, so that we can do updates) etc.
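The save-and-resume idea can be sketched in a few lines of shell. This is only an illustration under stated assumptions: the checkpoint file name checkpoint.txt and the step count are hypothetical, and in practice the same pattern would live inside your own program, saving whatever results it needs in order to carry on.

```shell
#!/bin/sh
# Hypothetical checkpoint file and step count, for illustration only.
STATE=checkpoint.txt
TOTAL=5

# Resume from the saved step if a checkpoint exists, otherwise start at 1.
if [ -f "$STATE" ]; then
    start=$(cat "$STATE")
else
    start=1
fi

i=$start
while [ "$i" -le "$TOTAL" ]; do
    # ... one unit of real work would go here ...
    echo "finished step $i"
    i=$((i + 1))
    # Write to a temporary file and then rename it, so that an
    # interruption mid-write cannot leave a half-written checkpoint.
    echo "$i" > "$STATE.tmp"
    mv "$STATE.tmp" "$STATE"
done
```

The checkpoint holds the number of the next step to run, so restarting the script after an interruption repeats at most the step that was in progress.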
Many programs will run much faster if a little thought is given to optimising the code. Once programs run for days, even an improvement of a few percent becomes significant. See
for ways of speeding your programs up.
If your program requires interaction you’ll need to rewrite it so that interaction isn’t required.
Diagnostic and monitoring commands
You may find these commands useful when monitoring the progress of your program or when reporting problems.
- more /etc/centos-release – tells you the version of the CentOS operating system installed (if any)
- hostname – displays the name of the machine you’re on
- top – textually displays load average, size and cpu-load of processes, etc. It updates every few seconds. Type ‘q’ to quit.
- uname -a – the name of the machine you’re on, etc
- gnome-system-monitor – graphical output showing how busy each CPU core is, etc
- nproc – shows how many cores are available
- getconf -a – shows how much memory, cache, etc the machine has
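When reporting a problem it can be handy to gather the output of several of these commands in one file. The following is a rough sketch; the report file name machine_report.txt is hypothetical, and only non-interactive commands are used (top and gnome-system-monitor are left out because they expect a terminal or display).

```shell
#!/bin/sh
# Hypothetical report file name; change it to suit.
REPORT=machine_report.txt

{
    echo "Date:    $(date)"
    echo "Host:    $(hostname)"
    echo "Kernel:  $(uname -a)"
    echo "Cores:   $(nproc)"
    echo "Uptime:  $(uptime)"
    # /etc/centos-release exists only on CentOS machines.
    if [ -r /etc/centos-release ]; then
        echo "Release: $(cat /etc/centos-release)"
    fi
} > "$REPORT"

cat "$REPORT"
```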
Troubleshooting
Your program may fail for several reasons:
- Using too much CPU – the system should be set up so that there’s no limit to your CPU usage. Confirm that by typing “ulimit -t” (plain “ulimit” reports the file-size limit, not the CPU limit). You should get the reply “unlimited”.
- More limits – if you type “ulimit -a” you’ll get a list of some other limits. Some of these are rather esoteric, but the “open files” limit (the number of simultaneously open files a process can have) sometimes comes into play. Try to close files when you’ve finished using them.
- Using too much memory – maybe you have a “memory leak”. Each time your program goes round a loop it may ask for more memory until finally there’s no more memory left. Try to free memory that you no longer need. You can use the “top” program to monitor memory usage.
- The machine was rebooted – to see how long the machine has been running since it was last rebooted, type “uptime”.
- There’s a bug in your code that’s only triggered after a certain number of iterations or when arrays reach a certain size (because of an unexpected divide-by-zero, or a variable value that becomes bigger than can fit in a variable of that type, etc).
Signals are messages that are sent to processes. Typing “man 7 signal” will show you a list of them. If your process receives a “SIGSEGV” signal, for example, then that generally means a pointer has gone wrong (it’s tried to access a piece of memory it’s not allowed to) and typically indicates a code bug (most frequently trying to dereference a null pointer). Some signals (e.g. “SIGINT”) can be ignored if you choose to do so, but “SIGKILL” and “SIGSTOP” can’t be caught or ignored: SIGKILL always terminates your program and SIGSTOP always suspends it. It’s possible to add a signal handler to your code to deal with signals. Even if you can’t protect your program from being stopped, you might be able to record why it stopped.
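A signal handler that records why a job stopped might look something like the sketch below, written here as a shell wrapper using the trap builtin. The log file name signal.log and the job function are hypothetical; the script sends SIGTERM to its own background job purely to demonstrate the handler.

```shell
#!/bin/sh
# Hypothetical log file name, for illustration only.
LOG=signal.log

job() {
    # Record the signal, then exit with the conventional status
    # 128 + signal number (SIGTERM is signal 15, hence 143).
    trap 'echo "$(date): received SIGTERM" >> "$LOG"; exit 143' TERM
    # SIGKILL and SIGSTOP cannot be trapped, so they get no handler.

    # Stand-in for real work. Sleeping in short steps lets the shell
    # run a pending trap promptly once the current sleep finishes.
    while true; do
        sleep 1
    done
}

job &                 # run the job in a background subshell
pid=$!

sleep 1               # give the job time to install its handler
kill -TERM "$pid"     # demonstration only: ask the job to stop
wait "$pid"
echo "job exited with status $?"
cat "$LOG"
```

Programs written in other languages usually offer an equivalent mechanism; the point is simply that the handler writes something down before the process ends.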