IRCF BioCluster

Before using the BioCluster

Summary

The IRCF BioCluster is a Beowulf-style cluster designed to support computational biology, particularly large-scale genome and transcriptome assemblies and other “next-generation” and high-throughput analyses.

A general overview presentation: genomescalecomputing-ircf-biocluster.pdf.

How to Get an Account

To get an account on the IRCF BioCluster, contact Scott Givan.

How to Connect

Currently, the preferred way to connect to the BioCluster is with a Secure Shell (SSH) client. On a Windows machine, we recommend MobaXterm or PuTTY. Note that if you are affiliated with the University of Missouri you can download the Educational version of MobaXterm preconfigured to connect to the research clusters on campus (to be used only for research, teaching, or student purposes by MU students or staff; thanks to Research Computing Support Services). On a Mac, we recommend either the built-in Terminal or iTerm2.

For all publicly-accessible machines, establish an SSH connection to the specified IP address on port 734. For example, from the command line a connection to stahl would look like:

ssh -p 734 username@stahl.ircf.missouri.edu

Note: Although several machines listed below are publicly accessible, the main conduit to the infrastructure is stahl. Connections to the other machines are intended only for special purposes, e.g., transferring large datasets to franklin.

Technical Details

A current list of hardware includes:

Name          IP                           Public  Description1        Scratch Disk  Main Activities
stahl         stahl.ircf.missouri.edu      Y       2P/16C/32GB         n/a           primary logins
luria         luria.ircf.missouri.edu      Y       2P/16C/128GB        n/a           web and DB services
franklin      franklin.ircf.missouri.edu   Y       2P/8C/32GB/~100TB   n/a           NFS
compute-0-1   private                      N       4P/64C/512GB        2TB           compute node
compute-0-2   private                      N       4P/24C/1TB          1TB           compute node
compute-0-7   private                      N       4P/64C/512GB        2TB           compute node
compute-0-11  private                      N       4P/64C/512GB        2TB           compute node
compute-0-16  private                      N       4P/64C/512GB        2TB           compute node

1 format: number of processors/cores/amount of RAM/shared disk space (if applicable)

Transferring Files to the BioCluster

Lots of Small Files

In this case a “small file” is a file that consumes less than 100MB of disk space.

SFTP

To transfer files onto the BioCluster, use an SFTP client. Most SSH clients also include an SFTP mode, and browser plugins such as FireFTP for Firefox can also be used. When establishing the SFTP connection, use your normal login credentials. Once your connection is established, you will be in your home directory.
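
For example, a minimal command-line SFTP session from a Mac or Linux terminal might look like the following, assuming the OpenSSH sftp client and the same port 734 used for SSH connections (the file name is just a placeholder):

# connect to stahl over SFTP on port 734
sftp -P 734 username@stahl.ircf.missouri.edu
# at the sftp> prompt, upload a file to your home directory:
#   put myreads.fastq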

rsync

If you have many files to transfer (for example, if you have been working for some time on another infrastructure and would like to move your files to the BioCluster), you can use rsync. rsync is a command-line tool typically installed on Linux machines. After SSH'ing to stahl, a simple way to transfer all of your files from Lewis onto the BioCluster is:

rsync -avh -e ssh username@lewis.rnet.missouri.edu:/home/username ./

You can also transfer only specific directories:

rsync -avh -e ssh username@lewis.rnet.missouri.edu:/home/username/data ./data

Given the above commands, rsync will traverse the file structure recursively and transfer only the files that differ between the source (lewis) and the target (stahl). So, you can run rsync at different times and only the subset of files that have changed on the source will be transferred to the target.
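
If you would like to preview what rsync will transfer before committing to it, you can add the -n (dry run) flag; this is a sketch using the same paths as the examples above:

# dry run: report what would be transferred without copying anything
rsync -avhn -e ssh username@lewis.rnet.missouri.edu:/home/username/data ./data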

Large Files

In this case, a “large file” is a file that consumes more than 100MB of disk space.

Note that large data files should be put in the “data” directory, which is described below.

To transfer large files, establish an SFTP connection directly to the file server, franklin.ircf.missouri.edu, using the method described in the How to Connect section above, but substitute franklin.ircf.missouri.edu for stahl.ircf.missouri.edu. After you connect, you should see the same file structure you would normally see after connecting to stahl. Franklin is a machine specialized for maintaining and managing disk arrays; you cannot submit jobs to the cluster or run jobs directly on Franklin.
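
For example, assuming the OpenSSH sftp client, a direct connection to franklin would look like the following (the file name is just a placeholder):

# connect directly to the file server for large transfers
sftp -P 734 username@franklin.ircf.missouri.edu
# at the sftp> prompt, change into your data directory and upload the file:
#   cd data
#   put large_dataset.fastq.gz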

General Overview of the File System

Home and Data Directories

IRCF Disk Space

Home (e.g., /home/sgivan) and data (e.g., /home/sgivan/data) directories are available on all machines in the BioCluster. These directories are maintained via NFS connections to franklin. Currently, there is 3.6TB of RAID10 disk space available for home directories and 30TB of RAID6 disk space available for data directories. Every user has a data directory in their home directory, and all large datasets should be put there. There are currently no quotas in place, but disk usage will be monitored and users who consume unusually large amounts of space will be contacted. Legitimate use of the resources will be encouraged and, if necessary, enforced.

UPDATE: please see the new IRCF BioCluster Disk Usage Policies and Guidelines.

Files in home directories are backed up nightly, while files in data directories are not.

User Contributed Disk Space

Some research groups may elect to purchase their own disk space to use within the BioCluster. In these situations, the data directory in the home directories of users within that research group will point to the purchased disk space. The IRCF does not place quotas on this disk space, nor do we actively monitor it. The members of the research group should collectively manage the disk space to best serve their research needs.

For users who have purchased disk space, there will be an additional directory within their home directories called data_ircf. This is a symbolic link to the IRCF disk space, explained above. The IRCF space is still available to these users, but should only be used temporarily. IRCF projects will have priority for the space.

Scratch Space

Each compute node has at least 1TB of local scratch space available at /mnt/scratch. Users working with large data sets are encouraged to use this space, since it will generally provide higher I/O throughput and avoid network bottlenecks to the NFS storage. After a data set is analyzed, the results should be moved back to the user's data directory.
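
A typical pattern inside a job script might look like the sketch below; the per-user subdirectory and the project and file names are conventions assumed here, not paths that already exist on the nodes:

# copy input data from NFS storage to local scratch on the compute node
mkdir -p /mnt/scratch/$USER/myproject
cp ~/data/myproject/reads.fastq /mnt/scratch/$USER/myproject/
cd /mnt/scratch/$USER/myproject

# ... run the analysis against the local copy here ...

# copy results back to your data directory and clean up scratch
cp results.txt ~/data/myproject/
rm -rf /mnt/scratch/$USER/myproject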

Monitoring Computational Activity on the BioCluster

Ganglia

Many parameters of the activity on the cluster can be viewed via a web interface called Ganglia. This can be accessed via the link https://stahl.ircf.missouri.edu/ganglia/.

Command line

From the command line, you can run top on a specific node:

/ircf/ircfapps/bin/lstop compute-0-1

The above command will run top on compute node compute-0-1. You can see which node your job is running on by running bjobs -w.
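
For example, a quick two-step check of a running job might look like this (reusing a compute node name from the hardware table above):

# list your jobs in wide format; the EXEC_HOST column shows the node running each job
bjobs -w
# then run top on that node, e.g.:
/ircf/ircfapps/bin/lstop compute-0-2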

Announcements

Announcements relevant to the BioCluster will be posted on the BioCluster website.

Processor Intensive Jobs

Run Processor Intensive Jobs on a Compute Node

Processor-intensive jobs should not be run on stahl; instead, submit them to a compute node, as described in the next section. The login node, stahl, is to be used for staging jobs, editing files, and general administrative purposes. Some limited job testing is allowed on stahl, but prolonged processor-intensive jobs will be killed.

Submitting Jobs to the BioCluster: Openlava

The BioCluster uses the openlava software for Distributed Resource Management (DRM).

If you aren't familiar with the openlava commands, please refer to the Openlava Documentation. You can also refer to the man pages for each command; for example, to learn more about bsub, type man bsub.

In the BioCluster, one of the most important parameters to consider is the memory requirement of a particular job. It is not unusual for a genome or transcriptome assembly to require hundreds of GB of RAM, so it is important to do some preliminary testing to estimate the RAM a job will need and to specify that requirement to openlava so the software can send the job to a machine with enough memory to support it. The training presentation, linked above, explains several ways to include job “Resource Requirements”. Typically, the resource requirements are specified in one of two ways:

Resource Requirements

A Non-parallel Job in a bsub Script

This is a typical bsub script that submits a job requesting a single processor core (feel free to copy the text below or download the file):

non-parallel-bsub.txt
#BSUB -J TEST
#BSUB -o TEST.o%J
#BSUB -e TEST.e%J
#BSUB -q normal
#BSUB -n 1
#BSUB -R "rusage[mem=940]"
echo 'Hello world' > test.txt

In this script, the #BSUB lines are passing parameters to openlava. Their meanings are:

  • #BSUB -J TEST - the name of this job will be 'TEST'
  • #BSUB -o TEST.o%J - send STDOUT to a file named 'TEST.o' followed by an integer representing the job ID
  • #BSUB -e TEST.e%J - send STDERR to a file named 'TEST.e' followed by an integer representing the job ID
  • #BSUB -q normal - submit job to the 'normal' queue
  • #BSUB -n 1 - this job requests a single processor core
  • #BSUB -R "rusage[mem=940]" - this line uses the -R flag to specify that this job requests 940 megabytes of RAM per processor core

So, one of the most important lines contains rusage[mem=940], which tells openlava to send the job only to a machine that can reserve 940MB of RAM for it. Currently, four compute nodes in the BioCluster have 512GB of RAM and one has 1TB, so the job may be submitted to any node.
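
To submit this script to the queue, redirect it into bsub (the file name matches the script shown above):

# submit the script to openlava and check its status
bsub < non-parallel-bsub.txt
bjobs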

If you want to use more than one processor core, please read the next section carefully so that your job doesn't incorrectly split itself across nodes and doesn't accidentally request too much memory.

A Parallel Job in a bsub Script

This is a typical bsub script that submits a job that requests multiple processor cores (feel free to copy and use it):

parallel-bsub.txt
#BSUB -J TEST
#BSUB -o TEST.o%J
#BSUB -e TEST.e%J
#BSUB -q normal
#BSUB -n 4
#BSUB -R "rusage[mem=940] span[hosts=1]"
echo 'Hello world' > test.txt

In this script, the #BSUB lines are passing parameters to openlava. Their meanings are:

  • #BSUB -J TEST - the name of this job will be 'TEST'
  • #BSUB -o TEST.o%J - send STDOUT to a file named 'TEST.o' followed by an integer representing the job ID
  • #BSUB -e TEST.e%J - send STDERR to a file named 'TEST.e' followed by an integer representing the job ID
  • #BSUB -q normal - submit job to the 'normal' queue
  • #BSUB -n 4 - this job requests four processor cores
  • #BSUB -R "rusage[mem=940] span[hosts=1]" - this line uses the -R flag to specify that this job requests 940 megabytes of RAM (rusage[mem=940]) and should run on a single physical host (span[hosts=1]) as opposed to spreading the 4 job slots across multiple physical hosts

So, one of the most important lines contains rusage[mem=940], which tells openlava to send the job only to machines that can reserve 940MB of RAM per slot. The total amount of RAM requested for the entire job is therefore 940MB X 4 slots = 3.76GB. Currently, all of the compute nodes in the BioCluster have at least 512GB of RAM, so any of them could accommodate this request.
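
As with the non-parallel example, the script can be submitted by redirecting it into bsub; replace <jobID> below with the job ID that bsub reports:

# submit the parallel script to openlava
bsub < parallel-bsub.txt
# confirm the slot count and memory reservation with the long-format listing
bjobs -l <jobID>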

On the Command Line

Almost anything that can be included in the bsub script, above, can also be included on the command line as flags to bsub. For example, an equivalent command to the script would be:

bsub -R "rusage[mem=940000] span[hosts=1]" -J TEST -o TEST.o%J -e TEST.e%J -q normal -n 4 "echo 'Hello world' > test.txt"

RAM Conversion

The following table illustrates how to specify different amounts of RAM for Non-parallel (n=1) and Parallel (n > 1) jobs:

n=*  mem=     Megabyte (MB)  Gigabyte (GB)  Terabyte (TB)  Notes
1    1        1              0.001          0.000001
1    100      100            0.1            0.0001
1    1000     1000           1              0.001
1    5000     5000           5              0.005
1    1000000  1000000        1000           1
2    100000   200000         200            0.2
4    100000   400000         400            0.4
8    100000   800000         800            0.8            can only run on the highest-memory node
16   100000   1600000        1600           1.6            too large to run
* If not specified, n defaults to 1

The mem= column contains the value to use in a bsub submission. For example, to request 5GB of RAM in a non-parallel bsub submission, include -R "rusage[mem=5000]" either in a bsub script or as a command-line flag to bsub. Remember: for a parallel bsub submission, the RAM is specified per slot. So, consider the following lines in a bsub script:

#BSUB -n 4
#BSUB -R "rusage[mem=2000] span[hosts=1]"

This requests 4 slots on a single physical compute node, reserving 2GB (2000MB) of RAM per slot. The total amount of memory requested is 2GB X 4 slots = 8GB of RAM. The job will only be submitted to a compute node that can accommodate the request.
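
As a sketch of the command-line equivalent of those two script lines (the job name and the trailing command are just placeholders):

bsub -n 4 -R "rusage[mem=2000] span[hosts=1]" -q normal -J TEST8GB "echo 'Hello world' > test.txt"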

Also, even though several nodes in the BioCluster have 512GB of RAM, try not to request a full 512GB (or 1TB on the largest node): those nodes would be ineligible, because a portion of their RAM is reserved for the operating system.

Removing Jobs from the BioCluster

If you've submitted a job to the queue and you want to delete it, use the openlava bkill command. An easy way is to use the job ID, which you can get from the bjobs command. In the example below, I use bjobs to get the job ID and bkill to kill it.

stahl ~/$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
485219  sgivan  RUN   normal     stahl       compute-0-2 my_job     Apr 14 09:46
                                             compute-0-2
stahl ~/$ bkill 485219
Job <485219> is being terminated
stahl ~/$

Software

Next-Gen Sequence Analysis

Most of the work the IRCF staff does for clients deals with Next-Generation sequence data, especially as generated by the Illumina DNA sequencers. As such, there is a wide variety of Next-Gen sequence analysis software available on the BioCluster. If you can't find what you need in /usr/bin/, /usr/local/bin/ or /ircf/ircfapps/bin/, please contact Scott Givan.

See also: /tech/short_read_mappers

NCBI Toolkit

The binaries of the NCBI Toolkit are available throughout the cluster. The older C binaries (BLAST), like blastall, formatdb, and fastacmd, are available in /opt/bio/ncbi/bin/, which should be in your default path. The newer C++ binaries (BLAST+), like blastn, blastp, makeblastdb, and blastdbcmd, are available in /ircf/ircfapps/bin/, which should also be in your default path. If you are just getting started with these programs, you should probably use BLAST+.

BLAST+ documentation
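
As a quick illustration of the BLAST+ style of invocation (a minimal sketch; the FASTA file and database names are placeholders, not files that already exist on the cluster):

# build a nucleotide BLAST database from a FASTA file
makeblastdb -in reference.fasta -dbtype nucl -out reference_db
# search query sequences against it, writing tabular output
blastn -query queries.fasta -db reference_db -out results.tsv -outfmt 6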

The Module System

There are many custom software packages and alternative versions of default software available on the BioCluster. A relatively easy way to take advantage of these resources is through the Linux module command. To see all the software available via the module command, run module avail. The list below is from early 2016; new software is continuously added and updated:

[02/22/16 14:17:00] stahl ~/$ module avail

------------- /usr/share/Modules/modulefiles----------------------
Augustus-3.2.1               SPAdes                       ircf-apps
EMAN2-daily-2015-12-15       ShortStack                   java_1.6
EMAN2.1                      Tassel5/tassel-5.2.20        java_1.7
EMAN2.1-daily                TopHat-2.0.13                java_1.8
EMAN2.12                     TopHat-2.1.0                 jellyfish
EMIRGE                       Trinity                      masai
FRC                          VelvetOptimiser              miRanda
FastQC                       ViennaRNA                    module-cvs
GBS/tassel-5.2.15-standalone Yara                         module-info
GMAP                         abyss                        modules
IMAGE                        bcftools                     mummer
KmerGenie                    bcftools-1.2                 novoalign
LASER                        bcftools-1.3                 null
LEfSe                        bedtools                     octave-3.2.4
MACS                         bga                          octave-3.8.1
MEME                         bioperl                      pcap
MaSuRCA                      clusterblast                 quake
MakeSpace                    cogtools                     quantdisplay
NCBI-blast                   crux                         quast
PICRUSt                      ctffind                      quast-3.2
Perl5-vcftools               cufflinks-2.2.1              razers3
PhyloSeqs                    cutadapt                     relion
Python-2.7                   dot                          relion-1.3
Python-opt                   eigensoft                    relion-1.4
Python-shared                etomo                        rocks-openmpi
Qiime-1.8                    eval                         rocks-openmpi_ib
Qiime-1.9                    fasta                        rst
Qiime-1.9.1                  frealign                     samtools-1.2
Qiime-1.9.1-working          gem                          samtools-1.3
R-3.1.0                      gnu                          simple
R-3.2.2-sharedlib            gnu-compile                  ssu-align
R-sharedlib                  gnu-runtime                  swig
RACER                        hapFLK                       texinfo
Ray                          hmmer                        use.own
RepeatMasker                 htslib-1.2.1                 wgs
SHRiMP                       htslib-1.3

--------------- /etc/modulefiles--------------------------------------

compat-openmpi-x86_64 mpich-x86_64          openmpi-x86_64

There are a variety of ways to filter the list of software reported by the module command. For example, if you want to list versions of bowtie available, run module avail bowtie. To load a specific module, use the module load command. For example, to use PICRUSt, run module load PICRUSt. To remove PICRUSt from your path, run: module unload PICRUSt.
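
Modules can also be loaded inside a bsub script so that the job finds the right software on the compute node. Below is a minimal sketch combining the bsub script format from above with the NCBI-blast module from the list; it assumes the module command is initialized in batch shells as it is in interactive logins, and the BLAST file names are placeholders:

#BSUB -J BLASTJOB
#BSUB -o BLASTJOB.o%J
#BSUB -e BLASTJOB.e%J
#BSUB -q normal
#BSUB -n 1
#BSUB -R "rusage[mem=4000]"
# load the NCBI-blast module, then run a placeholder BLAST+ search
module load NCBI-blast
blastn -query queries.fasta -db reference_db -out results.tsv -outfmt 6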