2005 News Stories

GADU/GNARE Uses TeraGrid For Protein Sequence Analysis

Authors: Dr. Natalia Maltsev, Computational Biology Group (UC/ANL), and Rick Stevens (UC/ANL).
Web Site: http://compbio.mcs.anl.gov/

GADU/GNARE, the Genome Analysis and Database Update system developed at the Mathematics and Computer Science division of Argonne National Laboratory, has successfully used TeraGrid resources for the periodic high-throughput analysis of all publicly available protein sequences using bioinformatics tools (e.g., BLAST and Blocks). For example, the NCBI non-redundant protein database currently contains 2.3 million sequences. Analyzing this data with BLAST and Blocks requires on the order of 7 million processes.

A typical BLAST or Blocks workflow includes several steps: splitting the input file into smaller files that are submitted to individual nodes on the TeraGrid, executing the bioinformatics tool on each node, and parsing the results from each node. After all the nodes finish parsing, the results are concatenated and the final output is sent back to the submit host. The entire workflow on the Grid is managed by the GriPhyN Virtual Data System, using Condor-G and Globus. Figure 1 shows a DAG representing the BLAST workflow that was used to execute jobs on the TeraGrid.

The results of the analysis are then stored in a relational (Oracle) database, and the stored data is used to build different bioinformatics applications. PUMA2 is one such application. It contains the analysis of 1031 genomes pre-computed on TeraGrid and other grid resources. The results are used by algorithms for automated annotation of the sequence data and are displayed to the user for further interactive analysis. The results of these analyses are also used by other resources, including Pathos (Microbial Informatics Core for the NIH Great Lakes Center of Excellence in Biodefense and Emerging Infections), TarGet (NIH Midwest Structural Biology Center), MetaGenomes (DOE Microbial Genomes program), and others.
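
The split, execute, parse, and concatenate steps of the workflow described above can be illustrated with a minimal single-machine sketch in Python. This is not the GADU/GNARE implementation: the function names are hypothetical, the input is assumed to be in FASTA format, and `run_tool` is a stand-in that only reports each sequence's length where the real workflow would run BLAST or Blocks on a Grid node and parse its output.

```python
from typing import Iterable, Iterator, List


def read_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield one FASTA record (header line plus sequence lines) at a time."""
    record: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        if line.startswith(">") and record:
            yield "\n".join(record)  # a new header closes the previous record
            record = []
        record.append(line)
    if record:
        yield "\n".join(record)


def split_records(records: Iterable[str], chunk_size: int) -> Iterator[List[str]]:
    """Group records into chunks of at most chunk_size (one chunk per job)."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    chunk: List[str] = []
    for record in records:
        chunk.append(record)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk  # last, possibly smaller, chunk


def run_tool(chunk: List[str]) -> List[str]:
    """Hypothetical per-node step: stands in for running the tool and parsing."""
    results = []
    for record in chunk:
        header, *sequence_lines = record.split("\n")
        results.append(f"{header[1:]}\t{len(''.join(sequence_lines))}")
    return results


def run_workflow(fasta_text: str, chunk_size: int) -> str:
    """Split the input, process each chunk, and concatenate the parsed results."""
    records = read_fasta(fasta_text.splitlines())
    parsed = [run_tool(chunk) for chunk in split_records(records, chunk_size)]
    return "\n".join(line for part in parsed for line in part)
```

In this sketch the chunks are processed one after another in a single process. On the Grid each chunk would be an independent job, which is what allows the work to spread across many CPUs.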

Table 1 gives the statistics of a Blast run using TeraGrid and Grid3 resources:

BLAST database: 2,314,886 sequences.
One CPU (wall-clock time): 100 sequences took 66 minutes. At that rate, 2,314,886 sequences would take about 1,527,825 minutes (about 25,464 hours, or 1,061 days).
On the Grid (TeraGrid and Grid3): 2,314,886 sequences took 12,480 minutes (208 hours, or 8 days 16 hours). One genome (approximately 4,000 sequences) would therefore take about 22 minutes on the Grid.
CPUs used: the number of CPUs in use on the Grid at a given time varies with CPU availability and the maximum load the submit host can take. The maximum at any given time was 250, and the average was 200.
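
The estimates in these statistics follow from simple linear scaling, which the short Python sketch below reproduces. The constants are the figures quoted above; the variable and function names are illustrative, and the sketch assumes run time grows in proportion to the number of sequences.

```python
# Figures quoted in the run statistics.
TOTAL_SEQUENCES = 2_314_886   # sequences in the BLAST database
SAMPLE_SEQUENCES = 100        # size of the single-CPU timing sample
SAMPLE_MINUTES = 66           # wall-clock minutes for that sample on one CPU
GRID_MINUTES = 12_480         # wall-clock minutes for the full run on the Grid
GENOME_SEQUENCES = 4_000      # approximate size of one genome


def serial_minutes(n_sequences: int) -> float:
    """Estimated one-CPU time, scaled linearly from the 100-sequence sample."""
    return n_sequences * SAMPLE_MINUTES / SAMPLE_SEQUENCES


def grid_minutes(n_sequences: int) -> float:
    """Estimated Grid time, scaled linearly from the full-database run."""
    return n_sequences * GRID_MINUTES / TOTAL_SEQUENCES


serial_total = serial_minutes(TOTAL_SEQUENCES)
serial_days = serial_total / (60 * 24)
grid_days, grid_hours = divmod(GRID_MINUTES // 60, 24)
genome_minutes = grid_minutes(GENOME_SEQUENCES)
speedup = serial_total / GRID_MINUTES  # derived here, not quoted in the source

print(f"One CPU: {serial_total:.0f} min, about {serial_days:.0f} days")
print(f"Grid: {GRID_MINUTES} min = {grid_days} days {grid_hours} hours")
print(f"One genome on the Grid: about {genome_minutes:.0f} min")
print(f"Implied speedup: about {speedup:.0f}x")
```

The implied speedup of roughly 120 times is derived from the quoted numbers and is not a figure stated in the original statistics.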

Figure 1: DAG showing a workflow for BLAST used on TeraGrid.
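
Since the figure itself is not reproduced here, the following Python sketch shows one plausible shape for such a DAG, based only on the workflow steps described in the text: one split job, a BLAST job and a parse job per chunk, and a final concatenation job. The job names and the `build_blast_dag` helper are hypothetical, and the standard-library `graphlib` module (Python 3.9 or later) stands in for the scheduling done by the workflow system on the Grid.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def build_blast_dag(n_chunks: int) -> Dict[str, Set[str]]:
    """Map each job name to the set of jobs it depends on."""
    dag: Dict[str, Set[str]] = {"split": set()}
    for i in range(n_chunks):
        dag[f"blast_{i}"] = {"split"}          # each BLAST job needs its chunk
        dag[f"parse_{i}"] = {f"blast_{i}"}     # parsing follows its BLAST job
    # Concatenation waits for every parse job to finish.
    dag["concat"] = {f"parse_{i}" for i in range(n_chunks)}
    return dag


def execution_order(dag: Dict[str, Set[str]]) -> List[str]:
    """Return one valid order in which the jobs could be run."""
    return list(TopologicalSorter(dag).static_order())


if __name__ == "__main__":
    for job in execution_order(build_blast_dag(3)):
        print(job)
```

The per-chunk BLAST and parse jobs have no dependencies on each other, so a scheduler is free to run them in parallel on different nodes.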

Figure 2: GADU is the Genome Analysis and Database Update tool of the Mathematics and Computer Science (MCS) division at Argonne National Laboratory (ANL). It is an automated tool that periodically searches different DNA and protein databases for new and newly updated genomes of different organisms.

The TeraGrid project is funded by the National Science Foundation and includes 11 partners:
Indiana, LONI, NCAR, NCSA, NICS, ORNL, PSC, Purdue, SDSC, TACC and UC/ANL.
