Grid and High-Performance Computing for Applied Bioinformatics

Jorge Andrade
Royal Institute of Technology, School of Biotechnology
Stockholm, 2007

School of Biotechnology, Royal Institute of Technology
AlbaNova University Center, Stockholm, Sweden
Printed at Universitetsservice US AB, Stockholm
ISBN  TRITA-BIO-Report  ISSN

Jorge Andrade (2007). Grid and High-Performance Computing for Applied Bioinformatics. Department of Gene Technology, School of Biotechnology, Royal Institute of Technology (KTH), Stockholm, Sweden.

ABSTRACT

The beginning of the twenty-first century has been characterized by an explosion of biological information. The avalanche of data grows daily and arises as a consequence of advances in the fields of molecular biology, genomics and proteomics. The challenge for today's biologists lies in decoding this huge and complex body of data in order to achieve a better understanding of how our genes shape who we are, how our genome evolved, and how we function. Without annotation and data mining, the information provided by, for example, high-throughput genomic sequencing projects is of limited use. Bioinformatics is the application of computer science and technology to the management and analysis of biological data, in an effort to address biological questions.

The work presented in this thesis has focused on the use of grid and high-performance computing for solving computationally expensive bioinformatics tasks, where, due to the very large amount of available data and the complexity of the tasks, new solutions are required for efficient data analysis and interpretation. Three major research topics are addressed. First, the use of grids for distributing the execution of sequence-based proteomic analyses, and its application to optimal epitope selection and to a proteome-wide effort to map the linear epitopes in the human proteome. Second, the application of grid technology in genetic association studies, which enabled the analysis of thousands of simulated genotypes. Finally, the development and application of an economics-based model for grid job scheduling and resource administration. The applications of the grid-based technology developed in the present investigation resulted in the successful tagging and linkage of chromosomal regions in Alzheimer's disease, a proteome-wide mapping of linear epitopes, and the development of a market-based resource allocation system for scientific applications on computational grids.

Keywords: Grid computing, bioinformatics, genomics, proteomics.

LIST OF PUBLICATIONS

This thesis is based on the papers listed below, which will be referred to by their Roman numerals.

I. Jorge Andrade*, Lisa Berglund*, Mathias Uhlén and Jacob Odeberg. Using Grid Technology for Computationally Intensive Applied Bioinformatics Analyses. In Silico Biology 6 (2006), IOS Press.

II. Lisa Berglund*, Jorge Andrade*, Jacob Odeberg and Mathias Uhlén. The linear epitope space of the human proteome (2007). Submitted.

III. Jorge Andrade, Malin Andersen, Anna Sillén, Caroline Graff, Jacob Odeberg. The use of grid computing to drive data-intensive genetic research. Eur. J. Hum. Genet. (2007) 15.
IV. Anna Sillén, Jorge Andrade, Lena Lilius, Charlotte Forsell, Karin Axelman, Jacob Odeberg, Bengt Winblad and Caroline Graff. Expanded high-resolution genetic study of 109 Swedish families with Alzheimer's disease. Eur. J. Hum. Genet. (2007).

V. Thomas Sandholm, Jorge Andrade, Jacob Odeberg, Kevin Lai. Market-Based Resource Allocation using Price Prediction in a High Performance Computing Grid for Scientific Applications. Proceedings of the 15th IEEE International Symposium on High Performance Distributed Computing (2006).

* These authors contributed equally to the work.

Related publications

1. Mercke Odeberg J, Andrade J, Holmberg K, Hoglund P, Malmqvist U, Odeberg J. UGT1A polymorphisms in a Swedish cohort and a human diversity panel, and the relation to bilirubin plasma levels in males and females. European Journal of Clinical Pharmacology (2006).

2. Andrade J, Andersen M, Berglund L, Odeberg J. Applications of Grid computing in genetics and proteomics. Proceedings of the PARA06 workshop on state-of-the-art in scientific and parallel computing, Springer Lecture Notes in Computer Science (LNCS).

Articles printed with permission from the respective publisher.

TABLE OF CONTENTS

I INTRODUCTION
1. INTRODUCTION
   An explosion of biological information
   Computer science in biology - Bioinformatics
2. EXAMPLES OF BIOINFORMATICS APPLICATION RESEARCH IN BIOLOGY
   Genomic and proteomics databases
   Analysis of gene expression
   Analysis of protein levels
   Prediction of protein structure
   Protein-protein docking
   High-throughput image analysis
   Simulation based linkage and association studies
   Systems biology
3. COMPUTATIONAL CHALLENGES IN BIOINFORMATICS
   The problem of growing size
   The problem of storage and data distribution
   The problem of data complexity
EMERGING DISTRIBUTED COMPUTING TECHNOLOGIES
   An introduction to grid computing
   Virtual Organizations
   Examples of Computational Grids
      The European DataGrid
      The Enabling Grids for E-sciencE project (EGEE)
      NorduGrid / SwedGrid
      The TeraGrid project
      The Open Science Grid
   Software Technologies for the Grid
      Globus
      Condor
   Models for Grid Resource Management and Job Scheduling
      GRAM (Grid Resource Allocation Manager)
      Economic-based Grid Resource Management and Scheduling
   Grid-based initiatives approaching applied bioinformatics

II PRESENT INVESTIGATION
APPLICATIONS OF GRID TECHNOLOGY IN PROTEOMICS (PAPER I AND II)
   Grid technology applied to sequence similarity searches (Grid-Blast)
   Grid based proteomic similarity searches using non-heuristic algorithms
APPLICATIONS OF GRID TECHNOLOGY IN GENETICS (PAPER III AND IV)
   Grid technology applied to genetic association studies (Grid-Allegro)
   Genetic study of 109 Swedish families with Alzheimer's disease
RESOURCE ALLOCATION IN GRID COMPUTING (PAPER V)
   Market-Based Resource Allocation in Grid
FUTURE PERSPECTIVES
ABBREVIATIONS
ACKNOWLEDGEMENTS
REFERENCES

I INTRODUCTION

Chapter 1

1. Introduction

1.1 An explosion of biological information

The beginning of the twenty-first century can be characterized by an explosion of biological information. The avalanche of data grows daily and arises as a consequence of advances in the fields of molecular biology and genomics. Genetic information is encoded and stored in the nucleus of cells, the fundamental working units of every living system. All the instructions needed to direct their activities are contained within the chemical DNA.
As stated in the central dogma of molecular biology (Figure 1), genetic information flows from genes, via RNA, to proteins.

Figure 1. Diagram of the central dogma, from DNA to RNA to protein, illustrating the genetic code.

Proteins perform most of the cellular functions and constitute the majority of the cellular structures. Proteins are often large, complex molecules made up of smaller polymerized subunits called amino acids. Chemical properties that distinguish the twenty different amino acids cause protein chains to fold up into specific three-dimensional structures that define their particular functions in the cell. Studies exploring protein structure and activities, known as proteomics, will be the focus of much research for decades to come, in order to elucidate and understand the molecular basis of health and disease.

1.2 Computer science in biology - Bioinformatics

The challenge for today's biologists lies in decoding this huge and complex body of data from the biological language, in order to better understand how our genes shape who we are, how our genome evolved, and how we function. Without annotation and detailed data mining, the information provided by high-throughput genomic sequencing projects is of limited use. Bioinformatics is an interdisciplinary research area that draws on applied mathematics, informatics, statistics, computer science, artificial intelligence, chemistry, and biochemistry to solve biological problems, usually at the molecular level. The ultimate goal of bioinformatics is to uncover and decipher the richness of biological information hidden in the mass of data and to obtain a clearer insight into the fundamental biology of organisms.

Chapter 2

2. Examples of bioinformatics application research in biology

2.1 Genomic and proteomics databases

Since the sequencing of the first organism (the Phi-X174 phage) by Fred Sanger and his team in 1977 (Sanger, Air et al. 1977), the DNA sequences of hundreds of organisms have been decoded and stored in genomic databases (Galperin 2007; Hutchison 2007). Sequence analysis in molecular biology and bioinformatics is the automated, computer-based examination and characterization of DNA sequences. For the human genome, the genomic sequence data is easily accessible to non-bioinformaticians through genome browsers such as the UCSC genome browser (http://genome.ucsc.edu/), where information from sequence-based bioinformatics analyses, together with annotated sequence-based experimental data, is mapped onto the genomic sequence and can be easily navigated, with links to complementary databases and information sources. The protein databases are populated with the results of classical protein research as well as with predictions computed from genomic information, and a variety of such databases exist. UniProt (http://www.ebi.uniprot.org) is a comprehensive catalog containing protein sequence-related information from several sources and databases.
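To illustrate how such catalog entries can be pulled into an analysis pipeline programmatically, the short Python sketch below downloads a single protein record in FASTA format and extracts its sequence. The REST-style URL, the example accession P05067 and the helper name fetch_fasta are illustrative assumptions (they reflect UniProt's present-day interface, not the interface available when this thesis was written).

```python
# A minimal sketch of programmatic access to a protein sequence database.
# The endpoint below is an assumption based on UniProt's current REST
# interface; P05067 is used only as an example accession.
from urllib.request import urlopen

def fetch_fasta(accession):
    """Download one UniProtKB entry in FASTA format and return (header, sequence)."""
    url = f"https://rest.uniprot.org/uniprotkb/{accession}.fasta"
    with urlopen(url) as response:
        lines = response.read().decode("utf-8").splitlines()
    header = lines[0].lstrip(">")          # first line is the FASTA header
    sequence = "".join(lines[1:])          # remaining lines hold the residues
    return header, sequence

if __name__ == "__main__":
    header, seq = fetch_fasta("P05067")
    print(header)
    print(f"{len(seq)} amino acid residues")
```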
Proteomics databases containing data collected in proteomics experiments include, for example, PeptideAtlas (http://www.peptideatlas.org), the Open Proteomics Database OPD (http://bioinformatics.icmb.utexas.edu/opd), the Global Proteome Machine GPM (http://www.thegpm.org), the Human Protein Atlas (http://www.proteinatlas.org), World-2DPAGE (http://www.expasy.ch/world-2dpage/) containing experimental 2D gels, the Protein Data Bank PDB (http://www.rcsb.org/pdb/home/home.do) containing three-dimensional structures of proteins, and databases such as BioGRID, the Biological General Repository for Interaction Datasets (http://www.thebiogrid.org), containing information on protein-protein interactions. Furthermore, the Gene Ontology is a database of terms that classify protein functions, processes and sub-cellular locations, Online Mendelian Inheritance in Man OMIM (www.ncbi.nlm.nih.gov/omim/) relates proteins and genes to established roles in different diseases, and finally PubMed (http://www.ncbi.nlm.nih.gov/) indexes the published research articles in biology and biomedical research. These are examples of the data sources and tools publicly available to genomics and proteomics research, and their heterogeneous information content, storage structures and search functions make integrative bioinformatic analysis and data mining difficult at present, and a major challenge for developers of tools for applied bioinformatics.

2.3 Analysis of gene expression

The expression of many genes can be determined by measuring mRNA levels with multiple techniques, including microarrays, expressed cDNA sequence tag (EST) sequencing, serial analysis of gene expression (SAGE) tag sequencing, massively parallel signature sequencing (MPSS), and various applications of multiplexed in-situ hybridization. Most of these techniques generate data with a high noise-level component and may also be biased in the biological measurements, and a major research area of bioinformatics is therefore the development of statistical tools and methods to separate signal from noise in such high-throughput gene expression studies (Pehkonen, Wong et al. 2005; Miller, Ridker et al. 2007; Nie, Wu et al. 2007). Such studies may, for example, be used to determine the genes implicated in a certain medical disorder: one might compare microarray data from cancerous epithelial cells to data from non-malignant cells to determine which transcripts are up-regulated and down-regulated in a particular population of cancer cells (Fournier, Martin et al. 2006).

2.4 Analysis of protein levels

Protein microarrays and high-throughput mass spectrometry techniques provide a snapshot of the proteins present in a biological sample. Bioinformatic tools are developed and applied to make sense of protein microarray and mass spectrometry data; this approach faces problems similar to those of microarrays targeting mRNA levels. One problem is to match large amounts of observed protein mass data against predicted masses from protein sequence databases, and to carry out statistical analysis of samples where multiple, but incomplete, peptides from each protein are detected.
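As a rough illustration of the mass-matching problem described above, the sketch below performs a naive in-silico tryptic digest of a protein sequence and compares the predicted peptide masses with a list of observed masses within a fixed tolerance. The residue masses are approximate monoisotopic values, the toy sequence and function names are illustrative assumptions, and this is not the pipeline used in the thesis.

```python
# A minimal sketch (not the thesis pipeline): predict tryptic peptide masses
# from a protein sequence and match them against observed masses.

# Approximate monoisotopic residue masses in Daltons.
RESIDUE_MASS = {
    "G": 57.021, "A": 71.037, "S": 87.032, "P": 97.053, "V": 99.068,
    "T": 101.048, "C": 103.009, "L": 113.084, "I": 113.084, "N": 114.043,
    "D": 115.027, "Q": 128.059, "K": 128.095, "E": 129.043, "M": 131.040,
    "H": 137.059, "F": 147.068, "R": 156.101, "Y": 163.063, "W": 186.079,
}
WATER = 18.011  # mass of H2O added to every free peptide

def tryptic_peptides(sequence):
    """Cleave after K or R, except when the next residue is P (trypsin rule)."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        if residue in "KR" and (i + 1 == len(sequence) or sequence[i + 1] != "P"):
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides

def peptide_mass(peptide):
    return sum(RESIDUE_MASS[res] for res in peptide) + WATER

def match_masses(observed, sequence, tolerance=0.5):
    """Pair each observed mass with predicted peptides agreeing within tolerance (Da)."""
    predicted = {pep: peptide_mass(pep) for pep in tryptic_peptides(sequence)}
    return [(obs, pep) for obs in observed
            for pep, mass in predicted.items() if abs(obs - mass) <= tolerance]

if __name__ == "__main__":
    protein = "MKWVTFISLLLLFSSAYSRGVFRR"   # toy sequence, not a real database entry
    print(match_masses([477.3, 277.1], protein))
```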
2.5 Prediction of protein structure

Protein structure prediction is another important application of bioinformatics (Godzik, Jambon et al. 2007). The amino acid sequence of a protein, also called its primary structure, can be easily determined from the sequence of the gene that encodes it. In the majority of cases, this primary structure uniquely determines a structure in the protein's native environment. Knowledge of this structure is vital in understanding the function of the protein. Structural information is usually classified as secondary, tertiary or quaternary structure. A viable general solution to such predictions remains a challenge for bioinformatics, and most efforts have been focused on developing heuristic methods that work most of the time. Using these methods it is possible to use homology to predict the function of a gene: if the sequence of gene A, whose function is known, is homologous to the sequence of gene B, whose function is unknown, one may infer that B shares A's function. In a similar way, homology modeling is used to predict the structure of a protein once the structure of a homologous protein is known.

2.6 Protein-protein docking

During the last two decades, thousands of three-dimensional protein structures have been determined by X-ray crystallography and protein nuclear magnetic resonance spectroscopy (protein NMR) (Carugo 2007). One central challenge for bioinformaticians is the prediction of protein-protein interactions based on these three-dimensional structures, without carrying out experimental protein-protein interaction studies. A variety of methods have been developed to address the protein-protein docking problem (Law, Hotchko et al. 2005; Bernauer, Aze et al. 2007), but much work remains to be done in this field.

2.7 High-throughput image analysis

Another exciting research area involving bioinformatics is the use of computational technologies to accelerate or fully automate the processing, quantification and analysis of large amounts of high-information-content biomedical images (Chen and Murphy 2006; Zhao, Wu et al. 2006; Pan, Gurcan et al. 2007). Modern image analysis systems greatly facilitate the observer's ability to make measurements from a large or complex set of images, by improving accuracy, objectivity, and speed. A fully developed analysis system could possibly replace specialized observers completely in the future (Harder, Mora-Bermudez et al. 2006). Although such systems are not unique to the biomedical sciences, biomedical imaging is becoming increasingly important for both diagnostics and research.

2.8 Simulation based linkage and association studies

Linkage and association studies routinely involve analyzing a large number of genetic markers in many individuals to test for co-segregation or association of marker loci and disease. The use of simulated genotype datasets for standard case-control or affected/non-affected analyses allows considerable flexibility in generating and testing different disease models, potentially involving a large number of interacting loci (typically SNPs or microsatellites). The shift to dense SNP maps poses new problems for pedigree analysis packages such as Genehunter (Kruglyak, Daly et al. 1996) or Allegro (Gudbjartsson, Jonasson et al. 2000), which can handle arbitrarily many markers but are limited to 25-bit pedigrees. The bit size of a pedigree is 2n - f - g, where n is the number of non-founders, f is the number of founders and g is the number of un-genotyped founder couples.
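The bit-size rule quoted above is straightforward to evaluate. The small Python sketch below (illustrative only; the worked pedigree numbers are not taken from the thesis) computes it for a hypothetical pedigree and checks it against the 25-bit limit mentioned for Allegro and Genehunter.

```python
# A minimal sketch of the pedigree bit-size rule: bits = 2n - f - g,
# where n = non-founders, f = founders, g = un-genotyped founder couples.

BIT_LIMIT = 25  # limit of exact multipoint engines such as Allegro or Genehunter

def pedigree_bits(non_founders, founders, ungenotyped_founder_couples):
    return 2 * non_founders - founders - ungenotyped_founder_couples

# Hypothetical three-generation pedigree: 10 non-founders, 6 founders and
# 2 un-genotyped founder couples give 2*10 - 6 - 2 = 12 bits, within the limit.
bits = pedigree_bits(10, 6, 2)
print(bits, bits <= BIT_LIMIT)   # 12 True
```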
Association mapping studies using statistical methods require specialized inferential and computational techniques; when applied to genome-wide studies, the computational cost in terms of memory and CPU time grows exponentially with the sample size (pedigree size or number of markers).

2.9 Systems biology

Perhaps the biggest challenge in bioinformatics will arise in the integration of the previously described resources and methods, in the research field popularly called systems biology, which focuses on the systematic study of complex interactions between the components of a biological system and on how these interactions give rise to the function and behavior of that system (Snoep, Bruggeman et al. 2006; Sauer, Heinemann et al. 2007). Systems biology involves constructing mechanistic models based on data obtained through transcriptomics, metabolomics, proteomics and other high-throughput techniques using interdisciplinary tools, and the validation of these models. As an example, cellular networks may be mathematically modeled using methods from kinetics and control theory. Because of the large number of variables, parameters, and constraints in such networks, numerical and computational techniques are required. Other aspects of computer science and informatics are also used in systems biology, including the integration of experimentally derived data with information available in the public domain using information extraction and text mining techniques.

Taken together, the wide collection of problems where bioinformatics tools are required clearly highlights the importance of the field in modern science.

Chapter 3

3. Computational challenges in bioinformatics

3.1 The problem of growing size

The post-genomic era is characterised by an increasing amount of available data generated through studies of different organisms. Taking the human body as an example, experimental data is derived from a system consisting of 3 billion nucleotides, which in turn contain approximately 30,000 genes encoding more than 100,000 transcript variants translated into proteins with varying expression patterns in the 100 trillion cells of 300 different cell types, finally resulting in 14,000 distinguishable morphological structures in the human body. It is obvious that comprehensive studies of this complex system will require computational processing power and alternative paradigms for data integration, replication and organization.

3.2 The problem of storage and data distribution

Public biological databases are growing exponentially, exemplified by the growth of the GenBank sequence database: release 155, produced in August 2006, contained over 65 billion nucleotide bases in more than 61 million sequences (Benson and Wheeler 2006). Data formats are heterogeneous, geographically distributed, and stored in different database architectures (Goesmann, Linke et al. 2003). Biological data is very complex and interlinked, and to extract meaningful knowledge from one type of data, it has to be analyzed in the context of immediately re