Uppsala universitet




Department of Molecular Evolution

Overview in bioinformatics 2003




Bioinformatics - A New Multidisciplinary Tool

Author: Siv G. E. Andersson
Sources: David Benton: Bioinformatics - principles and potential of a new multidisciplinary tool. TIBS 14:261-272.


Table of contents

  • Aim of the course
  • What is bioinformatics?
  • Automated DNA sequencing
  • Data management
  • Making sense of the data
  • Predicting functions
  • Predicting structures
  • Molecular evolution and phylogenetic trees
  • Challenges in bioinformatics



Aim of the course

The aim of this course is to provide an overview of a new research field called bioinformatics. We are going to discuss general questions concerning the role of bioinformatics for the pharmaceutical industry, for academia and for society. The first three chapters will give an overview of the technological developments that have triggered the progress in bioinformatics that we witness today, and of the political discussions and decisions underlying these developments. The following three chapters will show how biological information is stored in huge, international databases and how this information can be accessed. Next, we will discuss the current status of bioinformatics in the Swedish pharmaceutical and biotechnological industries in comparison to the role of bioinformatics in academia. Finally, we will describe some major challenges that biologists will be facing during the next decade and areas in which bioinformatics is likely to play a critical role.


1.1 What is bioinformatics?

Let us first discuss the question: What is bioinformatics? The simple answer is that it is a new research field at the border of biology, mathematics and computer science. The materials of bioinformatics are biological data, and its methods are drawn from a wide variety of computational and mathematical techniques. These methods are necessary for research in areas as different as molecular evolution, genomics, structural biology and structure-based drug design. For molecular biologists, training in bioinformatics is likely to become as essential as training in how to handle a Gilson pipette. For mathematicians and computer scientists, training in bioinformatics will be considered a great asset to their other skills!

Let us now phrase this question in a slightly different way: What are the prime accomplishments of bioinformatics during the last decade? Richard Durbin, former Head of the Informatics Division at the Sanger Centre in Cambridge, UK, has answered this question by emphasizing three main accomplishments:

  1. The most important task in bioinformatics is the creation of nucleotide and protein databases, and making them accessible to the bioinformatics community. Especially good software to search the databases and to obtain relevant data is very important. In my opinion we have done a good job at that.

  2. The second accomplishment is the provision of computer support for the genome projects. Large-scale genome projects were only possible because of advances in computer technology the last 10 years. We are currently keeping in step with the advances in computer technology; it might even be a limiting factor for the advancement of bioinformatics.

  3. Thirdly, advances in protein structure and prediction, which is in my opinion a gradual process -- there are few dramatic leaps forward but this is still an important accomplishment closely linked with bioinformatics.


1.2 The genetic material of biological creatures

The genetic material of every living organism is built from four nucleotides, also called letters (A, T, C, G). The complete set of nucleotides in a cell is called a genome, and each cell in a multi-cellular organism contains one or more complete copies of this genetic material. Because there is a myriad of alternative ways in which four letters can be arranged in a string of millions of letters, the result may be a microbe, an elephant or a human being! A microbe has a genome of about 1 to 10 million nucleotides, whereas a higher organism, such as a human, has a genome of as many as 3000 million nucleotides.

A genome project is a project with the aim of determining each and every nucleotide in the genetic material of the organism selected for analysis. This is like opening the door to an office filled with papers describing all the different components required for making life! However, merely listing the letters (ATCGGCC...etc) is not enough for understanding what they are doing or how this organism can be built. Only someone who is able to read the language of the genetic code is able to transform the genetic information into suggestions about what this information means in the biological world. Since we are talking about as many as 3000 million letters, eyeballing is not enough: we need computers to help us read and interpret the information. This is why the bioinformatician is so important! He or she is the person with the computational expertise required to handle the data using computers and with the biological expertise required for understanding what the information that is constantly being produced by the computers really means. A bioinformatician may also assist in developing algorithms or building systems for storing and querying the information rather than being involved in the analysis of the information.

Genome projects and other large-scale biological research are now producing enormous quantities of biological data that will never be published in the traditional literature. Nucleotide sequences are being added to the databases at a rate of more than 210 million base pairs (bp) per year, and the database content is doubling in size approximately every 14 months. Complete genome sequence data are now available for all major kingdoms of life: bacteria, archaea and eukaryotes. Comparative analyses of the genomes of many different organisms have revolutionized our concepts of biological diversity.

The sequence of the first complete genome (Haemophilus influenzae) was published in 1995 and complete genome sequence information is now available for more than 75 organisms (TIGR Microbial Database). The exponential increase in the amount of sequence data stored in the public databases and the continuous development of novel methods and tools for the analysis of DNA sequences represent new challenges for modern molecular biologists. An understanding of bioinformatic methods is required in order to be able to handle, analyze and interpret the large volumes of sequence data that will be generated in the near future. Bioinformatics is thus emerging as a new field of research of relevance for basic sciences as well as for the many projects initiated on the design of new antibiotics and vaccine strategies by the pharmaceutical industries. Since much of the progress in bioinformatics is due to the accelerated rate at which sequence data is being produced, we will start by giving a brief historical overview of the efficiency of DNA sequencing.


1.3 Improvements in the efficiency of automated DNA sequencing

For the last decade, wet molecular biology laboratory work has primarily been associated with DNA sequencing (i.e. determining the identity and order of nucleotides in the DNA). Until 1970, only small RNAs could be sequenced, for example tRNAs of 70 to 90 nucleotides, which were small and easy to obtain in large quantities. The RNAs were usually broken into pieces of a few nucleotides and each fragment was sequenced by chromatographic methods. The first intact molecule to be sequenced, a yeast tRNA of 80 bases, was completed in 1965. DNA sequencing became possible with the discovery of restriction enzymes and DNA polymerases around 1970. For the first time, well-defined fragments could be derived from larger molecules, and the 5.4 kb genome of bacteriophage phiX174 was sequenced in 1977.

A major breakthrough came with the development of the dideoxy chain termination (Sanger) and chemical degradation (Maxam and Gilbert) methods in 1977. The Sanger dideoxy method was successfully used to sequence the human mitochondrial DNA of 16.5 kb in 1981, and the Maxam-Gilbert method was used to sequence the 40 kb bacteriophage T7 DNA in 1983. When the dideoxy method was introduced, the rate of sequencing was only about 1.5 kb per person per year. At this time, sequencing one gene could be the basis for an entire PhD thesis! Fred Sanger received the Nobel Prize in 1980 for the development of the Sanger sequencing method, which is even today, more than 20 years later, the most commonly used method in DNA sequencing. (PHOTO: FRED SANGER)

With the development of M13 cloning vectors and oligonucleotide synthesizers, the dideoxy method became universally applicable to any DNA fragment. Several new methods were developed in the late 1980s to automate gel electrophoresis, raw data acquisition and base calling. Computer-operated robotic workstations and sophisticated software considerably accelerated the rate at which sequence data could be generated. By the early 1990s, throughput had increased to approximately 100 kb per person per year. A further important development was the amplification of tiny quantities of DNA by the polymerase chain reaction (PCR). The combination of PCR and the dideoxy method has allowed the development of cycle sequencing, a process that has tremendously increased the sensitivity of sequencing. The developments in sequencing and bioinformatic technologies enabled the first bacterial genome, the 1.8 Mb genome of Haemophilus influenzae, to be completely sequenced by TIGR in 1995. By now, more than 60 complete genome sequences have been published. (PHOTO: DNA sequence-trace profiles).

Genome sequencing projects present a set of special problems. Since the length of sequence data obtained from a single experiment is currently limited to approximately 500 bases, determination of larger genomic sequences requires a strategy to merge overlapping sequence fragments (assembly). Both ordered and random strategies have been used. The ordered sequencing strategy facilitates the organization and assembly of raw sequence data, but is slow and difficult to automate. The random strategy, which relies on sophisticated software for sequence assembly and editing, is easy to automate, and data accumulate quickly during the early phase of the sequencing project. Random fragments are accumulated until the total size of the genome has been covered five to ten times. One advantage of this approach is that the overdetermination of most of the sequences minimizes the frequency of errors in the final sequence. As a result, most of the current large-scale genome sequencing projects employ a random shotgun strategy.
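
The arithmetic behind "covered five to ten times" can be sketched in a few lines of Python. This is only an illustration: the genome size is an assumed round number, and the estimate of uncovered bases uses the standard Poisson approximation for randomly placed reads (the Lander-Waterman model), which is not taken from the text above.

```python
import math

def reads_needed(genome_size_bp, read_length_bp, coverage):
    """Number of random reads needed to reach the requested average coverage."""
    return math.ceil(coverage * genome_size_bp / read_length_bp)

def expected_uncovered_fraction(coverage):
    """Poisson estimate of the fraction of bases not hit by any read."""
    return math.exp(-coverage)

if __name__ == "__main__":
    genome = 1_800_000  # assumed size of a small bacterial genome, in bp
    read_len = 500      # read length quoted in the text
    for c in (1, 5, 10):
        n = reads_needed(genome, read_len, c)
        gap = expected_uncovered_fraction(c) * genome
        print(f"{c:>2}x coverage: {n} reads, about {gap:.0f} bases expected uncovered")
```

Under these assumptions, going from five-fold to ten-fold coverage doubles the number of reads but reduces the expected number of unsequenced bases by more than a hundred-fold, which is one way to see why the random strategy accepts so much redundancy.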

Bioinformatics is required at several different stages during high-throughput DNA sequencing. First, the signals passing the detector of the fluorescent DNA sequencing instrument must be automatically captured and converted into a stream of nucleotides (base calling). Second, sophisticated software algorithms are required to assemble, edit, and compare the sequence data. Assembly programs must deal with the fact that each base in the final sequence has been experimentally determined many times, but that any base can be associated with experimental errors. Thus, sequence-assembly programs have to handle sequence data from thousands of individual sequence reactions and use information on the confidence of individual base calls. They must also automatically deal with sequence errors during alignment and contig building and assign a confidence estimate to each base in the final consensus sequence. Finally, they may need to incorporate additional information such as clone length and location of individual reactions. In order to achieve the production levels required for sequencing mammalian genomes, all the software from base calling to the final sequence assembly must be automated with as little human intervention as possible.
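
To make the idea of a confidence-weighted consensus concrete, here is a minimal sketch. It assumes that the reads have already been aligned into columns and that each base call carries an integer quality value where higher means more reliable (in the spirit of Phred-style scores). The function names and the simple "sum of qualities" rule are illustrative assumptions, not the method of any particular assembly program.

```python
from collections import defaultdict

def column_consensus(calls):
    """Pick the consensus base for one aligned column.

    calls: non-empty list of (base, quality) pairs, one per read covering
    this position. Returns (base, margin), where margin is the summed
    quality supporting the winning base minus the summed quality of all
    competing calls.
    """
    totals = defaultdict(int)
    for base, quality in calls:
        totals[base] += quality
    best = max(totals, key=totals.get)
    margin = totals[best] - (sum(totals.values()) - totals[best])
    return best, margin

def consensus(columns):
    """Apply column_consensus to every column of an alignment."""
    return [column_consensus(column) for column in columns]

if __name__ == "__main__":
    columns = [
        [("A", 30), ("A", 25), ("G", 10)],  # two good calls outvote one poor call
        [("C", 5), ("T", 40)],              # one confident call beats one weak call
    ]
    for base, margin in consensus(columns):
        print(base, margin)
```

A low or negative margin would flag a position that needs additional reads or manual inspection.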


2.1 Data management

Large-scale sequencing easily generates quantities of experimental samples and data that are impossible to track and manage with traditional laboratory notebooks. Large-scale sequencing projects have storage requirements for local laboratory data of hundreds of megabytes to hundreds of gigabytes. A laboratory information-management system should ideally record experimental procedures and data, automate routine data-analysis procedures, support project analysis and management, quality control and trouble-shooting, and provide for the automated export of data to the relevant public databases. Large-scale research projects also need to be able to follow a data trail from the final published conclusion, back through numerous inferences, to the large amount of primary sequence data supporting them. However, a general data-management system for the molecular biology laboratory has not yet been developed, presumably because of the difficulty in representing the wide variety of experimental methods and data produced in the different laboratories.


2.2 Making sense of the data

The raw sequence data are meaningless in the form of naked base pairs, even if those number in the thousands or millions. Robust algorithms for predicting coding regions are required in order to convert megabase pairs of sequence data into important biological insights. However, sequence data analysis and interpretation are a painful compromise between what is desired and what is realistic. Some of the problems that molecular biologists encounter are the following:

  • Protein-coding regions have to be found, and start and stop signals have to be identified. However, mammalian genomes such as the human genome contain much more DNA than is needed to encode proteins, so a random piece of DNA may not encode any protein at all.

  • Protein-coding genes in eukaryotes are not continuous, but are frequently scattered into blocks (exons) that are interrupted by DNA that does not code for anything (introns). To identify the boundaries of exons and introns is not trivial and most computer programs do not define exon/intron boundaries very precisely.

  • Not all of the coding DNA codes for proteins; some pieces of the DNA code for structural RNA such as ribosomal RNA (rRNA), transfer RNA (tRNA) and small nuclear RNA (snRNA). Special computer programs are required for the identification of RNA-encoding regions in the DNA.

  • The definition of noncoding functional regions such as genetic regulatory regions is particularly problematic. Computational approaches to the identification of regulatory sites are based on the assumption that there are sequences in the DNA or RNA that are recognized by the corresponding DNA- or RNA-binding proteins. While restriction sites, for example, can be so precisely defined that identification is a simple matter of string matching, most recognition sites are defined by only a small number of invariant residues. However, if a consensus sequence can be defined, a number of standard algorithms can be used to locate the site of interest in an unknown sequence.
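
The first problem in the list, finding start and stop signals, can be illustrated with a deliberately naive open-reading-frame (ORF) scan. The sketch below assumes the standard genetic code (start codon ATG, stop codons TAA, TAG and TGA), looks only at the forward strand, and ignores introns, so it is closer to what might work on a bacterial sequence than on a mammalian one. The function name and the minimum-length parameter are illustrative choices.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=3):
    """Return (start, end) positions of ORFs on the forward strand.

    start is the index of the A in ATG; end is exclusive and includes the
    stop codon. Only ORFs with at least min_codons codons before the stop
    are reported.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                   # close it and keep scanning
    return orfs

if __name__ == "__main__":
    sequence = "CCATGAAACCCGGGTAACC"
    for start, end in find_orfs(sequence):
        print(start, end, sequence[start:end])
```

Real gene finders combine such signals with statistical properties of coding sequence, precisely because an ATG followed eventually by a stop codon also occurs by chance in noncoding DNA.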


Despite these problems, by combining a variety of computational methods with some experimental work, molecular biologists have been fairly successful at identifying coding and regulatory regions. As soon as a pattern has been recognized, a weight matrix can be used to represent the frequencies of each nucleotide at each position in the site. Consensus sequences and weight matrices are special cases of pattern recognition in which humans play a large part in defining the pattern. A more general approach is to train a program to discriminate between sequences that are known to contain a particular site or pattern, and those that do not have the site. For example, neural networks have been trained to recognize signals and patterns in sequences. The development of computer methods for the identification of coding regions is currently considered to represent one of the most important areas of research in bioinformatics.
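
A weight matrix of the kind described above can be sketched as follows. The example sites, the pseudocount and the uniform background frequency of 0.25 per base are all assumptions made for illustration; the scoring is the common log-odds form, in which each position contributes the log of the observed frequency relative to background.

```python
import math

BASES = "ACGT"

def build_weight_matrix(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds weight matrix from equal-length example sites."""
    length = len(sites[0])
    total = len(sites) + 4 * pseudocount
    matrix = []
    for pos in range(length):
        column = [site[pos] for site in sites]
        matrix.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in BASES
        })
    return matrix

def score(matrix, window):
    """Sum the per-position weights for one candidate window."""
    return sum(column[base] for column, base in zip(matrix, window))

def best_match(matrix, sequence):
    """Return (score, position) of the highest-scoring window in sequence."""
    width = len(matrix)
    return max(
        (score(matrix, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )

if __name__ == "__main__":
    # Hypothetical examples of a six-base regulatory site.
    known_sites = ["TATAAT", "TATAAT", "TATGAT", "TACAAT"]
    matrix = build_weight_matrix(known_sites)
    print(best_match(matrix, "GGCGTATAATGCGC"))
```

Unlike an exact consensus search, the matrix still gives a reasonably high score to a window that differs from the consensus at a weakly conserved position.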


3.1 Predicting functions

The next step is to attempt to predict putative gene functions by finding a similar sequence that has been experimentally studied in another organism. For example, a quarter of cloned human genes display sequence similarity to yeast sequences with known functions. An educated guess is that those human genes have functions similar to the corresponding genes in yeast.

For molecular biologists it is thus extremely important to have access to public databases with the most recent information. The most widely known and used databases are

  • GenBank (NCBI), EMBL (European Molecular Biology Laboratory), DDBJ (DNA Data Bank of Japan) and GSDB (Genome Sequence Database). These are databases for storing DNA and RNA sequences.

  • SWISS-PROT and PIR (Protein Identification Resource). These are databases for storing protein sequence information.

  • PDB (Protein Data Bank). This is a database for storing macromolecular structures.

In addition, there are numerous specialized databases covering a diverse array of areas. Taken together, the databases provide valuable collections of organized data that are of broad benefit to molecular biologists. However, the number and diversity of information resources make discovering these resources as important as knowing how to use them. Thus, tools and systems to assist the researcher in navigating through the biological data are increasingly important.

BLAST and FASTA are programs for finding sequence similarities between a selected piece of DNA and the sequences contained in the databases. These programs report the hit in the database along with the estimated statistical significance of the hit. According to probability theory, the similarity score required for a given level of statistical significance is proportional to the logarithm of the database size. Therefore, as the database grows, biologically significant matches of distantly related sequences may have smaller similarity scores than random matches, and may be lost in the noise. One approach to this problem is to simplify the database by eliminating redundant sequences or by reducing families of similar sequences to a single representative sequence in the database that can be used for the initial searching.
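
The logarithmic dependence on database size can be made explicit with the Karlin-Altschul formula that underlies BLAST statistics, E = K * m * n * exp(-lambda * S), where m is the query length, n the database length, S the score and E the expected number of chance hits. Solving for S gives the score needed for a chosen E. In the sketch below the values of K and lambda are illustrative placeholders only; in practice they depend on the scoring scheme in use.

```python
import math

def score_threshold(query_len, db_len, e_value=0.01, k=0.1, lam=0.3):
    """Score needed so that the expected number of chance hits equals e_value.

    Derived from E = k * query_len * db_len * exp(-lam * S).
    k and lam are placeholder values chosen for illustration.
    """
    return (math.log(k * query_len * db_len) - math.log(e_value)) / lam

if __name__ == "__main__":
    query = 300
    for db in (1e8, 1e9, 1e10):
        print(f"database of {db:.0e} residues: score of {score_threshold(query, db):.1f} needed")
```

Each ten-fold increase in database size raises the required score by the same fixed amount, ln(10)/lambda, so a weak but genuine similarity that was significant in a small database can fall below the threshold in a large one.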

The result of the sequence similarity search provides initial information about the putative function of the gene product of interest. To find out more detailed information about the role of this protein in cellular metabolism, specialized metabolic pathway databases can be consulted.


3.2 Predicting structures

The rate at which newly determined protein structures are being published is also growing rapidly. Newly determined structures often show structural similarity to previously determined structures, even when no sequence similarity is detected. New algorithms allow protein structures to be compared with databases of all known structures. Such structure-based searches are often used as a tool for discovering biologically interesting relationships among proteins.

The major experimental methods for deducing macromolecular structures at atomic resolution are X-ray crystallography and nuclear magnetic resonance (NMR). Both methods produce extremely large amounts of data and are entirely dependent upon the availability of powerful computers and sophisticated processing algorithms for the interpretation of these data. Combining structural information from several experimental techniques can often provide the basis for a structural solution where only partial data are available from any single technique. Improved prediction algorithms are providing new ways to tackle problems of data analysis in crystallography, and new, more carefully refined protein structures are providing new insights into the protein-folding problem, which is at the heart of structure prediction.

Currently, there are more than 15 million amino acid residues, representing more than 43 000 proteins, in the protein databases but only 3 500 atomic structures in the PDB. It is still experimentally much easier to sequence a gene than to determine the structure of the protein it codes for. The ability to predict protein structures directly from amino acid sequences would be of great advantage to structure-function studies, as well as to the emerging fields of protein engineering and design. However, for proteins of the length found in living cells the number of possible conformations is astronomically high, making an exhaustive search through the conformational space essentially impossible with existing computers. A recent alternative approach turns the protein-folding problem into the inverse-structure problem, which can be described as follows: Given a structure, what sequences will fold into it? Will this sequence fold into a known structure?
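
A back-of-the-envelope calculation shows what "astronomically high" means here. The sketch assumes, purely for illustration, that each residue can adopt three backbone conformations independently and that a computer can examine 10^12 conformations per second; neither number comes from the text, and real proteins are not this simple.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def conformations(n_residues, states_per_residue=3):
    """Number of chain conformations if residues vary independently."""
    return states_per_residue ** n_residues

def years_to_enumerate(n_residues, states_per_residue=3, rate_per_second=1e12):
    """Time to visit every conformation once at the given (assumed) rate."""
    seconds = conformations(n_residues, states_per_residue) / rate_per_second
    return seconds / SECONDS_PER_YEAR

if __name__ == "__main__":
    for n in (10, 50, 100):
        print(f"{n} residues: {conformations(n):.2e} conformations, "
              f"{years_to_enumerate(n):.2e} years to enumerate")
```

Even under these generous assumptions, a chain of 100 residues would take vastly longer than the age of the universe to search exhaustively, which is why practical methods must prune or sidestep the search.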

Molecular dynamics approaches closely related to those used for the protein-folding problem are also used in a variety of simulations of systems involving large biomolecules interacting over time. These include simulations of substrate binding, enzyme reactions, membranes and membrane proteins, protein-DNA interactions, muscle action, viral infection and DNA supercoiling.


4. Molecular evolution and phylogenetic trees

Sequence data are also being used to determine the evolutionary relationships of organisms. By determining the number of mutational changes in pair-wise comparisons, a quantitative measure can be obtained of the distance between any pair of sequences. These values can then be used to reconstruct a phylogenetic tree that describes the relationships between the gene sequences. The resulting gene tree is then assumed to represent the phylogenetic relationships of the species that were the sources of the gene sequences.
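
As a minimal sketch of such a distance measure, the code below computes the proportion of differing sites between two aligned sequences (the p-distance) and applies the Jukes-Cantor correction, d = -3/4 ln(1 - 4p/3), which compensates for multiple changes at the same site under the assumption that all substitutions are equally likely. The choice of this particular correction, the gap handling and the example sequences are illustrative assumptions.

```python
import math
from itertools import combinations

def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences.

    Columns where either sequence has a gap ('-') are ignored.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(1 for x, y in pairs if x != y) / len(pairs)

def jukes_cantor(p):
    """Jukes-Cantor corrected distance; only defined for p < 0.75."""
    return -0.75 * math.log(1 - 4 * p / 3)

def distance_matrix(aligned):
    """Corrected distance for every pair in a {name: sequence} mapping."""
    return {
        (m, n): jukes_cantor(p_distance(aligned[m], aligned[n]))
        for m, n in combinations(sorted(aligned), 2)
    }

if __name__ == "__main__":
    # Hypothetical aligned sequences.
    aligned = {
        "species_a": "ACGTACGTAC",
        "species_b": "ACGTACGAAC",
        "species_c": "ACCTAGGAAC",
    }
    for pair, d in distance_matrix(aligned).items():
        print(pair, round(d, 3))
```

The resulting table of pairwise distances is the input that a distance-based tree-building method would start from.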

The task is more daunting than may at first be realized. For ten different sequences there are a few million possible trees. But evolution has only occurred once! To identify the single correct tree from the millions of incorrect trees is like searching for a needle in a haystack. Therefore, the reconstructed trees should not be taken as the eternal truth! What has been reconstructed is a tree that, with some statistical significance, resembles the true evolutionary pathway to some extent. An additional complication is that not all sequences have evolved at equal rates. Some genes change faster than other genes. Unequal rates of evolution may obscure the true relatedness of sequences. What we need are tree-construction methods that are efficient, powerful, consistent, robust and falsifiable.
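
The count of possible trees follows from a standard combinatorial result: for n sequences there are (2n-5)!! distinct unrooted bifurcating trees and (2n-3)!! rooted ones, where !! denotes the product of every second integer down to 1. The short sketch below evaluates these formulas; the function names are illustrative.

```python
def odd_double_factorial(k):
    """Product k * (k - 2) * (k - 4) * ... down to 1 (returns 1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result

def unrooted_trees(n):
    """Number of unrooted bifurcating trees for n >= 3 sequences."""
    if n < 3:
        raise ValueError("need at least three sequences")
    return odd_double_factorial(2 * n - 5)

def rooted_trees(n):
    """Number of rooted bifurcating trees for n >= 2 sequences."""
    if n < 2:
        raise ValueError("need at least two sequences")
    return odd_double_factorial(2 * n - 3)

if __name__ == "__main__":
    for n in (4, 10, 20):
        print(n, unrooted_trees(n), rooted_trees(n))
```

For ten sequences this gives 2 027 025 unrooted trees, in line with the "few million" quoted above, and about 34 million rooted trees. For twenty sequences the number is already far too large to examine every tree.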

The starting point for all attempts to evaluate the evolutionary relationships of sequences is to decide which nucleotide to compare to which other nucleotide. This is done with multiple-sequence alignment, which refers to the search for similarity in three or more sequences. Multiple alignment methods are also used to search for functional regions in genomic sequences. In general terms, the aim is to align a set of sequences optimally in order to reveal the similarities that underlie a family of sequences, and to provide a basis for quantitative estimates of the differences among the sequences.

The tree-construction methods are usually divided into distance methods and character-state methods. The distance methods apply a variety of different algorithms to a table of overall pairwise distance measures derived from the entire genes. The character-state methods distinguish informative sites (those that carry evolutionary information) from noninformative sites. Here, only the informative sites are used for the phylogenetic reconstruction. Recent progress has led to new approaches to tree inference, increased understanding of the general properties of the methods, and methods for estimating the reliability of trees.
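
One common way to define an informative site, used in parsimony analysis, is that at least two different character states must each occur in at least two of the aligned sequences. The sketch below applies that definition. It is an assumption that this is the sense of "informative" intended above, and the example alignment is hypothetical.

```python
from collections import Counter

def is_informative(column):
    """True if at least two states each occur in at least two sequences."""
    counts = Counter(state for state in column if state != "-")
    return sum(1 for n in counts.values() if n >= 2) >= 2

def informative_sites(alignment):
    """Indices of informative columns in a list of aligned sequences."""
    return [i for i, column in enumerate(zip(*alignment)) if is_informative(column)]

if __name__ == "__main__":
    alignment = [
        "AAGT",
        "AAGC",
        "ATCA",
        "ATCG",
    ]
    print(informative_sites(alignment))
```

In this example the first column is invariant and the last column has four different states, so neither favors one grouping of the sequences over another; only the two middle columns do.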


5. Challenges in bioinformatics

Biological research has long been characterized by a twofold approach (a) field observation, specimen collection and classification and (b) laboratory experimentation, often based on hypotheses formed by careful examination of specimen collections. This paradigm in biological research is currently shifting. The new paradigm is based on the availability of large volumes of gene and protein sequences in the databases. As the cloning and sequencing steps will have been accomplished en masse for all genes, individual researchers will not need to repeat them for each gene of interest, but will develop hypotheses for new experiments by mining the databases. In the new paradigm, the bulk of discoveries might be made by experiments conducted in silico, rather than in vivo or in vitro.

Bioinformatics occupies the interface between biology, computer science, applied mathematics, statistics, and computer and software engineering. This interface spans a cultural gap that can be described as a three-culture problem. Biologists want immediate solutions to their data-management and analysis problems. Computer scientists and mathematicians seek interesting basic research problems, and software engineers ask both groups for a specification that is sufficiently well defined for them to get on with building something useful. This bioinformatics culture gap has two principal sources: significant differences in the vocabularies and modalities of scientific approach between the three groups, and an underestimation on all sides of the effort required to bridge the gap efficiently.

To bridge this culture gap it is necessary to establish bioinformatics as an interdisciplinary profession through cross-training mathematicians, computer scientists, software engineers and database designers in one or more biological subdisciplines, simultaneously with cross-training molecular biologists and geneticists in computer science. However, as at the inception of any profession, many of the scientists currently involved in bioinformatics have little or no academic or professional training in bioinformatics, but have "learnt by doing".



Additional reading: You will learn more about the international databases and how to search them in chapters 4, 5 and 6 of this course.





Questions:

  1. Please try to formulate your own definition of bioinformatics using no more than five sentences.
  2. What do you think the rapid increase in sequencing productivity has meant for society, and for the many small scientific research groups?
  3. Only people with skills in bioinformatics will be able to understand and interpret the information contained in, for example, the human genetic material. Who do you think should receive training in bioinformatics?
  4. Do you see any danger if only a small group of experts in a small number of countries has expertise in bioinformatics?
Send your answers to background1.overview@artedi.ebc.uu.se.


This page was last updated: 2003-05-26 15:26