Position: Professor of Bioinformatics
Division: Biological Chemistry and Drug Discovery
Address: College of Life Sciences,
University of Dundee,
Dundee
Telephone: +44 1382 385860, int ext. 85860
Fax: +44 1382 385764
Email: g.j.barton@dundee.ac.uk
Website: The Barton Group
The completion in June 2000 of the first draft of the 3 billion bases of DNA in the human genome was the most public demonstration that molecular biology had become a data-intensive science. In today's “post-genome era” the DNA sequence of humans and other organisms is only the tip of an iceberg of data that includes information on gene expression (transcriptomics), protein expression (proteomics) and protein structure (structural genomics). These experimental techniques produce prodigious amounts of data that can only be organised, compared, understood and exploited to further scientific understanding and to cure disease by the development and application of advanced computational methods.
Bioinformatics is the research field that seeks to find computational ways of understanding biological systems. The subject is very broad and ranges from research in statistics and computer science, through software engineering and database development, to applications in specific biological systems. The possible biological applications are equally broad, from the study of populations through molecular structure and interactions, to simulations of metabolic and signalling processes. Figure 1 summarises how our research in bioinformatics interacts with these different research areas while the following sections outline some recent highlights.
The research in my group touches on many areas of bioinformatics, but our core goal is to develop effective methods to predict protein structure and function from the amino acid sequence. Figure 2 shows the overall relationship between a protein sequence, its structure and function. A protein is a molecule made up of tens to thousands of amino acids, of up to 20 different types, linked end to end. The number and order of the amino acids are coded for in an organism’s genome by its DNA sequence. The amino acid sequence determines how the protein molecule will fold up into a unique three-dimensional structure, and it is this structure that dictates the protein’s function. Protein structures are large and complex with many thousands of atoms arranged in space. However, the backbone of the protein chain is seen to adopt regular secondary structures that recur in proteins of all types. Although progress has been made towards predicting the full three-dimensional structure of a protein given just its amino acid sequence, this remains an unsolved problem. In contrast, over the last 20 years, my group has made significant advances in predicting the secondary structure of proteins, raising the accuracy from around 60% in 1987 to over 82% in recent work by Chris Cole on the JNet prediction algorithm [1]. Whenever we develop a new technique or database, we aim to make it widely available via the web. For secondary structure prediction, JNet is implemented in the JPred server at Dundee. Currently, JPred performs between 5,000 and 10,000 predictions per month for colleagues worldwide, and also provides a key resource for our own studies.
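To make the accuracy figures above concrete, the following is a minimal sketch of how a per-residue prediction accuracy can be computed. It assumes the accuracy is the usual three-state measure (the fraction of residues correctly assigned to helix H, strand E or coil C), which the text does not state explicitly. The function name and the example strings are hypothetical, and this is not the JNet or JPred code.

```python
def q3_accuracy(predicted: str, observed: str) -> float:
    """Fraction of residues whose predicted state (H, E or C) matches the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed strings must have equal length")
    if not observed:
        raise ValueError("cannot score an empty sequence")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


# Hypothetical 10-residue example: one helix residue is mispredicted as coil.
observed = "CHHHHCEEEC"
predicted = "CHHHCCEEEC"
print(f"Accuracy = {q3_accuracy(predicted, observed):.0%}")  # prints "Accuracy = 90%"
```

In practice such a score would be averaged over a large benchmark set of proteins of known structure rather than reported for a single sequence.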
The comparison of protein and nucleic acid sequences is a central technique in modern molecular biology. Alignments may be used to find similar sequences in a database search, and to show which parts of the sequences share similar properties. The similarities that are found provide a rich source of information for prediction of protein structure (as in JPred) or identifying the possible functions for a protein as described for Kinomer [9] and SMERFS [2].
Techniques to find and align similar sequences have long been a research topic in my group. One technique that has been continuously developed since 1989 is the iterative search method SCANPS [5]. SCANPS combines an exhaustive alignment algorithm with statistics tuned to each individual database search. This combination makes SCANPS more sensitive than PSI-BLAST when assessed by domain-searching benchmarks. Acceptable search times are maintained by exploiting on-chip and multi-processor parallel processing techniques. SCANPS produces sequence profiles and a multiple sequence alignment made relative to the query sequence. SCANPS is run as a service by the European Bioinformatics Institute and at Dundee with a novel interface to the SCOP protein domain database, developed by Tom Walsh.
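As an illustration of what an exhaustive alignment involves, the sketch below scores the best local alignment between two sequences by dynamic programming, in the style of the Smith-Waterman algorithm. This is an assumption made for illustration: the text does not name the algorithm SCANPS uses, and the scoring values (a simple match/mismatch score and a linear gap penalty) are hypothetical stand-ins for a real substitution matrix and gap model. It is not the SCANPS code.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Hypothetical sequences sharing the six-residue segment TAYIAK.
print(local_alignment_score("MKTAYIAKQR", "GGTAYIAKGG"))  # prints 12
```

Every cell of the matrix is evaluated, which is why the method is exhaustive and why a database search with it benefits from the parallel processing mentioned above.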
Since we work a lot with sequence data, and in particular with multiple sequence alignments, we need good tools to view and manipulate alignments. The Jalview multiple alignment workbench was first developed in my group in 1996 to help us visualise the results of methods for predicting secondary structure and functional sites. The program has been widely distributed and appears on many thousands of web pages. Over the last three years, with funding from BBSRC, Andrew Waterhouse and Jim Procter have added major new features. Jalview allows large multiple alignments to be manipulated interactively and overlaid with sequence features and the results of more complex and time-consuming computational analyses. The current Jalview may be downloaded for free from www.jalview.org. Jalview runs on Windows, Mac, Linux and other platforms that support Java. Jalview can gather information from DAS servers around the world as well as access multiple alignment and secondary structure prediction algorithms via SOAP web services. Figure 3 illustrates a typical Jalview screenshot.
Sensitive sequence searching methods such as SCANPS or PSI-BLAST will correctly identify many members of a protein family. However, for larger protein families, it is important to sub-classify the proteins into smaller groups based on specificity or function. One approach we have applied to a number of protein families is sub-family specific Hidden Markov Models (HMMs). The protein kinases are a large family of proteins that are of central importance to key cellular processes and disease. Diego Miranda-Saavedra built a sub-family specific HMM library that, on well-documented genomes, was shown to classify the majority of kinases to the correct sub-family [6]. Given this new technique, Diego was then able to apply it to a range of newly sequenced genomes and rapidly identify and compare the full kinase complements, the “kinomes”, of those genomes. Two important examples were the kinome of the sexually transmitted pathogen Trichomonas vaginalis [9] and of the filarial nematode parasite Brugia malayi [7]. As with our other work, a database of kinase classifications is made available via our web site.
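The idea of sub-family classification can be sketched as: build one model per sub-family, score a query against each, and assign the query to the best-scoring model. The example below uses a simple position-specific log-odds profile as a much simpler stand-in for a hidden Markov model (no insertions or deletions, uniform background, equal-length ungapped sequences). All names and the sequence fragments are hypothetical, and this is not the method or library used in the published work.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def build_profile(aligned, pseudocount=1.0):
    """Per-column log-odds scores relative to a uniform background."""
    background = 1.0 / len(ALPHABET)
    profile = []
    for column in zip(*aligned):
        counts = Counter(column)
        total = len(column) + pseudocount * len(ALPHABET)
        profile.append({
            aa: math.log(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return profile


def score(profile, query):
    """Sum of the column scores for an ungapped query of the same length."""
    return sum(column[aa] for column, aa in zip(profile, query))


def classify(query, families):
    """Return the name of the sub-family whose profile scores the query highest."""
    profiles = {name: build_profile(seqs) for name, seqs in families.items()}
    return max(profiles, key=lambda name: score(profiles[name], query))


# Hypothetical six-residue fragments for two made-up sub-families.
families = {
    "subfamily_A": ["GKGSFG", "GKGAFG", "GRGSFG"],
    "subfamily_B": ["GTGSYA", "GSGTYA", "GTGSYA"],
}
print(classify("GKGAFG", families))  # prints "subfamily_A"
```

A real profile HMM additionally models insertions and deletions and uses score thresholds, so that a sequence belonging to none of the sub-families can be left unclassified.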
An accurate protein multiple sequence alignment is the basis for many analyses. Perhaps the simplest is to identify residues that are identical across all members of the alignment. If the sequences are diverse, identical conserved amino acids suggest a common function for that position in the sequence. For an enzyme, this might be amino acids involved in catalysis. More subtle signals can be extracted from the sequences by considering the conservation of residues by their physico-chemical properties between sub-families. SMERFS [2] looks for local regions of a protein alignment where the pair-wise similarities are similar to those found when comparing the full-length sequences. These regions are likely to be those most important to defining the specificity of the different proteins in the alignment. SMERFS was tested on the problem of identifying protein-protein and protein-ligand binding sites from the amino acid sequence with some success. You can try SMERFS here.
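The simplest analysis described above, finding positions that are identical across all members of an alignment, can be sketched as follows. The alignment is hypothetical and the function name is our own for illustration. The more subtle property-based and sub-family comparisons used by SMERFS are not shown.

```python
def identical_columns(alignment):
    """Return 0-based column indices where every sequence has the same non-gap residue."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    conserved = []
    for col in range(length):
        residues = {seq[col] for seq in alignment}
        # A column counts as conserved only if one residue type occurs and it is not a gap.
        if len(residues) == 1 and "-" not in residues:
            conserved.append(col)
    return conserved


# Hypothetical three-sequence alignment; "-" marks a gap.
alignment = [
    "MKHDG-S",
    "MRHEGAS",
    "MKHDGTS",
]
print(identical_columns(alignment))  # prints [0, 2, 4, 6]
```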
Protein three-dimensional structures determined by X-ray crystallography give the most detailed picture of how two or more proteins interact. Although only a small subset of protein complexes have been determined at high resolution by X-ray methods, interactions between proteins similar to those of known structure can be inferred by similarity as summarised in Figure 4. Protein structures are stored in the Protein Data Bank (PDB) as the contents of the asymmetric unit rather than the biologically active molecule, but Emily Jefferson discovered that 34.5% more interactions could be found by examining the biological units [11]. The SNAPPI-DB database system [8], developed for this study and other work by us in this area, is available to query and download from our site.
X-ray crystallography is the major technique used to determine the detailed three-dimensional structure of biological macromolecules. In order to perform crystallography on a protein, the protein must first be made, then purified and persuaded to crystallise. One strategy to improve success is to try the orthologous proteins from different species, while another is to prune the protein to remove flexible and other regions that might make crystallisation difficult. In order to help with orthologue selection as part of the Dundee / St Andrews / Warwick BBSRC Structural Proteomics of Rational Targets (SPoRT) consortium, Ian Overton developed two novel techniques to predict how likely a protein is to succeed in a high-throughput crystallisation experiment: the OB-Score [12] and Parcrys [3]. Both methods provide a ranking of protein targets, but in benchmarks, Parcrys out-performs the OB-Score and other methods. You can try both methods at: www.compbio.dundee.ac.uk/xtal
Although listed here under “structural bioinformatics”, TarO could equally well appear under “sequence analysis”. There are a very large number of different analyses that can be carried out on a protein sequence or sequence alignment, but it can be difficult to draw all the results of these methods together in one place. As summarised in Figure 5, given a protein sequence, TarO performs a wide range of sequence database searches and analyses and presents the results in a table sorted by Parcrys score [3]. Further results are presented on an annotated multiple sequence alignment that is viewed in Jalview. The aim of TarO is to simplify the process of selecting orthologues for study by crystallography and optimising the domain boundaries for construct design. However, the TarO system is general and can be applied to any protein analysis problem. To try TarO see: www.compbio.dundee.ac.uk/taro
A common step towards understanding the function of a protein is to identify which other proteins it interacts with. The full detail of a protein-protein interaction may be revealed by X-ray crystallography as stored in SNAPPI-DB above. However, structural data are available for a small proportion of all known proteins, so protein-protein interactions are normally identified at lower resolution by a variety of non-crystallographic experimental techniques. Recently, a number of studies have applied high-throughput experimental methods to identify potential interacting partners for all proteins in an organism’s complete set of proteins - its “proteome”. In Michelle Scott’s work [10], information about a protein is combined from a variety of different sources within a Bayesian statistical framework as summarised in Figure 6. The resulting predictions enable a probability of interaction to be assigned to each protein pair in the Human proteome. The database of predicted interactions may be queried at www.compbio.dundee.ac.uk/www-pips and is undergoing continual development.
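One common way to combine evidence within a Bayesian framework is to multiply the prior odds of interaction by a likelihood ratio for each evidence source. The sketch below illustrates this under the assumption that the evidence sources are independent (a naive Bayes combination). The text does not specify the exact formulation used, and the prior and likelihood ratios shown are hypothetical, not values from the published work.

```python
import math


def posterior_probability(prior, likelihood_ratios):
    """Combine a prior probability with independent likelihood ratios.

    posterior odds = prior odds * product of likelihood ratios
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must lie strictly between 0 and 1")
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * math.prod(likelihood_ratios)
    return posterior_odds / (1.0 + posterior_odds)


# Hypothetical protein pair: a low prior, two supporting sources (ratios above 1)
# and one source that counts against interaction (ratio below 1).
probability = posterior_probability(prior=0.001, likelihood_ratios=[20.0, 5.0, 0.5])
print(f"Predicted probability of interaction: {probability:.3f}")
```

Because the prior probability that two randomly chosen proteins interact is small, several independent lines of supporting evidence are typically needed before a pair receives a high predicted probability.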
In 2008, new DNA sequencing techniques such as Solexa and 454 have started to provide sequence data at speeds orders of magnitude faster and at costs orders of magnitude lower than possible with conventional sequencing technology. These developments mean that within five years it will become routine to sequence the entire genome of an individual to identify likely roots of disease susceptibility. Already, this technology is revolutionising how many problems in molecular biology are addressed experimentally. However, rational use of the technology requires a high degree of computational competence. Accordingly, in parallel with our research in sequence and structure analysis, we are working in collaboration with a number of experimental groups on new ways to interpret data from high-throughput sequencing and also data from quantitative mass-spectrometry. These techniques, when coupled with smart computing methods, promise over the next few years to reveal new insights into the functioning of the cell in normal and disease states.
We collaborate with a large number of groups across the College of Life Sciences including those of Mike Ferguson, Angus Lamond, Bill Hunter, Gyorgy Hutvagner, Ron Hay, Jeff Williams, Inke Nathke, Alan Fairlamb, Anton Gartner, Arno Muller, Mike Stark, Andy Flavell and Tomo Tanaka. As outlined in Figure 1, addressing the specific biological problems important to each group suggests gaps in our understanding of how proteins function and so prompts us to perform new general studies. In turn these lead to the development of new and improved predictors that we can apply to the specific systems of interest to our wet-lab colleagues.
A more comprehensive description of our work can be found on the group web site www.compbio.dundee.ac.uk together with access to our full publications list, web-accessible software, databases and downloads.
1. Procter, J.B., J.D. Thompson, I. Letunic, C. Creevey, F. Jossinet, and G.J. Barton, Visualization of multiple alignments, phylogenies and gene family evolution. Nature Methods, 2010, in press.
2. Ono, M., K. Yamada, F. Avolio, M.S. Scott, S. van Koningsbruggen, G.J. Barton, and A.I. Lamond, Analysis of Human Nucleolar snoRNAs and the Development of snoRNA Modulator of Gene Expression (snoMEN) Vectors. Mol Biol Cell, 2010, in press.
3. Waterhouse, A.M., J.B. Procter, D.M. Martin, M. Clamp, and G.J. Barton, Jalview Version 2--a multiple sequence alignment editor and analysis workbench. Bioinformatics, 2009. 25(9): p. 1189-91.
4. Scott, M.S., F. Avolio, M. Ono, A.I. Lamond, and G.J. Barton, Human miRNA precursors with box H/ACA snoRNA features. PLoS Comput Biol, 2009. 5(9): p. e1000507.
5. McDowall, M.D., M.S. Scott, and G.J. Barton, PIPs: human protein-protein interaction prediction database. Nucleic Acids Res, 2009. 37(Database issue): p. D651-6.
6. Martin, D.M., D. Miranda-Saavedra, and G.J. Barton, Kinomer v. 1.0: a database of systematically classified eukaryotic protein kinases. Nucleic Acids Res, 2009. 37(Database issue): p. D244-50.
7. Izquierdo, L., B.L. Schulz, J.A. Rodrigues, M.L. Guther, J.B. Procter, G.J. Barton, M. Aebi, and M.A. Ferguson, Distinct donor and acceptor specificities of Trypanosoma brucei oligosaccharyltransferases. EMBO J, 2009. 28(17): p. 2650-61.
8. Izquierdo, L., M. Nakanishi, A. Mehlert, G. Machray, G.J. Barton, and M.A. Ferguson, Identification of a glycosylphosphatidylinositol anchor-modifying beta1-3 N-acetylglucosaminyl transferase in Trypanosoma brucei. Mol Microbiol, 2009. 71(2): p. 478-91.
9. Golebiowski, F., I. Matic, M.H. Tatham, C. Cole, Y. Yin, A. Nakamura, J. Cox, G.J. Barton, M. Mann, and R.T. Hay, System-wide changes to SUMO modifications in response to heat shock. Sci Signal, 2009. 2(72): p. ra24.
10. Cole, C., A. Sobala, C. Lu, S.R. Thatcher, A. Bowman, J.W. Brown, P.J. Green, G.J. Barton, and G. Hutvagner, Filtering of deep sequencing data reveals the existence of abundant Dicer-dependent small RNAs derived from tRNAs. RNA, 2009. 15(12): p. 2147-60.