Chapter 11. Bioinformatics

by László Nyitray

Table of Contents

11.1. Introduction
11.2. Primary sequence and three-dimensional structure databases
11.2.1. GenBank
11.2.2. UniProt
11.2.3. Protein Data Bank (PDB)
11.3. Introduction to bioinformatics analysis of sequences
11.3.1. Bioinformatics tasks during molecular cloning
11.3.2. Sequence similarity search and sequence alignment
11.3.3. Bioinformatics analysis of protein sequences
11.4. Visualisation of protein structures by molecular graphics programs
11.4.1. RasMol
11.4.2. PyMOL
11.4.3. Jmol

11.1. Introduction

In the last decade, more biology-related information has accumulated than in the preceding two and a half thousand years of the history of science. This surge of information consists mostly of nucleic acid and protein sequences, primarily because DNA sequencing became a routine technique after the recombinant DNA revolution. Bioinformatics emerged in the mid-1980s as a new field at the interface of informatics and molecular biology, with the aim of storing and analysing the huge amount of data provided by DNA sequencing. It combines mathematical algorithms, computer science and statistics (i.e. informatics methods) to derive knowledge from the computational analysis of experimental biological data. From a molecular biological perspective, bioinformatics mostly deals with the storage, retrieval and analysis of nucleic acid sequences and of the amino acid sequences of proteins derived from them. It has several specific subfields; for instance, structural bioinformatics deals with the in silico analysis of the three-dimensional structure of macromolecules. Beyond sequencing, a massive amount of data is produced by many other so-called high-throughput (HTP) methods, and these data can be managed only with bioinformatics tools. These HTP methods include, to mention just a few, gene expression analysis, electrophoresis and mass spectrometry, which generate data used to establish genetic, metabolic, signal transduction, protein-protein interaction and other pathways and networks.

Bioinformatics provides the core toolbox for the emerging field of systems biology. Systems biology aims to understand biology through a holistic approach and is based, among other things, on the enormous datasets supplied by HTP methods. These approaches extend the traditional reductionist approach of molecular biology. The "omics" fields are part of systems biology. They started with genomics (the genome is the full complement of genetic material within an organism), followed by other fields of study named with neologisms such as proteomics (the large-scale study of the proteome, the full complement of proteins within an organism), transcriptomics (the study of the transcriptome, the full complement of RNA transcribed within an organism, a cell type or a particular physiological state of a cell) and interactomics (the study of the interactome, the protein-protein interactions within an organism or cell). The "omics" list could be continued with the study of the complete set of small-molecule metabolites (metabolome), the complete set of lipids (lipidome), the entire complement of carbohydrates (glycome), the full set of protein kinase enzymes (kinome) and so on.

In this chapter, we will describe the so-called primary databases that contain nucleic acid and protein sequences as well as three-dimensional structures of macromolecules and their complexes. Moreover, we will give an introduction to in silico sequence (and structure) analysis. The role of bioinformatics in molecular cloning experiments (such as restriction mapping of DNA constructs and design of oligonucleotide primers) will be covered only briefly. (More details of recombinant DNA techniques can be found in Chapter 10.) The first steps in sequence analysis are similarity searches and sequence alignments; programs that perform these analyses will be described. Data from sequence alignments can be used to construct phylogenetic trees and to infer evolutionary relationships among sequences (and among species). The principles of molecular evolution are not covered in this e-book. We will also discuss in silico methods used to predict structural and functional motifs within nucleic acid and protein sequences. We must keep in mind that most results of sequence analysis are predictions, and laboratory experiments should be conducted to validate them.
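To give a first impression of what restriction mapping means in computational terms, the following minimal Python sketch searches a DNA sequence for a recognition site and reports the fragment lengths expected after digestion. It is only an illustration under stated assumptions: the molecule is linear, the enzyme is EcoRI (recognition site GAATTC, cut after the first G), and the example sequence as well as the function names find_sites and fragment_lengths are invented for this purpose.

```python
def find_sites(sequence, site="GAATTC"):
    """Return the 1-based start positions of every occurrence of `site`."""
    sequence = sequence.upper()
    site = site.upper()
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start + 1)          # convert 0-based index to 1-based
        start = sequence.find(site, start + 1)  # step by 1 to allow overlaps
    return positions


def fragment_lengths(sequence, site="GAATTC", cut_offset=1):
    """Fragment lengths of a LINEAR molecule after cutting at every site.

    cut_offset is the number of bases of the site that stay on the left
    of the cut (1 for EcoRI, which cuts G^AATTC).
    """
    cuts = [p - 1 + cut_offset for p in find_sites(sequence, site)]
    bounds = [0] + cuts + [len(sequence)]
    return [right - left for left, right in zip(bounds, bounds[1:])]


# Invented 19-nucleotide example sequence containing two EcoRI sites.
example = "AAGAATTCGGGGAATTCTT"
sites = find_sites(example)
fragments = fragment_lengths(example)
print(sites)      # [3, 12]
print(fragments)  # [3, 9, 7]
```

Because the EcoRI site is palindromic, scanning one strand is sufficient here; a non-palindromic site would also require a search on the reverse complement strand.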

Although the three-dimensional structure (conformation) of a protein is determined by its amino acid sequence (recall the conclusive Anfinsen experiment proving that the polypeptide chain spontaneously folds into its native three-dimensional shape), currently this information can only be partially inferred from the sequence. Ab initio protein structure prediction is still in its infancy. In contrast, the visualisation of protein conformation is a relatively simple task. If the atomic-resolution structure of a protein (or of a nucleic acid or of their complexes) has previously been determined experimentally (by X-ray crystallography or nuclear magnetic resonance spectroscopy) or modelled (by homology modelling), its visualisation requires only a file containing the atomic coordinates of the structure and any of the several freely available molecular graphics programs that can handle these coordinates. A few of the most popular such programs will be described at the end of this chapter. Practical problems and exercises in bioinformatics can be found in Chapter 12.
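The notion of a "file containing the atomic coordinates" can be made concrete with a short Python sketch that reads ATOM and HETATM records in the fixed-column PDB text format. It is a minimal illustration, not a complete parser: it assumes well-formed records of at least 54 characters and ignores everything else in the file. The helper make_atom_line and the two atoms with their coordinates are invented for the demonstration and do not come from any real PDB entry.

```python
import math


def make_atom_line(serial, name, res_name, chain, res_seq, x, y, z):
    """Hypothetical helper: build one ATOM record with PDB column layout."""
    return "{:<6}{:>5} {:<4}{:1}{:>3} {:1}{:>4}{:1}   {:>8.3f}{:>8.3f}{:>8.3f}".format(
        "ATOM", serial, name, "", res_name, chain, res_seq, "", x, y, z
    )


def parse_atoms(pdb_text):
    """Extract atom name, residue, chain and x, y, z from ATOM/HETATM records."""
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            atoms.append({
                "name": line[12:16].strip(),      # columns 13-16
                "res_name": line[17:20].strip(),  # columns 18-20
                "chain": line[21],                # column 22
                "res_seq": int(line[22:26]),      # columns 23-26
                "x": float(line[30:38]),          # columns 31-38
                "y": float(line[38:46]),          # columns 39-46
                "z": float(line[46:54]),          # columns 47-54
            })
    return atoms


# Invented two-atom demo "file"; the coordinates are illustrative only.
demo = "\n".join([
    make_atom_line(1, " N  ", "GLY", "A", 1, 0.0, 0.0, 0.0),
    make_atom_line(2, " CA ", "GLY", "A", 1, 1.458, 0.0, 0.0),
    "END",
])

atoms = parse_atoms(demo)
n_atom, ca_atom = atoms
distance = math.dist(
    (n_atom["x"], n_atom["y"], n_atom["z"]),
    (ca_atom["x"], ca_atom["y"], ca_atom["z"]),
)
print(len(atoms), round(distance, 3))
```

Molecular graphics programs perform essentially this reading step before drawing the structure; for real work, an established library or the program's own loader is preferable to a hand-written parser.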