11.2. Primary sequence and three-dimensional structure databases

The most common use of bioinformatics for a biologist (and for a student of biology) is searching the primary molecular biological databases. These mostly comprise sequence and three-dimensional structure databases, as well as experimental datasets produced by high-throughput (HTP) "omics" methods (protein-protein interactions, large-scale mass spectrometry analysis and identification of lipids, sugars or small-molecule metabolites). The best-known nucleotide sequence database is GenBank, which is part of the Entrez bioinformatics web portal.

The most familiar protein sequence database is UniProt, which is part of the ExPASy portal. The vast majority of polypeptide amino acid sequences have been determined as nucleotide sequences and subsequently translated in silico (by bioinformatics tools) using the genetic code table. (Note that protein sequences determined at the amino acid level are more relevant, since quite a few different functional proteins may originate from a single gene, due to e.g. alternative splicing and/or post-translational modifications. In fact, there are many more proteins than genes!) Experimentally determined three-dimensional structures of macromolecules (proteins, nucleic acids as well as protein-protein and protein-nucleic acid complexes) are stored in the Protein Data Bank (PDB). Secondary databases contain data derived from the analysis of sequences and structures, and will be mentioned briefly later in the text.
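The in silico translation mentioned above can be illustrated with a minimal Python sketch. It assumes the standard genetic code, reads the coding strand in the first reading frame only and stops at the first stop codon; the names CODON_TABLE and translate are illustrative and are not taken from any particular tool.

```python
# Standard genetic code, laid out in the conventional T, C, A, G ordering
# (first base varies slowest, third base fastest). "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna: str) -> str:
    """Translate a coding DNA sequence (frame 1) up to the first stop codon."""
    dna = dna.upper()
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        # Codons containing ambiguous bases (e.g. N) are reported as "X".
        residue = CODON_TABLE.get(dna[pos:pos + 3], "X")
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("ATGGCCAAGTAA"))  # prints MAK
```

A sketch like this yields only one possible product per reading frame, which is why it cannot reveal splice variants or post-translational modifications.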

The relationship between the informational macromolecules and the primary bioinformatics databases is summarised in Figure 11.1.

Figure 11.1. Relationship between informational macromolecules and primary bioinformatics databases

11.2.1. GenBank

GenBank (ncbi.nlm.nih.gov/genbank) is a DNA (nucleotide) sequence database maintained by the NCBI (National Center for Biotechnology Information), a US government-sponsored bioinformatics resource that is part of the NIH (National Institutes of Health).

GenBank currently (late 2012) contains ~150 Gbp (150 billion bp) of information in 160 million sequence files. Only original, experimentally derived sequences can be submitted to GenBank. It is a redundant database, meaning that a particular sequence may appear more than once because it has been determined by independent research projects (cloning of a single gene or genome sequencing projects). GenBank continues to grow at an exponential rate, doubling approximately every 18 months.

Presently, the major sources of submitted sequences are genome projects (complete sequencing of the full genetic material of an organism). Up to now, more than a thousand genomes have been sequenced, including our own. The Human Genome Project (www.ornl.gov/Human_Genome), i.e. the sequencing of the 3.2-Gbp human haploid genome (the 23 chromosomes), was finished in 2003. More precisely, only the gene-rich euchromatic regions of the chromosomes (~90%) were sequenced, because the highly repetitive, so-called constitutive heterochromatin (around the centromeres and the telomeres of the chromosomes) cannot be cloned. The human genome sequence, and in fact most genome sequences, are freely available in GenBank and in other databases (e.g. Ensembl, GeneCards).

GenBank is an annotated database, i.e. the sequences are supplemented with explanations or commentaries on their information content (including the coding region, the source of the sequence, and related publications). Nucleic acid sequences and any analysis derived from those sequences can be published only after they have been deposited in a freely accessible database. The main page of NCBI is shown in Figure 11.2, while a sequence entry is shown in Figure 11.3. An online example of a sequence record (that of the human hemoglobin beta chain) is accessible here.
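To get a feeling for what an 18-month doubling time means, the growth can be written as size(t) = size(0) × 2^(t / 18), with t in months. The following Python sketch simply evaluates this formula; it assumes that the doubling time stays constant, which is an extrapolation and not a statement about the actual later size of GenBank. The function name projected_size is illustrative.

```python
def projected_size(initial_gbp: float, months: float,
                   doubling_months: float = 18.0) -> float:
    """Size after `months`, assuming steady exponential growth."""
    return initial_gbp * 2 ** (months / doubling_months)


if __name__ == "__main__":
    # Starting from the ~150 Gbp quoted in the text for late 2012.
    for years in (1.5, 3, 6):
        size = projected_size(150, years * 12)
        print(f"after {years} years: ~{size:.0f} Gbp")
```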

Newly determined nucleotide sequences can be identified and compared against the GenBank database (using the BLAST program described in Chapter), and the results of this analysis are GenBank files identified by an accession code (e.g. D32013 in Figure 11.3).
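An accession code is sufficient to retrieve a record programmatically. The Python sketch below builds a request URL for the NCBI E-utilities "efetch" service, which is not discussed in this chapter and is introduced here as an assumption; the parameter values (db=nuccore, rettype=gb, retmode=text) should be checked against the current NCBI documentation. The function name genbank_record_url is illustrative.

```python
from urllib.parse import urlencode

# Base address of the efetch service (assumed; see the NCBI E-utilities help).
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def genbank_record_url(accession: str) -> str:
    """Return a URL requesting the GenBank flat file of one nucleotide record."""
    params = {
        "db": "nuccore",     # nucleotide collection
        "id": accession,     # e.g. the accession code D32013
        "rettype": "gb",     # GenBank flat-file format
        "retmode": "text",   # plain text rather than XML
    }
    return f"{EFETCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(genbank_record_url("D32013"))
```

The resulting address can be opened in a browser or passed to urllib.request.urlopen; no network access is needed merely to construct it.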

Figure 11.2. The NCBI homepage (http://www.ncbi.nlm.nih.gov/). A few databases that are mentioned in the text are marked.

Figure 11.3. An example GenBank file (DNA polymerase from Thermus aquaticus; accession code D32013)

GenBank is part of the Entrez web portal (www.ncbi.nlm.nih.gov/sites/gquery), a powerful web tool for searching a large number of bioinformatics databases maintained by NCBI. PubMed (ncbi.nlm.nih.gov/pubmed) is a bibliographic database of life sciences and biomedical topics (covering practically all scientific journals in biochemistry and molecular biology). It contains more than 20 million bibliographic records of biomedical publications, including free abstracts. More and more open-access articles are freely available on the original journal websites in HTML format or as downloadable PDF files (directly accessible via PubMed). The Bookshelf online library contains many university textbooks in a fully searchable format (e.g. Stryer: Biochemistry, Lodish et al.: Molecular Cell Biology, Alberts et al.: Molecular Biology of the Cell). Part of the search page of the Entrez portal is shown in Figure 11.4.

Figure 11.4. Search page of the Entrez portal (databases mentioned in the text are highlighted) (www.ncbi.nlm.nih.gov/sites/gquery)

Among the databases accessible via the Entrez search engine, only the most important ones are mentioned here. Genome provides views of a variety of genomes, complete chromosomes and sequence maps from many organisms whose genomes have been fully sequenced; dbEST (Expressed Sequence Tag) contains cDNA (complementary DNA) sequences that were reverse-transcribed from mRNA sequences (transcripts); OMIM (Online Mendelian Inheritance in Man) contains detailed, full-text, referenced overviews of all human Mendelian disorders (> 12,000 genes). The Ensembl (http://www.ensembl.org) database also contains genome sequences. It is maintained as a joint project by the European Bioinformatics Institute (EBI, http://www.ebi.ac.uk/), the European Molecular Biology Laboratory (EMBL) and the British non-profit Wellcome Trust Sanger Institute (named after the British scientist and double Nobel Prize laureate Frederick Sanger, who developed both protein and DNA sequencing methods). EMBL also maintains a nucleotide database (ENA: European Nucleotide Archive), which contains the same information as GenBank.

11.2.2. UniProt

UniProt (uniprot.org) is an annotated, non-redundant amino acid sequence database that actually consists of two sub-databases. The Swiss-Prot division contains only experimentally validated and manually curated (annotated) protein sequences together with references to scientific publications (currently it contains more than 200 million amino acid residues in more than 500,000 annotated sequence files), while the TrEMBL division contains sequences translated automatically from the EMBL nucleic acid database (currently more than 8 billion amino acid residues in approximately 24 million sequence files).

Annotations of UniProt files include alternative versions of the particular sequence (alternatively spliced isoforms), other sequence variations (polymorphisms, mutations, sequence conflicts), information on the protein family to which the sequence belongs, structural and functional elements (motifs) of the polypeptide sequence, post-translational modifications, cross-references to other databases (nucleotide sequence, structural and secondary databases) and, finally, literature references. An important part of the annotation is the so-called Gene Ontology (GO), a standardised vocabulary for describing gene products across species and databases. It covers three attributes of the protein: the cellular component is the biological localisation of the protein (the parts of a cell or of the extracellular environment); the molecular function describes the elementary activities at the molecular level, e.g. binding or catalysis; and finally the biological process describes the functions of integrated living units: cells, tissues, organs and organisms.

An example UniProt file is shown in Figure 11.5 and Figure 11.6 (human skeletal muscle α-actin with accession code P68133). In research publications, protein sequences are referred to by their accession number (six alphanumeric characters). The reader is encouraged to read the short tutorial on the use of the UniProt database.
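Because accession numbers follow a fixed pattern, they can be checked automatically before a database query is made. The Python sketch below uses the regular expression given in the UniProt help pages, quoted here as an assumption to be verified against the current documentation. Note that this pattern also accepts a newer ten-character form in addition to the six-character form described above. The function name is_uniprot_accession is illustrative.

```python
import re

# Accession pattern as published in the UniProt help (assumed current).
UNIPROT_ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def is_uniprot_accession(text: str) -> bool:
    """True if the whole string has the shape of a UniProt accession number."""
    return UNIPROT_ACCESSION.fullmatch(text) is not None


if __name__ == "__main__":
    for candidate in ("P68133", "D32013", "1GFL"):
        print(candidate, is_uniprot_accession(candidate))
```

Of the three identifiers used in this chapter, only P68133 has the shape of a UniProt accession; D32013 is a GenBank accession code and 1GFL is a PDB code. A matching pattern does not guarantee that the entry exists in the database.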

Figure 11.5. Example of a UniProt record (name, general annotation) (http://www.uniprot.org/uniprot/P68133)

Figure 11.6. Example of a UniProt record (secondary structure and amino acid sequence) (http://www.uniprot.org/uniprot/P68133)

The UniProt database is part of the ExPASy (Expert Protein Analysis System; expasy.org/) bioinformatics resource portal, which provides access to scientific databases and software tools for different areas of the life sciences, including proteomics, genomics, transcriptomics and systems biology. Moreover, it is an entry point to many secondary databases. For instance, the proteomics tools include online programs for DNA-to-protein translation, calculation of the molecular mass and isoelectric point of proteins, and prediction of structural and functional motifs, post-translational modifications and three-dimensional structure. A screenshot of the portal is shown in Figure 11.7 (highlighted are databases and tools described in the text).
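The principle behind a molecular mass calculation is simple: the average masses of the amino acid residues are summed and the mass of one water molecule is added for the free N- and C-termini. The Python sketch below illustrates this for an unmodified, linear polypeptide. It is not the implementation used by any ExPASy tool, it ignores post-translational modifications and disulfide bonds, and the names RESIDUE_MASS and molecular_mass are illustrative.

```python
# Average residue masses in daltons (amino acid minus one water molecule).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528  # added once for the terminal H and OH


def molecular_mass(sequence: str) -> float:
    """Average molecular mass (Da) of an unmodified linear polypeptide."""
    try:
        residues = sum(RESIDUE_MASS[aa] for aa in sequence.upper())
    except KeyError as err:
        raise ValueError(f"unknown residue code: {err.args[0]}") from None
    return residues + WATER_MASS


if __name__ == "__main__":
    print(f"{molecular_mass('GG'):.2f} Da")  # glycylglycine
```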

Figure 11.7. Screenshot of the ExPASy bioinformatics resource portal (www.expasy.org)

11.2.3. Protein Data Bank (PDB)

PDB (www.rcsb.org/pdb) is a database of experimentally determined three-dimensional structures of proteins, nucleic acids and their complexes. Currently it stores nearly 80,000 structures determined by X-ray diffraction and approximately 10,000 structures determined by nuclear magnetic resonance (NMR) spectroscopy. (These two methods can be used to determine atomic-resolution structures of biological macromolecules.) The annotated PDB files contain additional useful information beyond the Cartesian atomic coordinates of the three-dimensional structures. The main page of the PDB website and details of an entry are shown in Figure 11.8 and Figure 11.9. PDB entries have a unique four-character identification code consisting of a digit followed by three letters or digits (e.g. 1GFL is the PDB code of a Green Fluorescent Protein structure shown in Figure 11.9). Three-dimensional structures can be visualised online by using the Jmol applet (integrated into web browsers; see Chapter 11.4.3). Alternatively, the PDB file can be downloaded and opened in any of the freely available (open-source) molecular graphics programs (see Chapter 11.4).
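In the classic PDB file format, the Cartesian coordinates are stored in fixed-width ATOM (and HETATM) lines, so they can be read by slicing each line at known column positions. The Python sketch below follows the column layout of the published PDB format description. The example line is made up for illustration and is not taken from entry 1GFL; the names Atom and parse_atom_line are illustrative.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    residue: str
    chain: str
    residue_number: int
    x: float
    y: float
    z: float


def parse_atom_line(line: str) -> Atom:
    """Extract the main fields of one fixed-width ATOM/HETATM record."""
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("not an ATOM/HETATM record")
    return Atom(
        serial=int(line[6:11]),           # columns 7-11
        name=line[12:16].strip(),         # columns 13-16
        residue=line[17:20].strip(),      # columns 18-20
        chain=line[21],                   # column 22
        residue_number=int(line[22:26]),  # columns 23-26
        x=float(line[30:38]),             # columns 31-38, in angstroms
        y=float(line[38:46]),             # columns 39-46
        z=float(line[46:54]),             # columns 47-54
    )


# Illustrative record only (made-up coordinates).
EXAMPLE_LINE = (
    "ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67           N"
)

if __name__ == "__main__":
    print(parse_atom_line(EXAMPLE_LINE))
```

For real work, the molecular graphics programs and libraries discussed later handle the many special cases of the format (alternate locations, multiple models, insertion codes) that this sketch ignores.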

Molecule of the Month is a regular feature that describes the structure and function of an interesting or important molecule. It is part of the PDB-101 interface, an educational resource for exploring a structural view of biology. It is highly recommended to download and study the poster “Molecular Machinery: Tour of the Protein Data Bank”, which illustrates 80 PDB entries (enzymes, membrane proteins, motor proteins, DNA-binding proteins, protein complexes such as ribosomes) alongside water and ATP at a magnification of three million to one.

Figure 11.8. Web page of PDB (http://www.rcsb.org), a repository for three-dimensional structural data of large biological molecules such as proteins and nucleic acids

Figure 11.9. Details of a PDB entry (Green Fluorescent Protein) (http://www.rcsb.org)