11.3. Introduction to bioinformatics analysis of sequences

11.3.1. Bioinformatics tasks during molecular cloning

When we design a recombinant DNA construct (see Chapter 10), it is important to know where the restriction endonuclease recognition sites lie in both the vector and the insert. Finding these sites is a relatively simple bioinformatics task that can be done with online programs such as NEBcutter or RestrictionMapper. It is also useful to draw a map of the recombinant DNA we plan to construct with the help of downloadable or online programs such as pDRAW, BioEdit or SnapGene.
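
To illustrate how simple the underlying search is, the following minimal Python sketch scans a DNA sequence for the recognition sequences of three common enzymes. It is not how NEBcutter or RestrictionMapper are implemented; the function name find_sites and the test sequence are hypothetical. Because the three sites used here are palindromic, scanning one strand is sufficient.

```python
# Recognition sequences (5'->3') of three common type II restriction enzymes.
SITES = {
    "EcoRI": "GAATTC",
    "BamHI": "GGATCC",
    "HindIII": "AAGCTT",
}


def find_sites(dna, sites=SITES):
    """Return {enzyme: [0-based start positions]} of each recognition site."""
    dna = dna.upper()
    hits = {}
    for enzyme, site in sites.items():
        positions = []
        start = dna.find(site)
        while start != -1:
            positions.append(start)
            # Continue one base further so overlapping sites are also reported.
            start = dna.find(site, start + 1)
        hits[enzyme] = positions
    return hits


# Hypothetical insert sequence, used only for illustration.
example = "AAGAATTCGGGGATCCTTGAATTC"
for enzyme, positions in find_sites(example).items():
    print(enzyme, positions)
```

An enzyme with an empty position list does not cut the sequence, which is often exactly what we look for when choosing cloning sites.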

Another typical bioinformatics task is the design of oligonucleotide primers for sequencing, polymerase chain reaction (PCR) or site-specific in vitro mutagenesis. This can also be achieved by online programs such as Primer3, Oligo or OligoCalc.
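
Two quantities that primer design programs typically report are the GC content and an estimated melting temperature (Tm). The sketch below uses the simple Wallace rule, Tm = 2(A+T) + 4(G+C) in °C, which is only a rough estimate for short oligonucleotides; programs such as Primer3 use more elaborate models. The function names and the example primer are hypothetical.

```python
def gc_content(primer):
    """Fraction of G and C bases in the primer (0.0 to 1.0)."""
    primer = primer.upper()
    return (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer):
    """Rough melting temperature in degrees Celsius (Wallace rule)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


# Hypothetical 12-mer, used only for illustration.
primer = "ATGCATGCATGC"
print(f"GC content: {gc_content(primer):.0%}, Tm (Wallace): {wallace_tm(primer)} C")
```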

11.3.2. Sequence similarity search and sequence alignment

We can do a similarity search to learn whether our sequenced DNA can be found in a public nucleotide database (i.e. it has already been cloned by others) and/or whether it is evolutionarily related (i.e. homologous) to other sequences. In a simple similarity search, one compares a sequence with all sequences in an entire nucleotide database (see the BLAST program below), while for a homology search the method of choice is multiple sequence alignment by the ClustalW program. By comparing either nucleotide or amino acid sequences we can find homologs. Homologs that are found in different species (which had a common ancestor) and have identical or similar functions are called orthologs, while homologs that are found in the same organism and originate from a gene duplication event followed by divergent evolution within the species are called paralogs. We will not cover the construction of evolutionary trees in this e-book; one can learn about these in bioinformatics or evolutionary biology courses.

11.3.2.1. The BLAST program

If we sequence a DNA clone, the first bioinformatics analysis is a similarity search against a nucleotide database. The most widely used similarity search program accessible on the internet is BLAST (Basic Local Alignment Search Tool), which will be described here and will be used by the students during the laboratory practice. The BLAST program is available online at several servers including the one at NCBI: http://blast.ncbi.nlm.nih.gov/Blast.cgi.

BLAST uses a heuristic algorithm that makes it possible to search a huge database with a query sequence in a very short period of time. The high speed of the algorithm stems from the fact that the query sequence is divided into short "words" that are used, instead of the full-length sequence, in the first stage of the alignment process. These words are first searched for in the database (a step called "seeding", which locates short local matches). The most relevant hits are then scored with the help of a scoring matrix, extended to neighbouring words, and finally assembled and compiled into a final list of similarity hits. Query sequences must be supplied in the so-called FASTA format (the format is named after FASTA, a previously popular but much slower similarity search program). The FASTA format is shown in Figure 11.10.
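
The word-based seeding idea can be sketched in a few lines of Python. This is a strongly simplified illustration, not the BLAST implementation: it reports exact word matches only, whereas BLAST also scores words with a matrix and then extends the seeds. The function name seed_hits and the word size w are assumptions made for this example.

```python
def seed_hits(query, subject, w=4):
    """Return (query_pos, subject_pos) pairs where a word of length w matches exactly."""
    # Index every word of the subject (database) sequence once.
    index = {}
    for j in range(len(subject) - w + 1):
        index.setdefault(subject[j:j + w], []).append(j)

    # Look up each query word in the index instead of comparing full sequences.
    hits = []
    for i in range(len(query) - w + 1):
        for j in index.get(query[i:i + w], []):
            hits.append((i, j))
    return hits


# Hypothetical sequences, used only for illustration.
print(seed_hits("ACGTAC", "TTACGTTT", w=4))
```

Looking words up in a prebuilt index is what makes this step fast: the cost grows with the number of words, not with the number of possible full-length alignments.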


Figure 11.10. The FASTA sequence format
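
As commonly used, a FASTA record consists of a header line beginning with ">" followed by one or more lines of sequence. Assuming this layout, a minimal reader can be sketched as follows; the function name parse_fasta and the two records are hypothetical and serve only as an illustration.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip empty lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical records, used only for illustration.
example = """>seq1 hypothetical example
ATGC
GGTA
>seq2
TTAA
"""
for name, sequence in parse_fasta(example):
    print(name, len(sequence))
```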

If we want to search using a nucleotide query sequence within a nucleotide database, we can use the BLASTN version of the program. If we have an amino acid sequence, we can search a protein database by the BLASTP version of the program. The BLASTX version of the program translates a nucleotide sequence in all six reading frames (three on each strand) and allows searching a protein database. Finally, with the TBLAST subprogram, we can search against a translated nucleotide database using either a protein (TBLASTN) or a nucleotide (TBLASTX) query sequence. These similarity search options are summarised in Figure 11.11.
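
The six-frame translation performed by BLASTX can be illustrated with the sketch below. It assumes the standard genetic code and is not taken from the BLAST source; the function names and the short example sequence are hypothetical. Incomplete codons at the end of a frame are ignored, and codons containing unknown bases are shown as "X".

```python
from itertools import product

# Standard genetic code in the conventional TCAG order; '*' marks stop codons.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of a DNA sequence."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons starting at the first base."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Return the translations of the three frames on each strand."""
    frames = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


# Hypothetical sequence, used only for illustration.
for frame, protein in six_frames("ATGGCCTAA").items():
    print(frame, protein)
```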


Figure 11.11. Search possibilities in the BLAST program

The result of a BLAST analysis is a list of sequences from the searched database that show significant similarity to the query sequence. Besides the sequence identifiers of the similar sequence hits in the database, the final list of alignments contains a score and a measure of statistical significance, the E-value. The E-value is the number of hits with at least the given score that one can expect to see by chance when searching a database of a particular size. It decreases exponentially as the score (S) of the match increases. Essentially, the E-value describes the random background noise. The lower the E-value, i.e. the closer it is to zero, the more "significant" the match (E < 0.01 is usually considered to reflect a homologous, i.e. evolutionarily related, sequence). The score is calculated from the alignment, taking into account the gaps and the similarity of the amino acids at the aligned positions. The most often used similarity matrix (an amino acid substitution matrix) is the BLOSUM (BLOcks SUbstitution Matrix) matrix. The numbers within a BLOSUM matrix are "log-odds" scores: for each pair of amino acids, the logarithm of the ratio between the likelihood that the two residues are aligned for a biological reason and the likelihood that they are aligned by chance.
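
The two quantitative statements above can be made concrete with a small sketch. For local alignments the expected number of chance hits is commonly written as E = K·m·n·e^(−λS), where m and n are the lengths of the query and of the database, and K and λ are statistical parameters that depend on the scoring system. A log-odds score is the scaled logarithm of the observed pair frequency divided by the frequency expected by chance. All numerical values below, including K, lam and the frequencies, are illustrative assumptions and not the parameters of any real matrix or database.

```python
import math


def expected_hits(raw_score, query_len, db_len, K, lam):
    """E-value: number of chance hits expected with at least this raw score."""
    return K * query_len * db_len * math.exp(-lam * raw_score)


def log_odds(p_pair, q_a, q_b, scale=2.0):
    """Rounded log-odds score of aligning residues a and b.

    p_pair   : observed frequency of the aligned pair in trusted alignments
    q_a, q_b : background frequencies of the two residues
    scale    : 2.0 expresses the score in half-bit units (an assumption here)
    """
    return round(scale * math.log2(p_pair / (q_a * q_b)))


# Illustrative parameters only; real values depend on matrix and gap costs.
K, lam = 0.1, 0.3
for score in (40, 50, 60):
    e_value = expected_hits(score, query_len=300, db_len=1_000_000, K=K, lam=lam)
    print(score, f"{e_value:.2e}")

# A pair seen four times more often than expected by chance scores positively,
# a pair seen four times less often scores negatively.
print(log_odds(0.04, 0.1, 0.1), log_odds(0.0025, 0.1, 0.1))
```

Note that the E-value grows in proportion to the database size, so the same alignment score is less significant in a larger database.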

The similarity hits can be found and downloaded from the database using their accession number (identifier). BLAST hits are usually hyperlinked directly to the corresponding entries in the GenBank database where we can learn much more about the related sequences, the gene, cDNA and/or the coded protein. As we have already mentioned, the most comprehensive information on a given protein can be found in the UniProt database. In Figure 11.12, a detail of a BLAST run is shown in which the BLASTP program was used to search the UniProt database using a human skeletal actin query sequence.


Figure 11.12. Result of a sequence similarity search by the BLAST program (human skeletal muscle actin was used as a query sequence against the UniProt database)

It is important to note that, since 3-D structure is more conserved than primary structure, it is easier to recognise two related proteins by comparing their three-dimensional structures than by comparing their amino acid sequences. In practice, however, it is more convenient to compare primary sequences, since these are available for many more proteins than atomic-resolution structures are. Similarity searches and protein structure comparisons are dealt with in more detail in bioinformatics (or structural bioinformatics) courses.

11.3.2.2. Multiple sequence alignment

More detailed sequence similarity analysis can be performed by creating multiple sequence alignments. By this method, three or more biological sequences (those of proteins or nucleic acids) of similar length are aligned to minimise gaps (insertions or deletions in one sequence compared to the others) and maximise the occurrence of identical or similar residues at the aligned positions.

From the output alignments, homology (i.e. an evolutionary relationship) between the sequences can be inferred. Moreover, the presence of conserved regions indicates conserved structural and/or functional elements (motifs) within the sequence. The most often used program for multiple sequence alignment is ClustalW, which can be reached at the ExPASy portal (embnet.vital-it.ch/software/ClustalW.html) or via the web page of the European Bioinformatics Institute (ebi.ac.uk/Tools/msa/clustalw2/).

In Figure 11.13, the human alpha and beta hemoglobin sequences as well as the myoglobin sequence are aligned by the ClustalW program. The asterisks in the bottom line of the alignment indicate identical (fully conserved, i.e. invariant) residues at a given sequence position, while double dots (colons) and single dots (periods) mark highly and moderately conserved (chemically similar) residues, respectively. Within the aligned sequences, the dashes indicate the "gaps" that were inserted in order to optimise the alignment.
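
How such a conservation line can be derived from an alignment is sketched below. The sketch marks invariant columns with "*" and columns whose residues all fall into one group of chemically similar amino acids with ":". The groups listed are an assumption, given as the "strong" groups we recall from the Clustal documentation; the weaker "." category is omitted. The function name and the three aligned fragments are hypothetical.

```python
# Assumed groups of chemically similar residues (the "strong" Clustal groups).
STRONG_GROUPS = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]


def conservation_line(aligned):
    """Return one mark per column: '*' identical, ':' similar, ' ' otherwise."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("aligned sequences must have equal length")
    marks = []
    for column in zip(*aligned):
        residues = set(column)
        if "-" in residues:
            marks.append(" ")  # columns containing a gap are not marked
        elif len(residues) == 1:
            marks.append("*")  # invariant position
        elif any(residues <= set(group) for group in STRONG_GROUPS):
            marks.append(":")  # all residues belong to one similarity group
        else:
            marks.append(" ")
    return "".join(marks)


# Hypothetical aligned fragments, used only for illustration.
alignment = ["MKV-L", "MRVAL", "MKIGF"]
for seq in alignment:
    print(seq)
print(conservation_line(alignment))
```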


Figure 11.13. Three polypeptide chains from the globin family aligned by the ClustalW program and shown together with their UniProt accession code (and short name in the database)

11.3.3. Bioinformatics analysis of protein sequences

The wide range of in silico analysis possibilities of protein sequences is summarised in Figure 11.14. Note that many of these analyses can also be performed with nucleic acid sequences. Sequences can be compared to each other and to full databases, and the physical and structural/functional properties of polypeptide chains can be predicted. Sequence comparisons (alignments) were described in the previous section (BLAST and ClustalW programs). In the so-called profile analysis, the analysed sequences are compared to secondary databases that contain information about protein structural families, structural and functional domains, modules, and the consensus sequences of phosphorylation, glycosylation and other posttranslational modifications. Many online programs are available that can search secondary databases. For instance, the InterProScan profile analysis program can be used to search the InterPro secondary database maintained by the EBI (in fact, InterPro is a "superdatabase" of several individual derived databases). Another example is the PhosphoSitePlus database, which can be searched with any query sequence to predict phosphorylation or other posttranslational modification sites.
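
Searching for a consensus sequence is the simplest form of such an analysis. As an illustration, the sketch below looks for the N-glycosylation consensus, written in PROSITE notation as N-{P}-[ST]-{P} (asparagine, any residue except proline, serine or threonine, any residue except proline). It is not how InterProScan works internally; the function name and the example sequence are hypothetical. A match only indicates a potential site, since the presence of the consensus does not prove that the residue is actually modified.

```python
import re

# N-{P}-[ST]-{P} as a regular expression; the lookahead allows overlapping sites.
N_GLYCOSYLATION = re.compile(r"(?=(N[^P][ST][^P]))")


def find_n_glycosylation(protein):
    """Return (1-based position of the Asn, matched motif) for each potential site."""
    return [
        (match.start() + 1, match.group(1))
        for match in N_GLYCOSYLATION.finditer(protein.upper())
    ]


# Hypothetical sequence, used only for illustration.
print(find_n_glycosylation("MNGSAANPTLNKTG"))
```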


Figure 11.14. The wide range of in silico analysis possibilities of protein sequences. (Most of these options are also available for nucleic acid sequences.)