Pairwise Alignment
The alignment of biological sequences for the purpose of assessing the degree of similarity (and conservation, in the case of amino acid sequences) and the possibility of homology began in the early 1970's. Since, it has become the core of numerous applications in sequence analysis including the functional annotation of genes and proteins, the analysis of protein domains, and the prediction of 3D structure due to homology. Many sophisticated computational methods in molecular biology such as multiple alignments, profile analysis, and threading use a pair-wise sequence alignment as a subprocedure.

A global alignment is the alignment of two sequences over their entire length. The first dynamic programming algorithm for the global alignment of two sequences was introduced by Needleman and Wunsch (1970). A local alignment is the alignment of some portion of two sequences. Smith and Waterman (1981) modified the Needleman-Wunsch method to calculate the score of the best alignment between two proteins. Thus, the optimal local alignment between a pair of sequences involves a simple modification to the Needleman-Wunsch method in which only the highest-scoring sub-segments of the two sequences are aligned.

A number of programs have been written to rapidly search a database for the sequence in question, the query sequence. The two most commonly used programs are BLAST (Altschul, et al. 1990) and FASTA (Lipman and Pearson, 1985). These programs are an ideal starting point to determine whether a related sequence, or a family of sequences, already exists in a database. The results from these programs will provide evidence of function, utility, and completeness of the gene product.

FASTA

The FASTA program sets a size k for k-tuple subwords. The program then looks for diagonals in the comparison matrix between query and search sequence along which many k-tuples match. This can be done very quickly based on a preprocessed list of k-tuples contained in the query sequence. The set of k-tuples can be identified with an array whose length corresponds to the number of possible tuples of size k. This array is linked to the indices where the particular k-tuples occur in the query sequence. Note that a matching k-tuple at index i in the query and at index j in the database sequence can be attributed to a diagonal by subtracting the one index from the other. Therefore, when inspecting a new sequence for similarity, one walks along this sequence inspecting each k-tuple. For each of them one looks up the indices where it occurs in the query, computes the index-difference to identify the diagonal and increases a counter for this diagonal. After inspecting the search sequence in this way a diagonal with a high count is likely to contain a well-matching region. In terms of the execution time, this procedure is only linear in the length of the database sequence and can easily be iterated for a whole database. Of course this rough outline needs to be adapted to focus on regions on diagonals where the match density is high and link nearby, good diagonals into alignments.

You can run FASTA on your own sequence by clicking on FASTA@UniversityofVirginia.

BLAST

The other widely used program to search a database is called BLAST (Basic Local Alignment Search Tool). BLAST follows a similar scheme in that it relies on a core similarity, although with less emphasis on the occurrence of exact matches. This program also aims at identifying core similarities for later extension. The core similarity is defined by a window with a certain match density on DNA or with an amino acid similarity score above some threshold for proteins. Independent of the exact definition of the core similarity, BLAST rests on the precomputation of all strings which are in the given sense similar to any position in the query. The resulting list may be on the order of thousand or more words long, each of which if detected in a database give rise to a core similarity. In BLAST nomenclature this set of strings is called the neighborhood of the query. The code to generate this neighborhood is in fact exceedingly fast.

The NCBI has an extensive BLAST tutorial. You can also visit their BLAST service web servers to run a variety of BLAST programs on a large number of up-to-date DNA and protein databases.

FASTA vs BLAST
  • BLAST and FASTA are equivalent for highly similar sequences.
  • BLAST is faster than FASTA without significant loss of ability to find the similar database sequences.
  • FASTA may be better for less similar sequences.
Some drawbacks to pairwise sequence similarity
  • BLAST and FASTA algorithms are suitable for determining highly similar sequences, but are not sensitive enough to capture highly divergent sequences. Thus, evolutionarily diverse members of a family of proteins may be overlooked.
  • Many proteins are multifunctional multi-domain proteins. Therefore, a high hit to one domain does not necessarily define function.
  • Best hit is frequently a hypothetical protein, poorly annotated, or has different function.
Note

Within the last decade, the sensitivity of sequence searching techniques has been improved by profile-based or motif-based analysis, which uses information derived from multiple sequence alignments to construct and search for sequence patterns. Unlike pairwise similarity, a profile or motif can exploit additional information, such as the position and identity of residues that are conserved throughout the family, as well as variable insertion and deletion probabilities.

Sloan-Kettering Institute Memorial Sloan-Kettering Cancer Center