GeneQuiz is a tool that, given a DNA or protein sequence from a known or unknown gene, runs numerous queries and calculations in batch and then integrates the alternate transcript, sequence comparison, function and structure predictions into a single report.
ProteinKeys is a tool that, given a multiple sequence alignment of a protein family and an optional protein structure, predicts functional residues in a protein based on entropy analysis. ProteinKeys includes a multiple alignment viewer and an integrated 3D protein structure viewer (Jmol).
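As an illustration of what entropy analysis of an alignment can look like, the sketch below computes the Shannon entropy of each alignment column; low-entropy columns are strongly conserved and are therefore candidate functional positions. This is a generic sketch and not ProteinKeys' actual method: the function name `column_entropies`, the `ignore_gaps` option, and the use of `-` as the gap character are assumptions made for the example.

```python
import math
from collections import Counter


def column_entropies(alignment, ignore_gaps=True):
    """Return the Shannon entropy (in bits) of every column of an alignment.

    `alignment` is a list of equal-length strings, one per aligned sequence.
    Gap handling via '-' is an assumption of this sketch.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")

    entropies = []
    for col in range(length):
        residues = [row[col] for row in alignment]
        if ignore_gaps:
            residues = [r for r in residues if r != "-"]
        counts = Counter(residues)
        total = sum(counts.values())
        entropy = 0.0
        for n in counts.values():
            p = n / total
            entropy -= p * math.log2(p)
        # A column made only of gaps ends up with entropy 0.0 here.
        entropies.append(entropy)
    return entropies


if __name__ == "__main__":
    toy_alignment = ["ACDE", "ACDF", "ACEG", "ACEH"]
    for position, h in enumerate(column_entropies(toy_alignment)):
        print(position, round(h, 3))
```

Because an all-gap column also scores 0.0 in this sketch, a real analysis would need to filter such columns before ranking positions by conservation.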
BLAST (Basic Local Alignment Search Tool) follows a scheme similar to that of FASTA in that it relies on a core similarity, although with less emphasis on the occurrence of exact matches: it first identifies core similarities and then extends them. The NCBI has an extensive BLAST tutorial. BLAST is significantly faster than FASTA, without loss of sensitivity or specificity for closely related sequences.
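The idea of extending a core similarity can be sketched as an ungapped extension around a seed match. The code below is a heavily simplified illustration, not BLAST itself: it assumes an exact seed, uses hypothetical match and mismatch scores instead of a substitution matrix, and stops extending once the running score falls more than `x_drop` below the best score seen. The function name `ungapped_extend` and all parameter names are invented for this example.

```python
def ungapped_extend(query, subject, q_start, s_start, word_len,
                    match=1, mismatch=-2, x_drop=3):
    """Extend an exact seed in both directions without gaps.

    The seed is query[q_start:q_start + word_len], assumed to equal
    subject[s_start:s_start + word_len].
    Returns (score, q_begin, q_end, s_begin, s_end), end-exclusive.
    """
    def walk(qi, si, step):
        # Walk away from the seed and remember the best-scoring length.
        score = best = best_len = n = 0
        while 0 <= qi < len(query) and 0 <= si < len(subject):
            score += match if query[qi] == subject[si] else mismatch
            n += 1
            if score > best:
                best, best_len = score, n
            elif best - score > x_drop:
                break  # score dropped too far below the best: give up
            qi += step
            si += step
        return best, best_len

    right_score, right_len = walk(q_start + word_len, s_start + word_len, +1)
    left_score, left_len = walk(q_start - 1, s_start - 1, -1)
    score = word_len * match + left_score + right_score
    return (score,
            q_start - left_len, q_start + word_len + right_len,
            s_start - left_len, s_start + word_len + right_len)


if __name__ == "__main__":
    q = "TTGACGTACC"
    s = "AAGACGTAGG"
    # The seed "ACGT" starts at position 3 in both sequences.
    print(ungapped_extend(q, s, 3, 3, 4))
```

In this toy run the seed grows by one matching residue on each side before mismatches end the extension.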
The FASTA program sets a size k for k-tuple subwords. It then looks for diagonals in the comparison matrix between the query and the database sequence along which many k-tuples match. This search is fast because it uses a preprocessed list of the k-tuples contained in the query sequence. FASTA is slower than BLAST but may generate better results for more distantly related sequences.
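The k-tuple lookup and diagonal counting described above can be sketched as follows. This is a minimal illustration of the first stage only, under the assumption that a diagonal is identified by the offset `j - i` between a position `j` in the database sequence and a position `i` in the query; the function names are made up for the example, and the rescoring and joining stages of the real program are not shown.

```python
from collections import defaultdict


def ktuple_index(query, k):
    """Preprocess the query: map each k-tuple to its start positions."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    return index


def diagonal_hits(query, subject, k=2):
    """Count matching k-tuples on each diagonal (offset j - i)."""
    index = ktuple_index(query, k)
    hits = defaultdict(int)
    for j in range(len(subject) - k + 1):
        # .get avoids inserting new keys into the defaultdict.
        for i in index.get(subject[j:j + k], ()):
            hits[j - i] += 1
    return dict(hits)


def best_diagonals(query, subject, k=2, top=3):
    """Return the `top` diagonals with the most k-tuple matches."""
    hits = diagonal_hits(query, subject, k)
    return sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))[:top]


if __name__ == "__main__":
    print(best_diagonals("ACGTACGT", "TTACGTACGT", k=4))
```

Building the index once per query is what makes the scan over each database sequence a single pass with constant-time lookups.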
ClustalW is one of the most widely used of the progressive alignment strategies. The idea is to take an initial, approximate, phylogenetic tree between the sequences and to gradually build up the alignment, following the order in the tree.
The DIALIGN program implements a local approach that constructs multiple alignments from segment-to-segment comparisons instead of residue-to-residue comparisons. The basic idea is to build sequence alignments by comparing whole segments (i.e. uninterrupted stretches of residues) rather than single residues. Initially, all pairwise alignments are performed and all aligned ungapped regions are collected. The name DIALIGN comes from these regions, which appear as diagonals on a dot plot. A consistent set of diagonals with the maximum sum of weights is then determined.
Partial order alignment (POA) is a new progressive approach to multiple sequence alignment (Lee et al., 2002). It uses partially ordered graphs, as opposed to generalized profiles, to represent aligned sequences. Unlike generalized profiles, partially ordered graphs can represent global cut-and-paste operations, which in theory reflect the biological content of multiple alignments more accurately. No evolutionary tree is used in this method to guide the order in which sequences are aligned. Problems caused by the inherent loss of information in generalized profiles are therefore avoided. The two most similar sequences are determined and aligned, and all other sequences are added to this one profile in a stepwise fashion.
T-Coffee is a progressive alignment strategy with an ability to consider information from all of the sequences during each alignment step, not just those being aligned at that stage. Thus, it attempts to minimize the greediness of progressive alignment strategies.
Blocks is a large collection of ungapped multiple sequence alignments corresponding to the most conserved regions of protein families.
CDD is NCBI's Conserved Domain Database. Proteins often contain several modules or domains, each with a distinct evolutionary origin and function. CDD currently contains domains derived from two popular collections, SMART and Pfam, plus contributions from colleagues at NCBI, such as COG. The source databases also provide descriptions and links to citations. Since conserved domains correspond to compact structural units, CDs contain links to 3D structure via Cn3D whenever possible. The CD-Search service may be used to identify the conserved domains present in a protein sequence.
InterPro is an integrated documentation resource for protein families, domains and sites.
PROSITE is the oldest of the sequence-motif databases. This database uses single consensus patterns and profiles to characterize each family of sequences.
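To show how a consensus pattern characterizes a family, the sketch below converts a pattern written in PROSITE syntax into a Python regular expression and scans a sequence with it. It covers only the core of the syntax (`x` for any residue, `[...]` for alternatives, `{...}` for excluded residues, `(n)` or `(n,m)` for repeats, and `<` or `>` anchors at the very ends) and does not handle profiles or anchors inside brackets. The function names and the toy sequence are assumptions made for the example; the pattern `N-{P}-[ST]-{P}` is the N-glycosylation site pattern.

```python
import re

_ELEMENT = re.compile(
    r"([A-Zx]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?"
)


def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a Python regex string."""
    pattern = pattern.strip().rstrip(".")
    prefix = suffix = ""
    if pattern.startswith("<"):   # anchored at the N-terminus
        prefix, pattern = "^", pattern[1:]
    if pattern.endswith(">"):     # anchored at the C-terminus
        suffix, pattern = "$", pattern[:-1]

    parts = []
    for element in pattern.split("-"):
        m = _ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = m.groups()
        if core == "x":
            core = "."                     # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # excluded residues
        if low and high:
            core += "{" + low + "," + high + "}"
        elif low:
            core += "{" + low + "}"
        parts.append(core)
    return prefix + "".join(parts) + suffix


def find_sites(pattern, sequence):
    """Return (start, matched_text) for non-overlapping matches."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    print(prosite_to_regex("N-{P}-[ST]-{P}"))
    print(find_sites("N-{P}-[ST]-{P}", "MKNGSAANPSQ"))
```

Note that `finditer` reports non-overlapping matches only, so overlapping sites would need a lookahead-based scan.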
Pfam is a large collection of multiple sequence alignments which utilizes profile hidden Markov models to find common protein domains and families.
Database of combinations of conserved motifs ("fingerprints") used to characterize a protein family refined by iterative scanning of a SWISS-PROT/TrEMBL composite. Fingerprints can encode protein folds and functionalities more flexibly and powerfully than can single motifs, full diagnostic potency deriving from the mutual context provided by motif neighbors.
ProDom is an automated collection of protein domains and families.
SMART (a Simple Modular Architecture Research Tool) uses profile hidden Markov models (HMMs) to find commonly occurring protein domains. This tool allows for the identification and annotation of genetically mobile domains and the analysis of domain architectures. More than 500 domain families found in signaling, extracellular and chromatin-associated proteins are detectable. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. Each domain found in a non-redundant protein database as well as search parameters and taxonomic information are stored in a relational database system. User interfaces to this database allow searches for proteins containing specific combinations of domains in defined taxa.
Repository for 3D biological macromolecular structure data.
Repository providing structural (and hence implied functional) assignments to protein sequences at the superfamily level. A superfamily contains all proteins for which there is structural evidence of a common evolutionary ancestor.
Web tool for the prediction of membrane-spanning regions and their orientation. The algorithm is based on the statistical analysis of TMbase, a database of naturally occurring transmembrane proteins. The prediction is made using a combination of several weight-matrices for scoring.
Web tool for the topology prediction of membrane proteins.
Web tools for the subcellular localization prediction of proteins.
Web tool for comparing a protein sequence to a genomic DNA sequence, allowing for introns and frameshifting errors.
Web tool for aligning a transcribed and spliced DNA sequence (mRNA, EST) with a genomic sequence containing that gene, allowing for introns in the genomic sequence (taking into account consensus splice signals) and a relatively small number of sequencing errors.






