OrthologID was developed as a collaborative project by the New York Plant Genomics Consortium, and supported by NSF Plant Genome Grant DBI-0421604, NSF SGER Grant DBI-0326436, and NSF Plant Genome Grant DBI-0922738.
GFC clusters genes from complete genomes into gene families. GFC searches each ingroup gene against both ingroup and outgroup genomes using NCBI BLAST (Altschul et al. 1997). For clustering purposes, an expectation value (e-value) cutoff of 1e-20 is used. For a pair of genes g1 and g2, g1 is defined as clusterable with g2 if the e-value in the BLAST search of g1 against g2 is within this cutoff and the alignable regions of the two genes cover at least 80% of the longer sequence. The latter criterion avoids clustering genes that share only one structural domain with high sequence similarity. A gene g is a member of the gene family F if at least one other gene in F is clusterable with g. After performing all-against-all BLAST searches, GFC randomly picks a gene g from one of the ingroup genomes and looks for clusterable genes in the BLAST result of g. Each clusterable gene is added to the current family, and that gene’s BLAST result is in turn searched for new members. This process is repeated until no more genes can be added to the current family. GFC then starts a new gene family, and the above steps are repeated.
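The clusterable test described above can be sketched as a small predicate. This is only an illustration: the `BlastHit` record and its field names are hypothetical, and measuring the alignable region as a single aligned length is an assumption, not a description of how GFC parses BLAST output.

```python
from dataclasses import dataclass

E_VALUE_CUTOFF = 1e-20        # cutoff stated in the text
MIN_ALIGNED_FRACTION = 0.80   # fraction of the longer sequence, stated in the text


@dataclass(frozen=True)
class BlastHit:
    """One BLAST hit of a query gene against a subject gene (hypothetical record)."""
    query: str
    subject: str
    e_value: float
    aligned_length: int   # assumed: length of the alignable region
    query_length: int
    subject_length: int


def is_clusterable(hit: BlastHit) -> bool:
    """Return True if the hit passes both the e-value and the coverage criteria."""
    longer = max(hit.query_length, hit.subject_length)
    return (hit.e_value <= E_VALUE_CUTOFF
            and hit.aligned_length >= MIN_ALIGNED_FRACTION * longer)
```

A hit with a very low e-value that covers only one shared domain of a long protein fails the second criterion, which is the case the 80% rule is meant to exclude.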
Algorithmically, GFC treats each gene g_i as a vertex in a directed graph G. An edge exists from g_i to g_j if g_j is clusterable with g_i. The clustering algorithm starts with a vertex that has not been visited, and traverses the graph G in a depth-first manner. Each gene encountered during the traversal is added to the current family F. If a gene that belongs to a previously constructed gene family F’ is encountered, F’ is merged into F. This process is iterated until all vertices in G have been visited. Gene family membership in the OrthologID database is based on the above criteria.
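The traversal can be sketched as follows, assuming the clusterable relation has already been computed and is supplied as a mapping from each gene to the genes it has edges to. The function and variable names are illustrative, and the sketch visits genes in sorted order for reproducibility rather than picking them at random.

```python
from typing import Dict, List, Set


def cluster_families(edges: Dict[str, Set[str]]) -> List[Set[str]]:
    """Group genes into families by depth-first traversal of a directed graph."""
    genes = set(edges)
    for targets in edges.values():
        genes.update(targets)

    family_of: Dict[str, int] = {}      # gene -> id of the family it belongs to
    families: Dict[int, Set[str]] = {}  # family id -> member genes
    next_id = 0

    for start in sorted(genes):
        if start in family_of:
            continue                    # already visited in an earlier traversal
        current: Set[str] = set()
        stack = [start]
        while stack:
            gene = stack.pop()
            if gene in current:
                continue
            previous = family_of.get(gene)
            if previous is not None:
                # Gene belongs to an earlier family F': merge F' into F.
                current |= families.pop(previous)
                continue
            current.add(gene)
            stack.extend(edges.get(gene, ()))
        families[next_id] = current
        for gene in current:
            family_of[gene] = next_id
        next_id += 1

    return list(families.values())
```

The merge step matters because edges are directed: a gene reached only from a later starting point can still link a new traversal to a family built earlier.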
The alignment constructor makes use of the MAFFT L-INS-i algorithm (MAFFT version 5; Katoh 2005), which is an iterative refinement method with local pairwise alignment information. The alignment constructor uses different sets of alignment parameters to create three different alignments for each gene family. The three pairs of gap open penalty and offset values are (1.53, 0.123), (2.4, 0.1), and (1.0, 0.2). Alignments are compared and alignment-ambiguous regions culled (Gatesy 1993). The resulting, culled alignment is then passed on to the tree builder.
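As a rough illustration, the three alignment runs could be expressed as MAFFT command lines. The sketch below only builds the argument lists and does not run anything. It assumes the flag names of recent MAFFT releases (`--localpair` with `--maxiterate 1000` for L-INS-i, `--op` for the gap open penalty, `--ep` for the offset value); these may not match the exact invocation used with MAFFT version 5, and the iteration count is not stated in the text.

```python
from typing import List

# (gap open penalty, offset value) pairs listed in the text
GAP_PARAMETERS = [(1.53, 0.123), (2.4, 0.1), (1.0, 0.2)]


def mafft_commands(fasta_path: str) -> List[List[str]]:
    """Build one L-INS-i command line per parameter pair (flag names are assumptions)."""
    commands = []
    for gap_open, offset in GAP_PARAMETERS:
        commands.append([
            "mafft", "--localpair", "--maxiterate", "1000",
            "--op", str(gap_open), "--ep", str(offset),
            fasta_path,
        ])
    return commands
```

Each list could be passed to `subprocess.run` with standard output captured, since MAFFT writes the alignment to standard output.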
The Tree Builder module generates phylogenetic trees within a parsimony framework. Where possible (for small gene families with fewer than 12 sequences), exhaustive or branch-and-bound tree searches are performed (as implemented in PAUP*; Swofford 2003). For alignments with larger numbers of sequences, tree space is rigorously explored using the parsimony ratchet (Nixon 1999). Each iteration of a ratchet starts with a limited TBR search to generate an initial tree. This tree is used as a “starting tree” for a search with 10-15% of characters reweighted. The shortest tree is again used as a starting tree to perform another TBR search with all the weights reset. Each ratchet consists of 200 such iterations. The Tree Builder computes 20 ratchets and performs a final TBR swap on the best trees, in order to visit multiple islands of tree space. Where more than one equally parsimonious tree results from the analysis, a strict consensus is computed.
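Two pieces of this procedure are easy to sketch: the choice between an exact search and the ratchet, and the perturbation step that reweights 10-15% of the characters. The code below is a minimal illustration and not the PAUP* commands the Tree Builder issues. The function names are hypothetical, and the boosted weight of 2 is an assumption, since the text does not state the new weight.

```python
import random
from typing import List

EXACT_SEARCH_LIMIT = 12        # families with fewer sequences get an exact search
ITERATIONS_PER_RATCHET = 200   # from the text
NUM_RATCHETS = 20              # from the text


def choose_search(num_sequences: int) -> str:
    """Pick the search strategy from the family size."""
    return "exact" if num_sequences < EXACT_SEARCH_LIMIT else "ratchet"


def ratchet_weights(num_characters: int, rng: random.Random,
                    low: float = 0.10, high: float = 0.15,
                    boosted_weight: int = 2) -> List[int]:
    """Return per-character weights with a random 10-15% of characters upweighted."""
    fraction = rng.uniform(low, high)
    count = max(1, round(fraction * num_characters))
    chosen = set(rng.sample(range(num_characters), count))
    return [boosted_weight if i in chosen else 1 for i in range(num_characters)]
```

In a full ratchet iteration, a TBR search would run under these perturbed weights, and the shortest tree found would then seed a second TBR search with every weight reset to 1.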
The Diagnostics Generator identifies diagnostic characters for orthologous groups using the CAOS algorithm (Sarkar et al. 2002).
Complete Genomes (ingroup):
Complete Genomes (outgroup): Chlamydomonas reinhardtii (source: JGI)
Incomplete Genome (outgroup): Physcomitrella patens (source: NCBI)
Number of analyzed sequences: 137,641
Number of gene families/phylogenetic trees: 8,314
Previously, placing query sequences (e.g. ESTs) into orthology groups using a character-based approach required manually rebuilding the gene family tree for each new query to be classified. OrthologID overcomes this limitation by classifying query sequences using the CAOS algorithm (Sarkar et al. 2002) and the “guide tree” approach. CAOS is a rapid algorithm for determining gene orthology based on derived traits shared between orthologous genes. By incorporating the CAOS algorithm, OrthologID classifies new query sequences (full length cDNA or EST) from genomes that are not completely sequenced, based on the phylogenetic and orthology relationships that are already determined through the analysis of complete genomes.
In the guide tree/CAOS approach, a complete parsimony gene family tree generated by OrthologID to identify orthologous groups from complete genomes is used as a “guide tree” for query classification. This guide tree is fed to the CAOS algorithm for the identification of diagnostic characters that define each node. In order to place query sequences into orthology groups assembled from complete genomes, CAOS screens the query sequence for the presence of characters that are diagnostic of nodes on the guide tree, and places the query into orthology groups from complete genomes.
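The idea of screening a query against node diagnostics can be illustrated with a toy descent of a guide tree. This is a deliberately simplified sketch and not the CAOS algorithm itself: the `GuideNode` structure, the representation of diagnostics as alignment column to character state, and the rule of descending only when exactly one child matches are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class GuideNode:
    """A node of the guide tree with its diagnostic characters (hypothetical structure)."""
    name: str
    diagnostics: Dict[int, str] = field(default_factory=dict)  # column -> state
    children: List["GuideNode"] = field(default_factory=list)


def matches(node: GuideNode, query: str) -> bool:
    """True if the aligned query carries every diagnostic state of the node."""
    return all(pos < len(query) and query[pos] == state
               for pos, state in node.diagnostics.items())


def classify(root: GuideNode, query: str) -> str:
    """Descend from the root while exactly one child's diagnostics match the query."""
    node = root
    while True:
        hits = [child for child in node.children
                if child.diagnostics and matches(child, query)]
        if len(hits) != 1:
            return node.name   # stop at the deepest unambiguous node
        node = hits[0]
```

For example, with two child groups distinguished by the state at column 3, a query is placed in the group whose diagnostic state it carries, and stays at the parent node if it carries neither.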
Joanna C. Chiu, Ernest K. Lee, Mary G. Egan, Indra Neil Sarkar, Gloria M. Coruzzi, and Rob DeSalle. OrthologID: automation of genome-scale ortholog identification within a parsimony framework. Bioinformatics (2006) 22(6): 699-707. doi:10.1093/bioinformatics/btk040