Frequently Asked Questions

How do I cite PHOG?

If you use PHOG, please cite:

What is a PHOG?

We use the term PHOG to describe a group of orthologs defined using phylogenomic analysis. We use the phylogenetic trees in the PhyloFacts resource (Krishnamurthy et al., Genome Biol. 2006) to identify subtrees of orthologous sequences. We provide different types of orthology prediction, targeting different evolutionary distances and precisions. Super-orthology is the most restrictive definition: it requires that all nodes on the tree path linking super-orthologs correspond to speciation events (Zmasek and Eddy, BMC Bioinformatics, 2002). We also provide a tree-distance thresholded version of PHOG orthology detection (termed PHOG-T), which allows subtrees of super-orthologs to expand to include nodes representing possible duplication events. The PHOG webserver provides three thresholds, corresponding to close (e.g., human-mouse), moderate (e.g., human-zebrafish) and distant (e.g., human-fruit fly) taxonomic relationships. Note that these thresholded variants do not restrict an orthology group to that distance; we include all orthologs defined within a subtree that meets either the super-orthology criterion or the specified tree-distance threshold.

Details of the algorithm are available here.

What does it mean when I'm told that my accession or ID is not recognized?

An identifier or accession can go unrecognized for a number of reasons. First, you may have input an ID or accession for a nucleotide sequence, whereas we only handle protein sequences. Second, the ID or accession may be from a database that we do not currently handle. Finally, we do not recognize IDs or accessions that have been retired by their database of origin, e.g., retired GenBank IDs.

If this happens, you may want to try putting your protein sequence in FASTA format in the FASTA input box. In this case the PHOG server will run BLAST to find sequences in our database that are highly similar to your sequence. Once the PHOG server comes back with the table of BLAST results, find your sequence of interest in the table and click on the Orthologs link to view the orthologs of that sequence.

Please see the FAQ section "What kinds of inputs are accepted?" (below) for a list of accepted inputs and examples.

What kinds of inputs are accepted?

Allowed inputs -- for protein sequences only! -- include:

Disallowed inputs:

What is meant by “FASTA format”?

FASTA is a standard format for representing sequences. The first line is normally a header that describes the sequence. If present, the header must begin with a greater-than character: “>”. The sequence then follows on as many lines as necessary. For example, the human oxytocin receptor protein sequence from the UniProt database in FASTA format is:

>sp|P30559|OXYR_HUMAN Oxytocin receptor OS=Homo sapiens GN=OXTR PE=2 SV=2
MEGALAANWSAEAANASAAPPGAEGNRTAGPPRRNEALARVEVAVLCLILLLALSGNACV
LLALRTTRQKHSRLFFFMKHLSIADLVVAVFQVLPQLLWDITFRFYGPDLLCRLVKYLQV
VGMFASTYLLLLMSLDRCLAICQPLRSLRRRTDRLAVLATWLGCLVASAPQVHIFSLREV
ADGVFDCWAVFIQPWGPKAYITWITLAVYIVPVIVLAACYGLISFKIWQNLRLKTAAAAA
AEAPEGAAAGDGGRVALARVSSVKLISKAKIRTVKMTFIIVLAFIVCWTPFFFVQMWSVW
DANAPKEASAFIIVMLLASLNSCCNPWIYMLFTGHLFHELVQRFLCCSASYLKGRRLGET
SASKKSNSSSFVLSHRSSSQRSCSQPSTA
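
The format described above is simple enough to parse in a few lines. The following is a minimal Python sketch for illustration only; it is not the parser used by the PHOG server, and the function name parse_fasta is our own. It treats the header as optional, as described above, and joins the sequence lines into one string. The example input uses only the first two sequence lines of the oxytocin receptor record for brevity.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text.

    The header is returned without the leading ">" and is None when the
    input contains a bare sequence with no header line.
    """
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None or chunks:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None or chunks:
        records.append((header, "".join(chunks)))
    return records


# Truncated example: header plus the first two sequence lines only.
example = """>sp|P30559|OXYR_HUMAN Oxytocin receptor OS=Homo sapiens GN=OXTR PE=2 SV=2
MEGALAANWSAEAANASAAPPGAEGNRTAGPPRRNEALARVEVAVLCLILLLALSGNACV
LLALRTTRQKHSRLFFFMKHLSIADLVVAVFQVLPQLLWDITFRFYGPDLLCRLVKYLQV
"""

for header, sequence in parse_fasta(example):
    print(header.split()[0], len(sequence))
```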

Why do you have so many restrictions on inputs?

Allowing additional input types means greater complexity in our database and a slower response time for the server. We are working on expanding the allowed inputs, but it will take time.

How were the threshold values for evolutionary distance determined in the thresholded PHOG search?

The threshold values available through our pull-down menu have been selected for their performance in detecting orthologs at particular evolutionary distances.

What does it mean if I submit an accession, ID or protein sequence and I'm told that my sequence is not in any PhyloFacts Orthology Group?

This means one of two things. Either your sequence is not in the PhyloFacts resource, or it is not in an orthology group that meets the PHOG requirements. We are continually expanding PhyloFacts, so your sequence may become available in the future.

What differentiates PHOG orthology prediction from other methods?

Since orthology is a phylogenetic term, phylogenetic analysis is the most rigorous approach for identifying orthologs. However, the computational complexity of phylogenetic analysis drives most orthology prediction methods to rely on BLAST (e.g., InParanoid, OrthoMCL, COGs and KEGG); many of these methods are also restricted to analysis of fully sequenced genomes (e.g., InParanoid).

The ortholog prediction method most similar to PHOG is TreeFam, which also uses phylogenetic tree analysis to predict orthologs. There are four main differences between PHOG and TreeFam: (1) TreeFam is restricted to animals (non-animal genes are included as outgroup sequences only), while PHOG's taxonomic range is unrestricted. This enables PHOG to include more distantly related orthologs for some gene families. (2) TreeFam uses species-tree/gene-tree reconciliation; by contrast, PHOG does not perform tree reconciliation. This enables PHOG to be used in cases where a species phylogeny is unknown or poorly resolved, but may also result in lower accuracy in some cases. (3) TreeFam has a manually curated section (TreeFam-A), while PHOG has no manually curated section. (4) Finally, PHOG allows users to select a target precision or evolutionary distance to suit their particular preferences. For instance, PHOG-S provides super-orthology prediction and is highly specific, but with lower recall than other methods (e.g., InParanoid and OrthoMCL). As super-orthology is a more restrictive relationship than orthology, PHOG-S may not include some of the orthologs selected by TreeFam, but those orthologs included by PHOG-S are less likely to have diverged functionally. By contrast, the thresholded versions of PHOG provide much higher recall without a significant decrease in precision.

For further information please see the FAQ section on the comparison of PHOG with other methods.

What is the difference between an ortholog and a super-ortholog?

Two genes are orthologs if their most recent common ancestor represents a speciation event. Super-orthology is a more restrictive relationship, defined by Zmasek and Eddy (BMC Bioinformatics, 2002): for two genes to be super-orthologs, the path in the phylogenetic tree joining the two must pass only through nodes representing speciation. If two genes are super-orthologs, they are more likely to share a common function than if they are orthologs but not super-orthologs, since gene duplication events are disallowed on the evolutionary path between the two genes.
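
The two definitions can be made concrete on a toy gene tree. The Python sketch below is illustrative only and is not the PHOG implementation: the Node class, the function names, the gene names and the tree itself are all invented for this example, and each internal node is assumed to be already labeled as a speciation or a duplication. Two genes are reported as orthologs when their most recent common ancestor is a speciation node, and as super-orthologs when every internal node on the path joining them is a speciation node.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: str
    event: Optional[str] = None  # "speciation" or "duplication"; None for leaves
    children: List["Node"] = field(default_factory=list)


def path_to(root, leaf_name):
    """Return the nodes from root down to the named leaf, or None if absent."""
    if not root.children:
        return [root] if root.name == leaf_name else None
    for child in root.children:
        sub = path_to(child, leaf_name)
        if sub is not None:
            return [root] + sub
    return None


def connecting_nodes(root, a, b):
    """Return (MRCA, internal nodes on the path joining distinct leaves a and b)."""
    pa, pb = path_to(root, a), path_to(root, b)
    i = 0
    while i < min(len(pa), len(pb)) and pa[i] is pb[i]:
        i += 1
    mrca = pa[i - 1]
    # Internal nodes: the MRCA plus everything below it, excluding the leaves.
    return mrca, [mrca] + pa[i:-1] + pb[i:-1]


def is_ortholog(root, a, b):
    mrca, _ = connecting_nodes(root, a, b)
    return mrca.event == "speciation"


def is_super_ortholog(root, a, b):
    _, nodes = connecting_nodes(root, a, b)
    return all(n.event == "speciation" for n in nodes)


# Hypothetical tree: a duplication before the human-mouse split gives A and B copies.
tree = Node("root", "speciation", [
    Node("dup", "duplication", [
        Node("spec_A", "speciation", [Node("human_A"), Node("mouse_A")]),
        Node("spec_B", "speciation", [Node("human_B"), Node("mouse_B")]),
    ]),
    Node("fly_X"),
])

for a, b in [("human_A", "mouse_A"), ("human_A", "fly_X"), ("human_A", "mouse_B")]:
    print(a, b, is_ortholog(tree, a, b), is_super_ortholog(tree, a, b))
```

In this toy tree, human_A and mouse_A are super-orthologs; human_A and fly_X are orthologs but not super-orthologs, because the path between them passes through the duplication node; and human_A and mouse_B are not orthologs at all, because their most recent common ancestor is the duplication.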

Why do I see multiple sequences from the same species in a super-orthology group?

We allow sets of sequences from a single species that are grouped together in a subtree to be included in a super-orthology group. Some of these subtrees may represent true inparalogs (sequences duplicated within an extant species), i.e., co-orthologs to other sequences in a PHOG. Other subtrees may represent different alleles or splice variants of the same gene. A large fraction represent duplicate entries for the same protein (potentially submitted by different sequencing centers or groups). For instance, as of 12/16/2008, the UniProt resource (which we use as a primary source of sequences in PhyloFacts protein families) contains over 85000 unique identifiers of human proteins, whereas the accepted number of unique human genes is under 25000. Other species can also have a mismatch between the number of actual genes and the number of proteins in the UniProt resource. To compensate for this, we label subtrees that contain sequences from one species only as putative inparalogs or representatives of the same gene, and allow these to be included in a PhyloFacts Orthology Group (PHOG).
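
As a rough illustration of the labeling step described above, the following Python sketch finds the maximal subtrees of a toy tree whose leaves all come from one species. It is a simplified sketch under our own assumptions, not the PHOG code: the tree is represented as nested lists, leaves are (identifier, species) tuples, and the identifiers are made up.

```python
def leaf_species(tree):
    """Return the set of species found at the leaves of a tree.

    Leaves are (identifier, species) tuples; internal nodes are lists.
    """
    if isinstance(tree, tuple):
        return {tree[1]}
    species = set()
    for child in tree:
        species |= leaf_species(child)
    return species


def single_species_subtrees(tree):
    """Return the maximal internal subtrees whose leaves share one species."""
    if isinstance(tree, tuple):
        return []  # a single leaf is not a subtree of interest
    if len(leaf_species(tree)) == 1:
        return [tree]
    found = []
    for child in tree:
        found.extend(single_species_subtrees(child))
    return found


# Hypothetical tree: two human entries grouped together, plus one mouse entry.
toy_tree = [[("h1", "human"), ("h2", "human")], ("m1", "mouse")]

for subtree in single_species_subtrees(toy_tree):
    print("putative inparalogs or same gene:", [leaf[0] for leaf in subtree])
```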

Why is it that a sequence can belong to so many orthology groups, with different species showing up in each?

There are two main reasons for this. First, PhyloFacts orthology groups can be based on individual domains in a protein as well as on entire domain architectures. Alignments that are restricted to subregions of a protein are called "local", and alignments that extend along the entire length of the sequences (disallowing any significant insertions or deletions) are called "global". Sequences that are included in a global alignment will also be included in a restricted local alignment, but the reverse is not always true. Second, we have constructed protein family phylogenies starting with many different sequences. Although we try to minimize redundancy, the extended nature of many protein families means that some sequences are found in different trees. To compensate for this, we include for each putative ortholog a link to the most informative PhyloFacts orthology group (PHOG) containing it and the query (and potentially many other sequences). If many PHOGs exist that include both the query and the putative ortholog, we prioritize those derived from global alignments, and then select the one that has the largest number of sequences.
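
The selection rule in the last sentence (global alignments first, then the largest number of sequences) can be sketched in Python as follows. This is an illustration under our own assumptions rather than the server's code: the Phog record, its field names and the identifiers are hypothetical.

```python
from typing import List, NamedTuple


class Phog(NamedTuple):
    phog_id: str        # hypothetical identifier
    alignment: str      # "global" or "local"
    num_sequences: int  # number of sequences in the group


def most_informative(candidates: List[Phog]) -> Phog:
    """Prefer PHOGs from global alignments, then the one with the most sequences."""
    return max(candidates, key=lambda p: (p.alignment == "global", p.num_sequences))


# Made-up candidates that all contain both the query and the putative ortholog.
candidates = [
    Phog("phog_local_big", "local", 40),
    Phog("phog_global_small", "global", 8),
    Phog("phog_global_large", "global", 12),
]

print(most_informative(candidates).phog_id)
```

The sort key is a tuple, so the global/local comparison is decided first and the sequence count only breaks ties among groups of the same alignment type.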

How does PHOG compare at ortholog identification relative to other methods?

There is no standard benchmark dataset for evaluating orthology prediction (presumably because the evolutionary history can only be predicted but is not known). Assessment of orthology predictions based on consistency of functional annotation (e.g., agreement with GO functions) is problematic because of circular reasoning (most functional annotations, including GO, are based on homology-based annotation).

To assess the expected accuracy of PHOG, we used a set of 100 human sequences and their predicted orthologs in mouse, zebrafish and fruit fly from the manually curated TreeFam-A dataset as a gold standard, and evaluated the orthology predictions of InParanoid, OrthoMCL, and different PHOG variants against it. Results show that PHOG-T(Moderate) has the best performance overall, finding 74% of TreeFam-A orthologs at 86% precision. By contrast, OrthoMCL finds 76% of the TreeFam-A orthologs at a precision of 66%, and InParanoid finds 87% of TreeFam-A orthologs but at only 24% precision.

Method             True Positives   False Positives   False Negatives   Recall   Precision
InParanoid         273              870               40                0.87     0.24
OrthoMCL           237              122               76                0.76     0.66
SCI-PHY            246              100               67                0.79     0.71
PHOG-S             185              11                128               0.59     0.94
PHOG-T(Close)      202              20                111               0.65     0.91
PHOG-T(Moderate)   232              38                81                0.74     0.86
PHOG-T(Distant)    274              175               39                0.88     0.61
Results comparing different variants of the PHOG algorithm, SCI-PHY, OrthoMCL and InParanoid against a dataset of 100 human proteins and orthologs identified in the TreeFam-A resource.

For reference, recall and precision are defined as follows:

Recall = TP / (TP + FN)

Precision = TP / (TP + FP)

where a True Positive (TP) is an orthology pair included in TreeFam-A that is also predicted by a method, a False Negative (FN) is an orthology pair included in TreeFam-A that is not predicted by a method (i.e., it is missed by the method), and a False Positive (FP) is an orthology pair predicted by a method that is not included in TreeFam-A.
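
As a worked check, the recall and precision columns can be recomputed from the true positive, false positive and false negative counts reported in the table. The short Python sketch below does this; the function name score is our own, and the counts are copied from the table.

```python
def score(tp, fp, fn):
    """Return (recall, precision) for the given counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return recall, precision


# (true positives, false positives, false negatives), copied from the table.
counts = {
    "InParanoid": (273, 870, 40),
    "OrthoMCL": (237, 122, 76),
    "SCI-PHY": (246, 100, 67),
    "PHOG-S": (185, 11, 128),
    "PHOG-T(Close)": (202, 20, 111),
    "PHOG-T(Moderate)": (232, 38, 81),
    "PHOG-T(Distant)": (274, 175, 39),
}

for method, (tp, fp, fn) in counts.items():
    recall, precision = score(tp, fp, fn)
    print(f"{method:18s} recall={recall:.2f} precision={precision:.2f}")
```

For every method, TP + FN equals 313, the number of TreeFam-A orthology pairs in the benchmark, so recall is simply the fraction of those 313 pairs that the method recovers.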

Full details are available here.

Funding for PhyloFacts is provided by the National Science Foundation and by the National Institutes of Health.