COCO-CL: Hierarchical Clustering of Homology Relations Based on Evolutionary Correlations

Raja Jothi1,*, Elena Zotenko1,2, Asba Tasneem3, and Teresa M. Przytycka1,*
1 National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
2 Boston University, Boston, MA, USA
3 Booz Allen Hamilton Inc., Rockville, MD 20852, USA
*Correspondence: jothi AT mail.nih.gov


 

Download COCO-CL 


Abstract

Motivation: Determining orthology relations among genes across multiple genomes is an important problem in the post-genomic era. Identifying orthologous genes can not only help predict functional annotations for newly sequenced or poorly characterized genomes, but can also help predict new protein-protein interactions. Unfortunately, determining orthology relations through computational methods is not straightforward due to the presence of paralogs. Traditional approaches have relied on pairwise sequence comparisons to construct graphs, which were then partitioned into putative clusters of orthologous groups. These methods do not attempt to preserve the non-transitivity and hierarchic nature of the orthology relation.

Results: We propose a new method, COCO-CL, for hierarchical clustering of orthology/homology relations, and identification of orthologous groups of genes. Unlike previous approaches, which are based on pairwise sequence comparisons, our method explores the correlation of evolutionary histories of individual genes in a more global context. COCO-CL can be used as a semi-independent method to delineate the orthology/paralogy relation for a refined set of homologous proteins obtained using a less-conservative clustering approach, or as a refiner that removes putative out-paralogs from clusters computed using a more inclusive approach. We analyze our clustering results manually, with support from literature and functional annotations. Since our orthology determination procedure does not employ a species tree to infer duplication events, it can be used in situations when the species tree is unknown or uncertain.


Data files

  • cococlOnCOGs.txt - This file contains the results from one iteration of COCO-CL on the 4,873 manually curated COGs. Each line in this file has the format: COG#, #proteins in cluster 1 (#species represented in cluster 1), #proteins in cluster 2 (#species represented in cluster 2), #species common to clusters 1 and 2, clustering bootstrap score alpha, putative duplication confidence score sigma.
  • inclusiveCOGs.txt - This file contains the COGs that COCO-CL predicts to be inclusive (that is, to contain out-paralogs). COCO-CL predicts a COG to be inclusive if and only if the clustering bootstrap score (alpha) >= 0.75 and the confidence score (or split-score) >= 0.5. There are a total of 749 COGs in this file.
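
The inclusiveness rule described for these files can be sketched in Python. This is only an illustration, not the authors' code. It assumes that each line is comma-separated, that the species counts appear in parentheses after the protein counts, and that the confidence (split) score in the rule is the sigma value in the last column. The names Split, parse_line and is_inclusive are hypothetical, and the sample line is made up for illustration rather than taken from cococlOnCOGs.txt.

```python
import re
from typing import NamedTuple

# Thresholds quoted in the description of inclusiveCOGs.txt.
ALPHA_MIN = 0.75  # clustering bootstrap score (alpha)
SIGMA_MIN = 0.5   # confidence / split score (assumed to be sigma)


class Split(NamedTuple):
    """One COCO-CL split of a COG into two clusters (hypothetical record type)."""
    cog: str
    proteins1: int
    species1: int
    proteins2: int
    species2: int
    common_species: int
    alpha: float
    sigma: float


# Matches plain integers and decimals, optionally in scientific notation.
_NUMBER = re.compile(r"\d+(?:\.\d+)?(?:[eE][-+]?\d+)?")


def parse_line(line: str) -> Split:
    """Parse one line, assuming the layout:
    COG#, n1 (s1), n2 (s2), common, alpha, sigma
    """
    cog, rest = line.strip().split(",", 1)
    nums = _NUMBER.findall(rest)
    if len(nums) != 7:
        raise ValueError(f"expected 7 numeric fields, got {len(nums)}: {line!r}")
    n1, s1, n2, s2, common = (int(x) for x in nums[:5])
    return Split(cog.strip(), n1, s1, n2, s2, common,
                 float(nums[5]), float(nums[6]))


def is_inclusive(rec: Split) -> bool:
    """Apply the stated rule: alpha >= 0.75 and sigma >= 0.5."""
    return rec.alpha >= ALPHA_MIN and rec.sigma >= SIGMA_MIN


if __name__ == "__main__":
    # Made-up example line, not real data from the file.
    rec = parse_line("COG0001, 12 (10), 8 (7), 5, 0.80, 0.60")
    print(rec.cog, is_inclusive(rec))
```

If the real file uses a different delimiter or field order, only parse_line needs to change; the threshold test in is_inclusive follows directly from the rule stated above.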

 
