The steadily accelerating pace of microbial genome sequencing keeps flooding public databases with sequences of predicted proteins, only a small portion of which has ever been investigated experimentally or could be studied in detail any time soon. The only feasible way to assign functions to these proteins is to predict them through bioinformatics analysis. Established in 1997, the Clusters of Orthologous Groups of proteins (COGs) database has been a widely used tool for functional annotation. The COG database, whose entries are classified into 26 functional categories, has gone through several updates and gradually expanded its genome coverage to 62 organisms, comprising 46 bacterial, 13 archaeal and 3 eukaryotic genomes.

Creative Proteomics highly recommends COG annotation for the following reasons:

  • COGs are built from the analysis of entire microbial genomes (proteomes), which allows reliable assignment of paralogs and orthologs for most genes using a simple method based on the search for triangles of bidirectional best hits. This method enables both the recognition of distant homologs and the separation of closely related paralogs.
  • Another important factor is the family-based approach, in which the functions of the characterized members of a protein family (COG) are used to assign functions to the whole family, and to describe the possible functions when there is more than one.
  • In contrast to protein domain databases such as Pfam, SMART or CDD, most entries in the COG database are full-length proteins, which provides a distinct view of microbial protein content and its evolution.
  • Finally, both the membership of the COGs and their functional annotation are subjected to careful manual curation, which aims at assigning a biological function to each COG while avoiding annotation errors and overpredictions.
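
The search for triangles of bidirectional best hits mentioned in the first point can be illustrated with a short Python sketch. It is a simplified illustration, not the actual COG construction pipeline: the score table, the "genome|protein" identifier format and all function names are assumptions made for this example, and later steps of the real procedure (such as merging triangles that share a side) are left out.

```python
from itertools import combinations


def genome_of(protein_id):
    # Assumed ID convention for this sketch: "genome|protein".
    return protein_id.split("|")[0]


def best_hits(scores):
    """Map (query, target genome) to the highest-scoring subject in that genome.

    `scores` maps (query, subject) pairs to a similarity score (higher is better).
    Hits within the same genome are ignored; ties keep the first hit seen.
    """
    best = {}
    for (query, subject), score in scores.items():
        target = genome_of(subject)
        if target == genome_of(query):
            continue
        key = (query, target)
        if key not in best or score > best[key][1]:
            best[key] = (subject, score)
    return {key: hit for key, (hit, _) in best.items()}


def bidirectional_best_hits(best):
    """Return pairs of proteins that are each other's best hit across genomes."""
    pairs = set()
    for (query, _), subject in best.items():
        if best.get((subject, genome_of(query))) == query:
            pairs.add(frozenset((query, subject)))
    return pairs


def bbh_triangles(pairs):
    """Return sets of three proteins linked pairwise by bidirectional best hits."""
    neighbors = {}
    for pair in pairs:
        a, b = tuple(pair)
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    triangles = set()
    for a, linked in neighbors.items():
        for b, c in combinations(sorted(linked), 2):
            if c in neighbors.get(b, set()):
                triangles.add(frozenset((a, b, c)))
    return triangles


# Toy data: three genomes (A, B, C) with one orthologous protein each,
# plus a lower-scoring paralog A|p2 that must not join the triangle.
example_scores = {
    ("A|p1", "B|p1"): 200, ("B|p1", "A|p1"): 200,
    ("A|p1", "C|p1"): 180, ("C|p1", "A|p1"): 180,
    ("B|p1", "C|p1"): 190, ("C|p1", "B|p1"): 190,
    ("A|p2", "B|p1"): 90, ("B|p1", "A|p2"): 90,
    ("A|p2", "C|p1"): 80,
}

triangles = bbh_triangles(bidirectional_best_hits(best_hits(example_scores)))
for triangle in triangles:
    print(sorted(triangle))
```

In this toy input the paralog A|p2 has B|p1 as its best hit in genome B, but B|p1's best hit in genome A is A|p1, so the relationship is not bidirectional and A|p2 is kept out of the triangle. This is the sense in which requiring mutual best hits helps separate closely related paralogs from orthologs.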

