Gene identification and protein classification in microbial metagenomic sequence data via incremental clustering

Open Access

10 April 2008

journal article
research article
Published by Springer Science and Business Media LLC in BMC Bioinformatics

Vol. 9 (1), 182
https://doi.org/10.1186/1471-2105-9-182

Abstract

The identification and study of proteins from metagenomic datasets can shed light on the roles and interactions of the source organisms in their communities. However, metagenomic datasets contain sequences from organisms that differ in GC composition, codon usage bias, and other properties, which makes gene identification challenging. The vast amount of sequence data also calls for faster protein family classification tools.
