Growth of novel protein structural data

27 February 2007

journal article
Published by Proceedings of the National Academy of Sciences in Proceedings of the National Academy of Sciences of the United States of America

Vol. 104 (9), 3183-3188
https://doi.org/10.1073/pnas.0611678104

Abstract

Contrary to popular assumption, the rate of growth of structural data has slowed, and the Protein Data Bank (PDB) has not been growing exponentially since 1995. Reaching such a dramatic conclusion requires careful measurement of growth of novel structures, which can be achieved by clustering entry sequences, or by using a novel index to down-weight entries with a higher number of sequence neighbors. These measures agree, and growth rates are very similar for entire PDB files, clusters, and weighted chains. The overall sizes of Structural Classification of Proteins (SCOP) categories (number of families, superfamilies, and folds) appear to be directly proportional to the number of deposited PDB files. Using our weighted chain count, which is most correlated to the change in the size of each SCOP category in any time period, shows that the rate of increase of SCOP categories is actually slowing down. This enables the final size of each of these SCOP categories to be predicted without examining or comparing protein structures. In the last 3 years, structures solved by structural genomics (SG) initiatives, especially the United States National Institutes of Health Protein Structure Initiative, have begun to redress the slowing growth of the PDB. Structures solved by SG are 3.8 times less sequence-redundant than typical PDB structures. Since mid-2004, SG programs have contributed half the novel structures measured by weighted chain counts. Our analysis does not rely on visual inspection of coordinate sets: it is done automatically, providing an accurate, up-to-date measure of the growth of novel protein structural data.
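
The idea of down-weighting entries that have many sequence neighbors can be illustrated with a small sketch. The abstract does not give the index itself, so the weighting used here, 1 / (1 + n) for a chain with n sequence neighbors, is an assumption made purely for illustration, as are the function name and the toy data. Under this assumed weighting, a group of k mutually similar chains contributes a total of 1, which is one way a weighted count could agree with a cluster count.

```python
from fractions import Fraction


def weighted_chain_count(neighbor_counts):
    """Sum per-chain weights for a list of sequence-neighbor counts.

    ASSUMPTION: each chain with n sequence neighbors gets weight
    1 / (1 + n). This is an illustrative choice, not necessarily the
    index defined in the paper.
    """
    # Fractions keep the arithmetic exact (no floating-point drift).
    return sum(Fraction(1, 1 + n) for n in neighbor_counts)


# Hypothetical toy data: ten near-identical chains (each has nine
# neighbors) plus three chains with no sequence neighbors at all.
redundant_chains = [9] * 10
novel_chains = [0, 0, 0]

raw_count = len(redundant_chains) + len(novel_chains)
weighted = weighted_chain_count(redundant_chains + novel_chains)

print(f"raw chain count:      {raw_count}")
print(f"weighted chain count: {weighted}")
```

In this toy case the thirteen raw chains reduce to a weighted count of four: the ten redundant chains together count as one, and each of the three novel chains counts in full.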


Cited by 132 articles