Database Citation in Full Text Biomedical Articles
Open Access
- 29 May 2013
- journal article
- research article
- Published by Public Library of Science (PLoS) in PLOS ONE
- Vol. 8 (5), e63184
- https://doi.org/10.1371/journal.pone.0063184
Abstract
Molecular biology and literature databases represent essential infrastructure for life science research. Effective integration of these data resources requires that there are structured cross-references at the level of individual articles and biological records. Here, we describe the current patterns of how database entries are cited in research articles, based on analysis of the full text Open Access articles available from Europe PMC. Focusing on citation of entries in the European Nucleotide Archive (ENA), UniProt and Protein Data Bank, Europe (PDBe), we demonstrate that text mining doubles the number of structured annotations of database record citations supplied in journal articles by publishers. Many thousands of new literature-database relationships are found by text mining, since these relationships are also not present in the set of articles cited by database records. We recommend that structured annotation of database records in articles is extended to other databases, such as ArrayExpress and Pfam, entries from which are also cited widely in the literature. The very high precision and high throughput of this text-mining pipeline make this activity possible both accurately and at low cost, which will allow the development of new integrated data services.