Using Electronic Patient Records to Discover Disease Correlations and Stratify Patient Cohorts

Open Access

25 August 2011

journal article
research article
Published by Public Library of Science (PLoS) in PLoS Computational Biology

Vol. 7 (8), e1002141
https://doi.org/10.1371/journal.pcbi.1002141

Abstract

Electronic patient records remain a largely unexplored but potentially rich data source for discovering correlations between diseases. We describe a general approach for gathering phenotypic descriptions of patients from medical records in a systematic and non-cohort-dependent manner. By extracting phenotype information from the free text in such records, we demonstrate that we can extend the information contained in the structured record data and use it to produce fine-grained patient stratification and disease co-occurrence statistics. The approach uses a dictionary based on the International Classification of Diseases ontology and is therefore, in principle, language independent. As a use case, we show how records from a Danish psychiatric hospital lead to the identification of disease correlations, which can subsequently be mapped to systems biology frameworks.

Text mining and information extraction can be seen as the challenge of converting information hidden in text into manageable data. We have used text mining to automatically extract clinically relevant terms from 5543 psychiatric patient records and map these to disease codes in the International Classification of Diseases ontology (ICD10). Mined codes were supplemented by existing coded data. For each patient we constructed a phenotypic profile of associated ICD10 codes. This allowed us to cluster patients based on the similarity of their profiles. The result is a patient stratification based on more complete profiles than the primary diagnosis alone, which is what is typically used. Similarly, we investigated comorbidities by looking for pairs of disease codes co-occurring in patients more often than expected. Our high-ranking pairs were manually curated by a medical doctor, who flagged 93 candidates as interesting. For a number of these we were able to find genes/proteins known to be associated with the diseases using the OMIM database. The disease-associated proteins allowed us to construct protein networks suspected to be involved in each of the phenotypes. Proteins shared between two associated diseases might provide insight into the disease comorbidity.
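The two analysis steps described above, comparing patient profiles and finding code pairs that co-occur more often than expected, can be illustrated with a minimal sketch. This is not the authors' implementation: the patient identifiers and the toy data are invented for illustration, the function names `cooccurrence_ratios` and `jaccard` are hypothetical, Jaccard similarity is an assumed stand-in for whatever profile similarity measure the study used, and "expected" is assumed here to mean the pair count under independence of the two codes.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_ratios(profiles):
    """Return {(code_a, code_b): observed / expected} for each pair seen together.

    `profiles` maps a patient id to the set of ICD10 codes in that patient's
    profile. Expected counts assume the two codes occur independently
    (an assumption of this sketch, not a detail taken from the abstract).
    """
    n_patients = len(profiles)
    # Number of patients carrying each code (sets count a code once per patient).
    code_counts = Counter(code for codes in profiles.values() for code in codes)
    # Number of patients carrying both codes of a pair; sorting keeps keys stable.
    pair_counts = Counter(
        pair
        for codes in profiles.values()
        for pair in combinations(sorted(codes), 2)
    )
    return {
        (a, b): observed / (code_counts[a] * code_counts[b] / n_patients)
        for (a, b), observed in pair_counts.items()
    }


def jaccard(profile_a, profile_b):
    """Similarity of two code sets: shared codes divided by all codes."""
    union = profile_a | profile_b
    return len(profile_a & profile_b) / len(union) if union else 0.0


# Toy profiles (invented data, for illustration only).
profiles = {
    "p1": {"F20", "E11"},
    "p2": {"F20", "E11"},
    "p3": {"F32"},
    "p4": {"F32", "I10"},
}

ratios = cooccurrence_ratios(profiles)
for pair, ratio in sorted(ratios.items()):
    print(pair, round(ratio, 2))

print(jaccard(profiles["p1"], profiles["p2"]))
```

A ratio above 1 marks a pair seen together more often than independence would predict. With so few patients the ratios carry no statistical weight; a real analysis would need a significance test and correction for multiple comparisons before any pair is ranked or passed on for manual curation.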
