Using Electronic Patient Records to Discover Disease Correlations and Stratify Patient Cohorts
Open Access
- 25 August 2011
- journal article
- research article
- Published by Public Library of Science (PLoS) in PLoS Computational Biology
- Vol. 7 (8), e1002141
- https://doi.org/10.1371/journal.pcbi.1002141
Abstract
Electronic patient records remain a largely unexplored but potentially rich data source for discovering correlations between diseases. We describe a general approach for gathering phenotypic descriptions of patients from medical records in a systematic and non-cohort-dependent manner. By extracting phenotype information from the free text in such records, we demonstrate that we can extend the information contained in the structured record data and use it to produce fine-grained patient stratification and disease co-occurrence statistics. The approach uses a dictionary based on the International Classification of Diseases ontology and is therefore, in principle, language independent. As a use case, we show how records from a Danish psychiatric hospital lead to the identification of disease correlations, which can subsequently be mapped to systems biology frameworks.

Text mining and information extraction can be seen as the challenge of converting information hidden in text into manageable data. We have used text mining to automatically extract clinically relevant terms from 5543 psychiatric patient records and map these to disease codes in the International Classification of Diseases ontology (ICD10). Mined codes were supplemented by existing coded data. For each patient we constructed a phenotypic profile of associated ICD10 codes. This allowed us to cluster patients based on the similarity of their profiles. The result is a patient stratification based on more complete profiles than the primary diagnosis alone, which is what is typically used. Similarly, we investigated comorbidities by looking for pairs of disease codes co-occurring in patients more often than expected. Our high-ranking pairs were manually curated by a medical doctor, who flagged 93 candidates as interesting. For a number of these we were able to find genes/proteins known to be associated with the diseases using the OMIM database. The disease-associated proteins allowed us to construct protein networks suspected to be involved in each of the phenotypes. Proteins shared between two associated diseases might provide insight into the disease comorbidity.
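
The two profile-based steps described above (comparing patients by the similarity of their ICD10 profiles, and finding code pairs that co-occur more often than expected) can be illustrated with a minimal sketch. The abstract does not state which similarity measure or expectation model was used, so the cosine similarity over binary code sets and the independence-based expected count below are assumptions chosen for illustration. The patient IDs and profiles are made-up toy data, not records from the study.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

# Toy data (hypothetical): patient ID -> set of ICD10 codes in the profile.
profiles = {
    "p1": {"F20", "F10", "E11"},
    "p2": {"F20", "F10"},
    "p3": {"F32", "E11"},
    "p4": {"F20", "F10", "F32"},
    "p5": {"F32"},
}


def cosine_similarity(a, b):
    """Cosine similarity between two profiles treated as binary code vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


def cooccurrence_ratios(profiles):
    """Observed/expected ratio for every code pair seen in at least one patient.

    Assumption: the expected count is what independence of the two codes
    would predict, i.e. count(a) * count(b) / number_of_patients.
    """
    n = len(profiles)
    code_counts = Counter(c for codes in profiles.values() for c in codes)
    pair_counts = Counter(
        pair
        for codes in profiles.values()
        for pair in combinations(sorted(codes), 2)
    )
    return {
        (a, b): observed / (code_counts[a] * code_counts[b] / n)
        for (a, b), observed in pair_counts.items()
    }


ratios = cooccurrence_ratios(profiles)
# Rank pairs so the most over-represented combinations come first.
ranked = sorted(ratios.items(), key=lambda item: item[1], reverse=True)
```

In this sketch, a ratio above 1 means the pair appears together more often than independence would predict. On real data a ranking like this would still need a significance test and manual review, in line with the curation step the abstract describes.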