Early Detection of Adverse Drug Reactions in Social Health Networks: A Natural Language Processing Pipeline for Signal Detection

Abstract
Journal of Medical Internet Research. Unreviewed preprint: this document has not been peer-reviewed and is likely to undergo changes before final publication, if accepted.
Background: Adverse drug reactions (ADRs) occur in nearly all patients on chemotherapy, causing morbidity and therapy disruptions. Detection of such ADRs is limited in clinical trials, which are underpowered to detect rare events. Early recognition of ADRs in the post-marketing phase could substantially reduce morbidity and decrease societal costs. Internet community health forums provide a mechanism for individuals to discuss real-time health concerns and can enable computational detection of ADRs.
Objective: To identify cutaneous ADR signals in social health networks and compare the frequency and timing of these ADRs to clinical reports in the literature.
Methods: We present a natural language processing (NLP)-based ADR signal generation pipeline that uses patient posts on internet social health networks. We identify user posts from the Inspire health forum related to two chemotherapy classes: erlotinib, an epidermal growth factor receptor inhibitor, and nivolumab and pembrolizumab, immune checkpoint inhibitors. We extract mentions of ADRs from the unstructured content of patient posts. We then perform population-level association analyses and time-to-detection analyses.
Results: Our system detected cutaneous ADRs from patient reports with high precision (0.90) and at frequencies comparable to those documented in the literature, but an average of 7 months ahead of their literature reporting. Known ADRs were associated with higher proportional reporting ratios than negative controls, demonstrating the robustness of our analyses. Our named entity recognition system achieved a micro-averaged F-measure of 0.738 in detecting ADR entities (not limited to cutaneous ADRs) in health forum posts. Additionally, we discovered the novel ADR of hypohidrosis, reported by 23 patients in erlotinib-related posts; this ADR was absent from 15 years of literature on this medication, and we recently reported the finding in a clinical oncology journal.
Conclusions: Several hundred million patients report health concerns in social health networks, yet this information is markedly underutilized for pharmacosurveillance. We demonstrate the ability of an NLP-based signal generation pipeline to accurately detect patient reports of ADRs months in advance of literature reporting, and the robustness of statistical analyses to validate system detections. Our findings suggest that social health network data can make important contributions to more comprehensive and timely pharmacovigilance.
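
For readers unfamiliar with the two metrics named in the Results, the following minimal Python sketch shows how a proportional reporting ratio (PRR) and a micro-averaged F-measure are conventionally computed. It uses the standard textbook definitions and is not taken from the pipeline described here; the function names, the 2x2 table layout, and the per-entity count format are assumptions made for illustration, and the abstract does not state which comparator population or entity types were used.

```python
def proportional_reporting_ratio(a, b, c, d):
    """Standard PRR from a 2x2 contingency table of posts (assumed layout).

    a: posts mentioning the drug of interest AND the reaction
    b: posts mentioning the drug of interest WITHOUT the reaction
    c: posts mentioning comparator drugs AND the reaction
    d: posts mentioning comparator drugs WITHOUT the reaction
    """
    if a + b == 0 or c + d == 0:
        raise ValueError("each drug group needs at least one post")
    if c == 0:
        raise ValueError("PRR is undefined when no comparator post has the reaction")
    rate_drug = a / (a + b)        # reaction rate among posts about the drug
    rate_other = c / (c + d)       # reaction rate among comparator posts
    return rate_drug / rate_other


def micro_f_measure(counts):
    """Micro-averaged F-measure over entity types.

    counts maps an entity type to a tuple
    (true_positives, false_positives, false_negatives).
    Micro-averaging pools the counts across types before computing
    precision and recall, so frequent types weigh more.
    """
    tp = sum(t for t, _, _ in counts.values())
    fp = sum(f for _, f, _ in counts.values())
    fn = sum(n for _, _, n in counts.values())
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical counts, chosen only to exercise the functions.
    print(proportional_reporting_ratio(20, 80, 10, 190))
    print(micro_f_measure({"rash": (8, 2, 2), "dry skin": (2, 0, 2)}))
```

A PRR above 1 indicates that the reaction is mentioned proportionally more often with the drug of interest than with the comparators, which is the sense in which known ADRs are expected to score higher than negative controls.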