Results in Journal of Biomedical Informatics: 3,027

(searched for: journal_id:(843427))
Rezarta Islamaj, Chih-Hsuan Wei, David Cissel, Nicholas Miliaras, Olga Printseva, Oleg Rodionov, Keiko Sekiya, Janice Ward,
Journal of Biomedical Informatics, Volume 118; doi:10.1016/j.jbi.2021.103779

The automatic recognition of gene names and their corresponding database identifiers in biomedical text is an important first step for many downstream text-mining applications. While current methods for tagging gene entities have been developed for biomedical literature, their performance on species other than human is substantially lower due to the lack of annotation data. We therefore present the NLM-Gene corpus, a high-quality manually annotated corpus for genes developed at the US National Library of Medicine (NLM), covering ambiguous gene names, with an average of 29 gene mentions (10 unique identifiers) per document, and a broader representation of different species (including Homo sapiens, Mus musculus, Rattus norvegicus, Drosophila melanogaster, Arabidopsis thaliana, Danio rerio, etc.) compared to previous gene annotation corpora. NLM-Gene consists of 550 PubMed abstracts from 156 biomedical journals, doubly annotated by six experienced NLM indexers, randomly paired for each document to control for bias. The annotators worked in three annotation rounds until they reached complete agreement. This gold-standard corpus can serve as a benchmark for developing and testing new gene text-mining algorithms. Using this new resource, we have developed a new gene finding algorithm based on deep learning which improved on both precision and recall relative to existing tools. The NLM-Gene annotated corpus is freely available. We have also applied this tool to the entire PubMed/PMC, with the results freely accessible through our web-based tool PubTator.
Journal of Biomedical Informatics, Volume 118; doi:10.1016/j.jbi.2021.103781

To differentiate between conditions of health and disease, current pathway enrichment analysis methods detect the differential expression of distinct biological pathways. System-level model-driven approaches, however, are lacking. Here we present a new methodology that uses a dynamic model to suggest a unified subsystem to better differentiate between diseased and healthy conditions. Our methodology includes the following steps: 1) detection of connections between relevant differentially expressed pathways; 2) construction of a unified in silico model, a stochastic Petri net model that links these distinct pathways; 3) model execution to predict subsystem activation; and 4) enrichment analysis of the predicted subsystem. We apply our approach to the TGF-beta regulation of the autophagy system implicated in autism. Our model was constructed manually, based on the literature, to predict, using model simulation, the TGF-beta-to-autophagy active subsystem and downstream gene expression changes associated with TGF-beta, which go beyond the individual findings derived from literature. We evaluated the in silico predicted subsystem and found it to be co-expressed in the normative whole blood human gene expression data. Finally, we show our subsystem’s gene set to be significantly differentially expressed in two independent datasets of blood samples of ASD (autistic spectrum disorders) individuals as opposed to controls. Our study demonstrates that dynamic pathway unification can define a new refined subsystem that can significantly differentiate between disease conditions.
, Thakir M. Mohsin, Dhiya Al-Jumeily, Mohamed Alloghani
Journal of Biomedical Informatics, Volume 118, pp 103766-103766; doi:10.1016/j.jbi.2021.103766

Iraq is among the countries affected by the COVID-19 pandemic. As of 2 August 2020, 129,151 COVID-19 cases were confirmed, including 91,949 recovered cases and 4,867 deaths. After the announcement of lockdown in early April 2020, the situation in Iraq was steady until late May 2020, when daily COVID-19 infections rose suddenly due to the gradual easing of lockdown restrictions. In this context, it is important to develop a forecasting model to evaluate the COVID-19 outbreak in Iraq and so to guide future health policy. COVID-19 lag data were made available by the University of Anbar through their online analytical platform, engaged with the day-to-day figures from the Iraqi health authorities. 154 days of patient data were provided covering the period from 2 March 2020 to 2 August 2020. An ensemble of feed-forward neural networks was adopted to forecast the COVID-19 outbreak in Iraq. This study also highlights some key questions about the pandemic using data analytics. Forecasts were achieved with an accuracy of 87.6% for daily infections, 82.4% for daily recovered cases, and 84.3% for daily deaths. It is anticipated that COVID-19 infections in Iraq will reach about 308,996 cases by the end of September 2020, including 228,551 recoveries and 9,477 deaths. The application of artificial neural networks supported by advanced data analytics represents a promising means of realising intelligent solutions, enabling analytical operations to drive a national health policy to contain the COVID-19 pandemic.
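As a rough illustration of the ensemble idea described above, the sketch below averages the forecasts of several hypothetical trained feed-forward networks and scores the result with one plausible reading of the reported "accuracy", 100 × (1 − MAPE). All figures are illustrative, not the paper's data.

```python
import numpy as np

# Hypothetical daily-infection forecasts from an ensemble of five
# independently trained feed-forward networks (values are illustrative).
member_forecasts = np.array([
    [2100, 2180, 2250],   # model 1: next three days
    [2050, 2150, 2300],
    [2120, 2200, 2280],
    [2080, 2160, 2240],
    [2150, 2210, 2260],
])

# Ensemble prediction: simple average across members.
ensemble = member_forecasts.mean(axis=0)

# Accuracy as 100 * (1 - mean absolute percentage error), one plausible
# reading of the abstract's reported accuracy figures.
actual = np.array([2000, 2250, 2400])
mape = np.mean(np.abs(ensemble - actual) / actual)
accuracy = 100 * (1 - mape)
```

Averaging is only the simplest combination rule; the paper does not specify how its ensemble members are combined, so this is an assumption.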
Edward J. Schenck, Katherine L. Hoffman, Marika Cusick, Joseph Kabariti, Evan T. Sholle, Thomas R. Campion
Journal of Biomedical Informatics, Volume 118; doi:10.1016/j.jbi.2021.103789

Patients treated in an intensive care unit (ICU) are critically ill and require life-sustaining organ failure support. Existing critical care data resources are limited to a select number of institutions, contain only ICU data, and do not enable the study of local changes in care patterns. To address these limitations, we developed the Critical carE Database for Advanced Research (CEDAR), a method for automating extraction and transformation of data from an electronic health record (EHR) system. Compared to an existing gold standard of manually collected data at our institution, CEDAR was statistically similar in most measures, including patient demographics and sepsis-related organ failure assessment (SOFA) scores. Additionally, CEDAR automated data extraction obviated the need for manual collection of 550 variables. Critically, during the spring 2020 COVID-19 surge in New York City, a modified version of CEDAR supported pandemic response efforts, including clinical operations and research. Other academic medical centers may find value in using the CEDAR method to automate data extraction from EHR systems to support ICU activities.
Journal of Biomedical Informatics, Volume 118; doi:10.1016/j.jbi.2021.103778

Leveraging Electronic Health Record (EHR) longitudinal data to produce actionable clinical insights has been a critical aim of recent studies. Non-forecasted extended hospitalizations account for a disproportionate amount of resource use, mediocre quality of inpatient care, and avoidable fatalities. The ability to predict Length of Stay (LoS) and mortality in the early stages of an admission provides opportunities to improve care and prevent many avoidable losses. Forecasting in-hospital mortality is important in giving clinicians enough insight to make decisions and hospitals to allocate resources, hence predicting LoS and mortality within the first day of admission is a difficult but paramount endeavor. The biggest challenge is that few data are available by this time, so the prediction has to bring in the previous admission history and the free-text diagnosis recorded immediately on admission. We propose a model that uses the multi-modal EHR structured medical codes and key demographic information to classify LoS into 3 classes, Short LoS (LoS ≤ 10 days), Medium LoS (10 < LoS ≤ 30 days), and Long LoS (LoS > 30 days), as well as mortality as a binary classification of a patient’s death during the current admission. The prediction has to use data available only within 24 h of admission. The key predictors include previous ICD9 diagnosis codes, ICD9 procedures, key demographic data, and the free-text diagnosis of the current admission recorded right on admission. We propose a Hierarchical Attention Network (HAN-LoS and HAN-Mor) model and train it on a dataset of over 45,321 admissions recorded in the de-identified MIMIC-III dataset. For improved prediction, our attention mechanisms can focus on the most influential past admissions and the most influential codes within those admissions. For fair performance evaluation, we implemented and compared the HAN model with previous approaches.
With dataset balancing techniques, HAN-LoS achieved an AUROC of over 0.82 and a Micro-F1 score of 0.24, and HAN-Mor achieved an AUROC of 0.87, hence outperforming the existing baselines that use structured medical codes as well as clinical time series for LoS and mortality forecasting. By predicting mortality and LoS with the same model, we show that with little tuning the proposed model can be used for other clinical predictive tasks such as phenotyping, decompensation, re-admission prediction, and survival analysis.
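The two-level attention described above, first over the codes within each past admission and then over the admission representations themselves, can be sketched as below. The dimensions, random embeddings and dot-product scoring are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

def attention_pool(item_vectors, query):
    """Attention pooling: softmax-weighted sum of item vectors,
    scored against a (learned) query vector."""
    scores = item_vectors @ query                  # one score per item
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                       # softmax over items
    return weights @ item_vectors, weights         # pooled vector, weights

rng = np.random.default_rng(42)
d = 8
# Lower level: pool the code embeddings within each of 3 past admissions
# (5, 3 and 7 codes respectively; embeddings are random stand-ins).
admissions = [rng.normal(size=(n_codes, d)) for n_codes in (5, 3, 7)]
code_query = rng.normal(size=d)
adm_vectors = np.stack([attention_pool(a, code_query)[0] for a in admissions])

# Upper level: pool the admission vectors into one patient representation,
# which a classifier head would then map to LoS classes or mortality.
adm_query = rng.normal(size=d)
patient_vec, adm_weights = attention_pool(adm_vectors, adm_query)
```

In a trained model the query vectors are learned parameters, and `adm_weights` is what lets the network "focus on the most influential past admissions".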
, Lawrence J. Babb, Casey Overby Taylor, Luke V. Rasmussen, Robert R. Freimuth, Eric Venner, Fei Yan, Victoria Yi, Stephen J. Granite, Hana Zouk, et al.
Journal of Biomedical Informatics, Volume 118; doi:10.1016/j.jbi.2021.103795

Structured representation of clinical genetic results is necessary for advancing precision medicine. The Electronic Medical Records and Genomics (eMERGE) Network’s Phase III program initially used a commercially developed XML message format for standardized and structured representation of genetic results for electronic health record (EHR) integration. In a desire to move towards a standard representation, the network created a new standardized format based upon Health Level Seven Fast Healthcare Interoperability Resources (HL7® FHIR®), to represent clinical genomics results. These new standards improve the utility of HL7® FHIR® as an international healthcare interoperability standard for management of genetic data from patients. This work advances the establishment of standards that are being designed for broad adoption in the current health information technology landscape.
Cong Sun, , , Yin Zhang, Hongfei Lin, Jian Wang
Journal of Biomedical Informatics, Volume 118; doi:10.1016/j.jbi.2021.103799

Recognition of biomedical entities from literature is a challenging research focus, which is the foundation for extracting a large amount of biomedical knowledge existing in unstructured texts into structured formats. Using the sequence labeling framework to implement biomedical named entity recognition (BioNER) is currently a conventional method. This method, however, often cannot take full advantage of the semantic information in the dataset, and the performance is not always satisfactory. In this work, instead of treating the BioNER task as a sequence labeling problem, we formulate it as a machine reading comprehension (MRC) problem. This formulation can introduce more prior knowledge through well-designed queries, and no longer needs decoding processes such as conditional random fields (CRF). We conduct experiments on six BioNER datasets, and the experimental results demonstrate the effectiveness of our method. Our method achieves state-of-the-art (SOTA) performance on the BC4CHEMD, BC5CDR-Chem, BC5CDR-Disease, NCBI-Disease, BC2GM and JNLPBA datasets, achieving F1-scores of 92.92%, 94.19%, 87.83%, 90.04%, 85.48% and 78.93%, respectively.
Jayanta Kumar Das, Subhadip Chakraborty,
Journal of Biomedical Informatics, Volume 118, pp 103801-103801; doi:10.1016/j.jbi.2021.103801

Understanding the molecular mechanism of COVID-19 pathogenesis helps in rapid therapeutic target identification. Viral proteins usually target host proteins in an organized fashion, and the expression of any viral gene depends mostly on the host translational machinery. Recent studies report the great significance of codon usage biases in establishing host–viral protein–protein interactions (PPI). Exploring the codon usage patterns between a pair of co-evolved host and viral proteins may present novel insight into the host–viral protein interactomes during disease pathogenesis. Leveraging the similarity in codon usage patterns, we propose a computational scheme to recreate the host–viral protein–protein interaction network. We use host proteins from seventeen (17) essential signaling pathways in this work towards understanding the possible targeting mechanism of SARS-CoV-2 proteins. We infer both negatively and positively interacting edges in the network. Further, extensive analysis is performed to understand the host PPI network topologically and the attacking behavior of the viral proteins. Our study reveals that viral proteins mostly utilize codons that are rare in the targeted host proteins (negatively correlated interaction). Among them, the non-structural protein NSP3 and the structural protein Spike (S) are the most influential in interacting with multiple host proteins. When ranking the most affected pathways, the MAPK pathways are observed to be the worst affected during SARS-CoV-2 infection. Several proteins participating in multiple pathways are highly central in the host PPI and are mostly targeted by multiple viral proteins. We observe many potential targets (host proteins) from the affected pathways associated with various drug molecules, including Arsenic trioxide, Dexamethasone, Hydroxychloroquine, Ritonavir, and Interferon beta, which are either under clinical trial or in use during COVID-19.
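A minimal sketch of the codon-usage comparison underlying such a scheme: compute relative codon frequencies for a host and a viral coding sequence and correlate them, reading a negative coefficient as a candidate "negatively correlated" interaction. The toy sequences and the use of plain Pearson correlation are assumptions for illustration, not the paper's exact pipeline.

```python
from itertools import product

# All 64 codons over the DNA alphabet, in a fixed order.
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]

def codon_freqs(cds):
    """Relative codon frequencies of a coding sequence (read in frame)."""
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    total = len(codons)
    return [codons.count(c) / total for c in CODONS]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy)

# Toy host and viral coding sequences (illustrative only).
host = "ATGGCTGCTGCTGGTGGTAAATAA"
virus = "ATGGCAGCAGCAGGAGGAAAGTGA"
r = pearson(codon_freqs(host), codon_freqs(virus))
# A clearly negative r would be read as a candidate negatively
# correlated host-viral interaction under this scheme.
```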
Journal of Biomedical Informatics, Volume 118; doi:10.1016/s1532-0464(21)00156-8

Journal of Biomedical Informatics, Volume 118; doi:10.1016/s1532-0464(21)00157-x

Shilpa Sethi, , Trilok Kaushik
Journal of Biomedical Informatics, Volume 120; doi:10.1016/j.jbi.2021.103848

Effective strategies to restrain the COVID-19 pandemic need high attention to mitigate its negative impact on communal health and the global economy, with the full horizon yet to unfold. In the absence of effective antivirals and with limited medical resources, many measures are recommended by WHO to control the infection rate and avoid exhausting those limited resources. Wearing a mask is among the non-pharmaceutical intervention measures that can be used to cut the primary source of SARS-CoV-2 droplets expelled by an infected individual. Regardless of discourse on medical resources and diversity in masks, all countries are mandating coverings over the nose and mouth in public. To contribute towards communal health, this paper aims to devise a highly accurate and real-time technique that can efficiently detect non-mask faces in public and thus enforce mask-wearing. The proposed technique is an ensemble of one-stage and two-stage detectors that achieves low inference time and high accuracy. We start with ResNet50 as a baseline and apply transfer learning to fuse high-level semantic information across multiple feature maps. In addition, we propose a bounding-box transformation to improve localization performance during mask detection. Experiments were conducted with three popular baseline models, viz. ResNet50, AlexNet and MobileNet, exploring the possibility of plugging these models into the proposed model so that highly accurate results can be achieved in less inference time. The proposed technique achieves high accuracy (98.2%) when implemented with ResNet50. Besides, the proposed model yields 11.07% and 6.44% higher precision and recall in mask detection compared with the recently published RetinaFaceMask detector. The outstanding performance of the proposed model makes it highly suitable for video surveillance devices.
Laura Evans, , Matvey B. Palchuk
Journal of Biomedical Informatics, Volume 119; doi:10.1016/j.jbi.2021.103847

Analysis of healthcare Real-World Data (RWD) provides an opportunity to observe actual patient diagnostic, treatment and outcomes events. However, researchers should understand the possible limitations of RWD. In particular, these data may be incomplete, which would affect the validity of study conclusions. The completeness of medication RWD was investigated by analyzing the incidence of various diagnosis-medication couplets: the occurrence of a certain medication in the RWD for a patient having a certain diagnosis. Diagnosis and medication data were obtained from 61 U.S. medical data provider organizations, members of the TriNetX global research network. The number of patients having 22 diagnoses and expected medications was obtained at each institution, and the percent completeness of each diagnosis-medication couplet was calculated. The study hypothesis is that the degree of couplet completeness can serve as a proxy for the overall completeness of medication data for a given organization. Five diagnosis-medication couplets were found to be reliable proxies, each having a peak observed completeness of at least 87% across the organizations studied: Type 1 diabetes mellitus and insulin; asthma and albuterol; congestive heart failure and diuretics; cardiovascular disease and aspirin; hypothyroidism and levothyroxine. These couplets were validated as reliable indicators by determining their status as standards of care. The degree to which patients with these five diagnoses had the specified associated medication was consistent within an organization's data set. The overall degree of medication data completeness for an organization can therefore be assessed by measuring the completeness of certain indicator diagnosis-medication couplets.
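The couplet-completeness measure reduces to a simple proportion. The sketch below computes it for one hypothetical organization; the counts are invented for illustration.

```python
def couplet_completeness(patients_with_dx, patients_with_dx_and_med):
    """Percent completeness of a diagnosis-medication couplet at one
    organization: of the patients carrying the diagnosis, the share that
    also has the expected medication on record."""
    if patients_with_dx == 0:
        return None  # couplet not measurable at this organization
    return 100.0 * patients_with_dx_and_med / patients_with_dx

# e.g. the type 1 diabetes / insulin couplet at a hypothetical
# organization with 1,200 diagnosed patients, 1,044 of whom have
# an insulin record:
pct = couplet_completeness(1200, 1044)
```

Comparing such percentages across indicator couplets and organizations is what lets a low value flag likely missing medication data rather than a true absence of treatment.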
Jiaonan Ma, Xueli Xu, Mengdi Li, Yan Zhang, Lin Zhang, Ping Ma, Jie Hou, Yulin Lei, Jianguo Liu, Xiaojin Huangfu, et al.
Journal of Biomedical Informatics, Volume 120; doi:10.1016/j.jbi.2021.103855

Aging is a major risk factor for various eye diseases, such as cataract, glaucoma, and age-related macular degeneration. Age-related changes are observed in almost all structures of the human eye. Considerable individual variations exist within a group of similarly aged individuals, indicating the need for more informative biomarkers for assessing the aging of the eyes. The morphology of the ocular anterior segment has been reported to vary across age groups, focusing on only a few corneal parameters, such as keratometry and thickness of the cornea, which could not provide accurate estimation of age. Thus, the association between eye aging and the morphology of the anterior segment remains elusive. In this study, we aimed to develop a predictive model of age based on a large number of anterior segment morphology-related features, measured via the high-resolution ocular anterior segment analysis system (Pentacam). This approach allows for an integrated assessment of age-related changes in corneal morphology, and the identification of important morphological features associated with different eye aging patterns. Three machine learning methods (neural networks, Lasso regression and extreme gradient boosting) were employed to build predictive models using 276 anterior segment features of 63,753 participants from 10 ophthalmic centers in 10 different cities of China. The best performing age prediction model achieved a median absolute error of 2.80 years and a mean absolute error of 3.89 years in the validation set. An external cohort of 100 volunteers was used to test the performance of the prediction model. The developed neural network model achieved a median absolute error of 3.03 years and a mean absolute error of 3.40 years in the external cohort. In summary, our study revealed that the anterior segment morphology of the human eye may be an informative and non-invasive indicator of eye aging. 
This could prompt doctors to focus age-related medical interventions on ocular health.
, , Laurent Meesseman, Jos De Roo, Martijn Vanbiervliet, Jos De Baerdemaeker, Herman Muys, , ,
Journal of Biomedical Informatics, Volume 118; doi:10.1016/j.jbi.2021.103783

Machine learning (ML) algorithms are now widely used to predict acute events in clinical applications. While most such prediction applications are developed to predict the risk of a particular acute event at one hospital, few efforts have been made to extend the developed solutions to other events or to different hospitals. We provide a scalable solution for developing clinical risk prediction models for multiple diseases and deploying them in different Electronic Health Record (EHR) systems. We defined a generic process for clinical risk prediction model development, and created a calibration tool to automate the model generation process. We applied the model calibration process at four hospitals, generating risk prediction models for delirium, sepsis and acute kidney injury (AKI) at each of these hospitals. The delirium risk prediction models have on average an area under the receiver-operating characteristic curve (AUROC) of 0.82 at admission and 0.95 at discharge on the test datasets of the four hospitals. The sepsis models have on average an AUROC of 0.88 and 0.95, and the AKI models on average 0.85 and 0.92, at the day of admission and discharge respectively. The scalability discussed in this paper is based on building common data representations (syntactic interoperability) between EHRs stored in different hospitals. Semantic interoperability, the more challenging requirement that different EHRs share the same meaning of data, e.g. the same lab coding system, is not mandated by our approach. Our study describes a method to develop and deploy clinical risk prediction models in a scalable way, and we demonstrate its feasibility by developing risk prediction models for three diseases across four hospitals.
Journal of Biomedical Informatics, Volume 117; doi:10.1016/j.jbi.2021.103752

The detection of medical abuse is essential because medical abuse imposes extra payments on individual insurance fees and increases unnecessary social costs. To reduce the costs due to medical abuse, insurance companies hire medical experts who examine claims suspected to arise from overtreatment by institutions and review the suitability of the claimed treatments. Owing to the limited number of reviewers and the mounting volume of claims, there is a need for a comprehensive method of detecting medical abuse that uses a scoring model to select a few institutions to be investigated. Numerous studies on detecting medical abuse have focused on institution-level variables, such as the average hospitalization period and average medical expenses, to compute an abuse score and select institutions based on it. However, these studies use simple variables to construct a model, which performs poorly at detecting complex abusive billing patterns. Institution-level variables can easily represent the characteristics of institutions, but loss of information is inevitable; this loss can be reduced by using the finest granularity of data, with treatment-level variables. In this study, we develop a scoring model using treatment-level information, and it is the first of its kind to use a patient classification system (PCS) to improve the detection of medical abuse. A PCS classifies patients in terms of clinical significance and consumption of medical resources. Because PCS is based on diagnosis, patients grouped according to PCS tend to suffer from similar diseases. Claim data segmented by PCS comprise patients with fewer types of diseases; hence, the data distribution by PCS is more homogeneous than data classified by medical department.
We define an abusive institution as one having a large number of abused treatments that account for a large total abuse amount, and the main idea of our model is that the abuse score of an institution is approximated as the sum of the abuse scores of all treatments claimed by the institution. The proposed method consists of two steps: training a binary classification model to predict the abusiveness of each treatment, and yielding an abuse score for each institution by aggregating the predicted abusiveness. The resulting abuse score is used to prioritize institutions for investigation. We tested the performance of our model against the scoring model employed by the insurance review agency in South Korea, using real-world claim data submitted to the agency. We compared the models on efficiency, which represents the extent to which a model detects abused amounts per treatment. Experimental results show that the proposed model has an efficiency up to 3.57 times higher than the model employed by the agency. In addition, we put forward an efficient and realistic reviewing process for when the proposed scoring model is applied to the existing process; the proposed process has an efficiency up to 2.17 times higher than the existing process.
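The aggregation step of the two-step method can be sketched in a few lines: sum each institution's predicted per-treatment abusiveness and rank institutions by the total. The probability values and institution names below are illustrative stand-ins for a trained classifier's outputs.

```python
# Step 1 (assumed done): a binary classifier scores each claimed
# treatment with an abusiveness probability. Step 2: aggregate.
claims = [
    {"institution": "A", "p_abuse": 0.9},
    {"institution": "A", "p_abuse": 0.7},
    {"institution": "B", "p_abuse": 0.1},
    {"institution": "B", "p_abuse": 0.2},
    {"institution": "B", "p_abuse": 0.05},
]

# Institution abuse score = sum of predicted abusiveness over its claims.
scores = {}
for c in claims:
    scores[c["institution"]] = scores.get(c["institution"], 0.0) + c["p_abuse"]

# Investigate institutions in descending score order.
priority = sorted(scores, key=scores.get, reverse=True)
```

Summing (rather than averaging) matches the stated definition of an abusive institution: many abused treatments with a large total abuse amount both push the score up.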
, Jonathan R. Brestoff, Ronald Jackups
Journal of Biomedical Informatics, Volume 117; doi:10.1016/j.jbi.2021.103756

Clinicians order laboratory tests in an effort to reduce diagnostic or therapeutic uncertainty. Information theory provides the opportunity to quantify the degree to which a test result is expected to reduce diagnostic uncertainty. We sought to apply information theory toward the evaluation and optimization of a diagnostic test threshold and to determine if the results would differ from those of conventional methodologies. We used a heparin/PF4 immunoassay (PF4 ELISA) as a case study. The laboratory database was queried for PF4 ELISA and serotonin release assay (SRA) results during the study period, with the latter serving as the gold standard for the disease heparin-induced thrombocytopenia (HIT). The optimized diagnostic threshold of the PF4 ELISA test was compared using conventional versus information-theoretic approaches under idealized (pretest probability = 50%) and realistic (pretest probability = 2.4%) testing conditions. Under ideal testing conditions, both analyses yielded a similar optimized optical density (OD) threshold of OD > 0.79. Under realistic testing conditions, information theory suggested a higher threshold, OD > 1.5 versus OD > 0.6. Increasing the diagnostic threshold improved the global information value, the value of a positive test and the noise content with only a minute change in the negative test value. Our information-theoretic approach suggested that the current FDA-approved cutoff (OD > 0.4) is overly permissive, leading to loss of test value and injection of noise into an already complex diagnostic dilemma. Because our approach is purely statistical and takes as input data that are readily accessible in the clinical laboratory, it offers a scalable and data-driven strategy for optimizing test value that may be widely applicable in the domain of laboratory medicine. Information theory provides more meaningful measures of test value than the widely used accuracy-based metrics.
Lauren L. Staples, Morgan Tamayo, Bryan D. Yockey, Jessica M. Rudd, Nicole Hill, , Herman E. Ray, Joe DeMaio
Journal of Biomedical Informatics, Volume 117; doi:10.1016/j.jbi.2021.103759

Value-based healthcare in the US is a payment structure that ties reimbursement to quality rather than volume alone. One model of value-based care is the Tennessee Division of TennCare’s Episodes of Care program, which groups common health conditions into episodes using specified time windows, medical code sets and quality metrics as defined in each episode’s Detailed Business Requirements [1], [2]. Tennessee’s program assigns responsibility for an episode to a managing physician, presenting a unique opportunity to study physician variability in cost and quality within these structured episodes. This paper proposes a pipeline for analysis demonstrated using a cohort of 599 Outpatient and Non-Acute Inpatient Cholecystectomy episodes managed by BlueCross BlueShield of Tennessee in 2016. We sorted episode claims by date of service, then calculated the pairwise Levenshtein distance between all episodes. Next, we adjusted the resulting matrix by cost dissimilarity and performed agglomerative clustering. We then examined the lowest and highest average episode cost clusters for patterns in cost and quality. Our results indicate that the facility type where the surgery takes place is important: outpatient ambulatory care center for the lowest cost cluster, and hospital operating room for the highest cost cluster. Average patient risk scores were higher in the highest cost cluster than the lowest cost cluster. Readmission rate (a quality metric tied to managing physician performance) was low for the whole cohort. Lastly, we explain how our analytical pipeline can be generalized and extended to domains beyond Episodes of Care.
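The pairwise-distance step of the pipeline above can be sketched as a standard Levenshtein (edit distance) computation over date-ordered claim-code sequences; the claim codes below are invented for illustration.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (e.g. ordered claim codes),
    computed with the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (x != y)))   # substitution
        prev = cur
    return prev[-1]

# Two cholecystectomy episodes as date-ordered claim-code sequences
# (codes are illustrative, not the program's actual code sets):
ep1 = ["E/M", "US", "LAB", "SURG-OP"]        # outpatient surgery center
ep2 = ["E/M", "LAB", "SURG-HOSP", "READM"]   # hospital OR, readmission
d = levenshtein(ep1, ep2)
```

In the full pipeline these pairwise distances form a matrix that is then adjusted by cost dissimilarity and fed to agglomerative clustering.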
Mervin Joe Thomas, Vishnu Lal, Ajith Kurian Baby, Muhammad Rabeeh Vp, Alosh James,
Journal of Biomedical Informatics, Volume 117, pp 103787-103787; doi:10.1016/j.jbi.2021.103787

The COVID-19 pandemic is continuing, and it is too early to completely quantify the innovative and efficient contributions of emerging modern technologies to the pandemic response. Digital technologies are not a final solution but are tools that facilitate a quick and effective pandemic response. Accordingly, mobile applications, robots and drones, social media platforms (such as search engines, Twitter, and Facebook), television, and associated technologies deployed in tackling the COVID-19 (SARS-CoV-2) outbreak are discussed, emphasizing the current state of the art. Reported literature, press releases, and organizational claims are collectively reviewed. This review addresses and highlights how these modern technological solutions can aid healthcare (contact tracing, real-time isolation monitoring/screening, disinfection, quarantine enforcement, syndromic surveillance, and mental health), communication (remote assistance, information sharing, and communication support), logistics, tourism, and hospitality. The study discusses the benefits of these digital technologies in curtailing the pandemic and how the different sectors adapted to them in a short period. The roles of social media and television in ensuring global connectivity and serving as a common platform for sharing authentic information with the general public are summarized, as is the role of the World Health Organization and governments worldwide in preventing the propagation of false news, spreading awareness, and diminishing the severity of COVID-19. This collective review should be helpful to investigators, health departments, government organizations, and policymakers alike in facilitating a quick and effective pandemic response.
Journal of Biomedical Informatics, Volume 117; doi:10.1016/j.jbi.2021.103757

This work presents a detailed and complete review of publications on the pupillary light reflex (PLR) used to aid diagnoses. It covers the computational techniques used in the evaluation of pupillometry, as well as their application in computer-aided diagnosis (CAD) of pathologies or physiological conditions that can be studied by observing the miosis and mydriasis movements of the human pupil. A careful survey was carried out of all studies published over the last 10 years that investigated electronic devices, recording protocols, image treatment, computational algorithms and the pathologies related to PLR. We present the frontier of existing knowledge regarding methods and techniques used in this field, which has been expanding due to the possibility of performing diagnoses with high precision, at low cost and with a non-invasive method.
Jianfei Cui, He Zhu, Hao Deng, ,
Journal of Biomedical Informatics, Volume 117; doi:10.1016/j.jbi.2021.103735

Electronic medical records are restricted and difficult to centralize for machine learning model training due to privacy and regulatory issues. One solution is to train models in a distributed manner that involves many parties in the process. However, certain parties may not be trustworthy; in this project, we propose an alternative to traditional federated learning with a central analyzer, enabling training when no trustworthy central analyzer is available. The proposed algorithm, called federated machine learning with anonymous random hybridization (abbreviated as "FeARH"), mainly uses a hybridization algorithm to weaken the link between medical record data and model parameters by adding randomization to the parameter sets shared with other parties. In our experiments, the new algorithm achieves AUCROC and AUCPR results similar to both centralized machine learning and the original federated machine learning.
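The core hybridization step described in this abstract can be sketched roughly as follows. The element-wise random recombination of parameter sets is an illustrative reading of the method, not the authors' reference implementation:

```python
import numpy as np

def hybridize(param_sets, rng=None):
    """Anonymously hybridize model parameter sets before sharing.

    For every parameter position, a random donor party is chosen, so no
    shared vector can be linked back to a single party's training data.
    """
    rng = np.random.default_rng(rng)
    params = np.stack(param_sets)            # shape: (n_parties, n_params)
    n_parties, n_params = params.shape
    # For each output vector and each coordinate, pick a random donor party.
    donors = rng.integers(0, n_parties, size=(n_parties, n_params))
    hybrids = params[donors, np.arange(n_params)]
    return list(hybrids)

# Three hypothetical parties, each with a 4-parameter model
parties = [np.zeros(4), np.ones(4), np.full(4, 2.0)]
shared = hybridize(parties, rng=0)
```

Because every shared vector draws each coordinate from a randomly chosen party, an observer cannot attribute a shared parameter set to any single party's medical records.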
, J.A. Maldonado, D. Boscá, S. Salas-García, M. Robles
Journal of Biomedical Informatics, Volume 117; doi:10.1016/j.jbi.2021.103747

SNOMED CT Expression Constraint Language (ECL) is a declarative language developed by SNOMED International for the definition of SNOMED CT Expression Constraints (ECs). ECs are executable expressions that define intensional subsets of clinical meanings by stating constraints over the logic definition of concepts. The execution of an EC on some SNOMED CT substrate yields the intended subset and requires an execution engine able to receive an EC as input, execute it, and return the matching concepts. An important use of subsets of clinical concepts is terminology binding between clinical information models and terminologies, which defines the set of valid values of codified data. Our goals are to define and implement methods for the simplification, semantic validation, and execution of ECs over a graph-oriented SNOMED CT database; to provide a method for the visual representation of subsets in order to explore, understand, and validate their content; and to develop an EC execution platform, called SNQuery, that makes use of these methods. Since SNOMED CT is a directed acyclic graph, we use a graph-oriented database to represent its content, where the schema and instances are represented as graphs and data manipulation is expressed by graph-oriented operations. To execute ECs over the graph database, a translation process converts each EC into a set of Cypher Query Language queries. We define EC simplification methods that leverage the logic structure underlying SNOMED CT; their purpose is to reduce the complexity of ECs and, in turn, their execution time, as well as to validate them against the SNOMED CT Concept Model and the logical definitions of concepts. We also developed a graphic representation based on the circle packing geometrical concept, which allows validating subsets, as well as pre-defined refsets and the terminology itself. 
We have developed SNQuery, a platform for the definition of intensional subsets of SNOMED CT concepts by means of the execution of ECs over a graph-oriented SNOMED CT database. Additionally, we have incorporated methods for the simplification and semantic validation of ECs, as well as for the visualization of subsets as a mechanism to understand and validate them. SNQuery has been evaluated in terms of EC execution times. In this paper, we provide methods to simplify, semantically validate and execute ECs over a graph-oriented database. We also offer a method to visualize the intensional subsets obtained by executing ECs to explore, understand and validate them, as well as refsets and the terminology itself. The definition of intensional subsets is useful to bind content between clinical information models and clinical terminologies, which is a necessary step to achieve semantic interoperability between EHR systems.
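To illustrate the kind of translation this abstract describes, here is a minimal sketch mapping simple ECL hierarchy operators to Cypher path queries. The node label `Concept`, relationship type `ISA`, and property `sctid` are assumed names, not necessarily the schema SNQuery uses:

```python
def ec_to_cypher(operator, sctid):
    """Translate a simple ECL hierarchy constraint into a Cypher query.

    Covers only the basic descendant/ancestor operators; real ECL also
    has refinements, conjunctions, and cardinalities.
    """
    patterns = {
        "<<": "MATCH (c:Concept)-[:ISA*0..]->(:Concept {sctid: %s})",  # self or descendant
        "<":  "MATCH (c:Concept)-[:ISA*1..]->(:Concept {sctid: %s})",  # descendant only
        ">>": "MATCH (c:Concept)<-[:ISA*0..]-(:Concept {sctid: %s})",  # self or ancestor
    }
    return patterns[operator] % sctid + " RETURN c.sctid"

# '<< 73211009' = 73211009 |Diabetes mellitus| and all of its descendants
query = ec_to_cypher("<<", 73211009)
```

The variable-length relationship pattern (`*0..`, `*1..`) is what lets a single Cypher query walk the whole |is a| hierarchy, which is the appeal of a graph database for this task.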
, Sachiko Kodera, Hidenobu Shirakami, Ryotetsu Kawaguchi, Kazuhiro Watanabe, Akimasa Hirata
Journal of Biomedical Informatics, Volume 117; doi:10.1016/j.jbi.2021.103743

Accurate forecasting of medical service requirements is an important big data problem that is crucial for resource management in critical times such as natural disasters and pandemics. With the global spread of coronavirus disease 2019 (COVID-19), several concerns have been raised regarding the ability of medical systems to handle sudden changes in the daily routines of healthcare providers. One significant problem is the management of ambulance dispatch and control during a pandemic. To help address this problem, we first analyze ambulance dispatch data records from April 2014 to August 2020 for Nagoya City, Japan. Significant changes were observed in the data during the pandemic, including the state of emergency (SoE) declared across Japan. In this study, we propose a deep learning framework based on recurrent neural networks to estimate the number of emergency ambulance dispatches (EADs) during an SoE. The fused data include environmental factors, the localization data of mobile phone users, and the past history of EADs, providing a general framework for knowledge discovery and better resource management. The results indicate that the proposed blend of training data can be used efficiently in a real-world estimation of EAD requirements during periods of high uncertainty such as pandemics.
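The data fusion described above can be sketched as a windowing step that merges the EAD history with exogenous features before feeding a recurrent model. The feature names and shapes here are illustrative assumptions, not the paper's actual inputs:

```python
import numpy as np

def make_windows(ead, weather, mobility, lookback=7):
    """Fuse the past EAD history with same-day environmental and
    mobility features into supervised (X, y) samples for a forecaster."""
    X, y = [], []
    for t in range(lookback, len(ead)):
        # lookback days of dispatch counts + today's exogenous features
        X.append(np.concatenate([ead[t - lookback:t], weather[t], mobility[t]]))
        y.append(ead[t])
    return np.array(X), np.array(y)

# 10 days of dummy data: daily EAD counts, 2 weather features, 1 mobility feature
ead = np.arange(10, dtype=float)
weather = np.zeros((10, 2))
mobility = np.zeros((10, 1))
X, y = make_windows(ead, weather, mobility, lookback=3)
```

Each row of `X` would then be fed, day by day, into an RNN such as an LSTM; the windowing itself is independent of the network architecture.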
Journal of Biomedical Informatics, Volume 117; doi:10.1016/j.jbi.2021.103764

Cancer, in particular breast cancer, is considered one of the most common causes of death worldwide according to the World Health Organization. For this reason, extensive research efforts have been made toward accurate and early diagnosis of cancer in order to increase the likelihood of cure. Among the available tools for diagnosing cancer, microarray technology has proven effective. Microarray technology analyzes the expression levels of thousands of genes simultaneously. Although the huge number of features or genes in microarray data may seem advantageous, many of these features are irrelevant or redundant, resulting in deteriorated classification accuracy. To overcome this challenge, feature selection techniques are a mandatory preprocessing step before classification. In this paper, the main feature selection and classification techniques introduced in the literature for cancer (particularly breast cancer) are reviewed with the aim of improving microarray-based classification.
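As a rough illustration of the filter-style feature selection step such reviews cover, the sketch below scores genes with a simple mean-difference statistic. It is a generic example, not one of the specific techniques surveyed in the paper:

```python
import numpy as np

def select_top_genes(X, y, k=10):
    """Filter-type gene selection: score each gene by the absolute
    difference of class means divided by its overall standard deviation,
    then keep the k highest-scoring genes."""
    m0, m1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    spread = X.std(axis=0) + 1e-9          # avoid division by zero
    score = np.abs(m0 - m1) / spread
    return np.argsort(score)[::-1][:k]     # indices of top-k genes

# Toy microarray: 20 samples x 100 genes, gene 0 is truly discriminative
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 100))
y = np.array([0] * 10 + [1] * 10)
X[y == 1, 0] += 5.0                        # shift gene 0 for class 1
top = select_top_genes(X, y, k=5)
```

Filters like this are cheap and classifier-agnostic, which is why they are a common first pass before wrapper or embedded methods on microarray data.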
, Jing Guo, Pei Wang, Yaowei Wang, Minghao Yu, Xiang Wang, Po Yang,
Journal of Biomedical Informatics, Volume 117, pp 103736-103736; doi:10.1016/j.jbi.2021.103736

The recent outbreak of COVID-19 has infected millions of people around the world, leading to a global emergency. In the event of a virus outbreak, it is crucial to identify the carriers of the virus promptly and precisely so that the animal origins can be isolated against further infection. Traditional identification relies on field and laboratory research, which lags behind the response needed for emerging epidemic prevention. With the development of machine learning, the efficiency of predicting viral hosts has been demonstrated by recent researchers. However, limited annotated virus data and imbalanced host information restrict these approaches from obtaining better results. To ensure highly reliable prediction of the animal origins of COVID-19, we extend transfer learning and ensemble learning to present a hybrid transfer learning model. When predicting the hosts of a newly discovered virus, our model provides a novel way to use a related virus domain as an auxiliary to help build a robust model for the target virus domain. The simulation results on several UCI benchmarks and viral genome datasets demonstrate that our model outperforms classical methods under conditions of limited target training sets and class imbalance. By setting coronaviruses as the target domain and other related viruses as the source domain, the feasibility of our approach is evaluated. Finally, we present the animal reservoir prediction for COVID-19 for further analysis.
Journal of Biomedical Informatics, Volume 117; doi:10.1016/j.jbi.2021.103767

Argument Mining (AM) refers to the task of automatically identifying arguments in a text and finding their relations. In medical literature this is done by identifying Claims and Premises and classifying their relations as either Support or Attack. Evidence-Based Medicine (EBM) refers to the task of identifying all related evidence in medical literature to allow medical practitioners to make informed choices and form accurate treatment plans. This is achieved through the automatic identification of Population, Intervention, Comparator and Outcome entities (PICO) in the literature to limit the collection to only the most relevant documents. In this work, we combine EBM with AM in medical literature to increase the performance of the individual models and create high quality argument graphs, annotated with PICO entities. To that end, we introduce a state-of-the-art EBM model, used to predict the PICO entities and two novel Argument Identification and Argument Relation classification models that utilize the PICO entities to enhance their performance. Our final system works in a pipeline and is able to identify all PICO entities in a medical publication, the arguments presented in them and their relations.
Journal of Biomedical Informatics, Volume 117; doi:10.1016/j.jbi.2021.103758

Protecting the privacy of patient data is an important issue. Patient data are typically protected in local health systems, but this makes integration of data from different healthcare systems difficult. To build high-performance predictive models, a large number of samples are needed, and performance measures such as calibration and discrimination are essential. While distributed algorithms for building models and measuring discrimination have been published, distributed algorithms to measure calibration and recalibrate models have not been proposed. Recalibration models have been shown to improve calibration, but they have not been proposed for data that are distributed in various health systems, or "sites". Our goal is to measure calibration performance and build a global recalibration model using data from multiple health systems, without sharing patient-level data. We developed a distributed smooth isotonic regression recalibration model and extended established calibration measures, such as Hosmer-Lemeshow tests, Expected Calibration Error, and Maximum Calibration Error, in a distributed manner. Experiments on both simulated and clinical data were conducted, and the recalibration results produced by a traditional (i.e., centralized) versus a distributed smooth isotonic regression were compared. The results were exactly the same. Our algorithms demonstrated that calibration can be improved and measured in a distributed manner while protecting data privacy, albeit at some cost in terms of computational efficiency. They also give researchers who may have too few instances in their own institutions a method to construct robust recalibration models. Preserving data privacy and improving model calibration are both important to advancing predictive analysis in clinical informatics. The algorithms alleviate the difficulties in model building across sites.
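The distributed calibration idea can be illustrated with Expected Calibration Error: each site shares only per-bin aggregates, and the pooled ECE equals the centralized value exactly. This is a minimal sketch of that principle, not the authors' smooth isotonic regression implementation:

```python
import numpy as np

def site_bin_stats(probs, labels, n_bins=10):
    """Per-site summary: for each probability bin, the count, the sum of
    predicted probabilities, and the sum of observed labels. Only these
    aggregates leave the site, never patient-level data."""
    bins = np.minimum((np.asarray(probs) * n_bins).astype(int), n_bins - 1)
    count = np.bincount(bins, minlength=n_bins)
    p_sum = np.bincount(bins, weights=probs, minlength=n_bins)
    y_sum = np.bincount(bins, weights=labels, minlength=n_bins)
    return count, p_sum, y_sum

def global_ece(site_stats):
    """Pool the per-site aggregates and compute Expected Calibration
    Error: the count-weighted mean |avg prediction - observed rate|."""
    count = sum(s[0] for s in site_stats)
    p_sum = sum(s[1] for s in site_stats)
    y_sum = sum(s[2] for s in site_stats)
    nonempty = count > 0
    gaps = np.abs((p_sum[nonempty] - y_sum[nonempty]) / count[nonempty])
    return float(np.sum(count[nonempty] / count.sum() * gaps))

# Two "sites" whose pooled ECE equals the centralized computation
site_a = site_bin_stats([0.1, 0.9], [0, 1])
site_b = site_bin_stats([0.1, 0.9], [0, 1])
ece = global_ece([site_a, site_b])
```

Because ECE depends on the data only through per-bin sums, the distributed and centralized results agree to machine precision, mirroring the paper's "exactly the same" finding for its measures.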
David Cuadrado, David Riaño, Josep Gómez, Alejandro Rodríguez, María Bodí
Journal of Biomedical Informatics, Volume 117; doi:10.1016/j.jbi.2021.103768

Patients in intensive care units are heterogeneous, and the daily prediction of their days to discharge (DTD) is a complex task that practitioners and computers are not always able to solve satisfactorily. To build more precise DTD predictors, tools are needed for analyzing the heterogeneity of the patients. Unfortunately, publications in this field are almost non-existent. To alleviate this lack of tools, we propose four methods and their corresponding measures to quantify the heterogeneity of intensive care patients in the process of determining the DTD. These new methods and measures were tested on patients admitted over four years to a tertiary hospital in Spain. The results deepen the understanding of the intensive care patient and can serve as a basis for the construction of better DTD predictors.
Amy Junghyun Lee, , Youngbin Shin, Jiwoo Lee, Hyo Jung Park, Young Chul Cho, Yousun Ko, Yu Sub Sung, Byung Sun Yoon
Journal of Biomedical Informatics, Volume 117; doi:10.1016/j.jbi.2021.103782

Major issues in imaging data management for tumor response assessment in clinical trials include frequent human errors in data input and unstandardized data structures, warranting a new breakthrough IT solution. Thus, we aim to develop a Clinical Data Interchange Standards Consortium (CDISC)-compliant clinical trial imaging management system (CTIMS) with automatic verification and transformation modules for implementing the CDISC Study Data Tabulation Model (SDTM) in the tumor response assessment dataset of clinical trials. In accordance with various CDISC standards guides and the Response Evaluation Criteria in Solid Tumors (RECIST) guidelines, the overall system architecture of the CDISC-compliant CTIMS was designed. Modules for a standard-compliant electronic case report form (eCRF), to verify data conformance and transform the data into SDTM format, were developed by experts in diverse fields including medical informatics, medicine, and clinical trials. External validation of the CDISC-compliant CTIMS was performed by comparing it with our previous CTIMS on real-world data and against the CDISC validation rules of the Pinnacle 21 Community software. The architecture of the CDISC-compliant CTIMS includes the standard-compliant eCRF module for RECIST, the automatic verification module for the input data, and the SDTM transformation module from the eCRF input data to the SDTM datasets based on CDISC Define-XML. This new system was incorporated into our previous CTIMS. External validation demonstrated that all 176 human input errors that occurred in the previous CTIMS were filtered by the new system, yielding zero errors and a CDISC-compliant dataset. The verified eCRF input data were automatically transformed into the SDTM dataset, which satisfied the CDISC validation rules of the Pinnacle 21 Community software. 
To assure data consistency and high quality of the tumor response assessment data, our new CTIMS can minimize human input error by using standard-compliant eCRF with an automatic verification module and automatically transform the datasets into CDISC SDTM format.
, Tom Lawton, John Burden, John McDermid, Ibrahim Habli
Journal of Biomedical Informatics, Volume 117; doi:10.1016/j.jbi.2021.103762

Machine learning (ML) has the potential to bring significant clinical benefits. However, there are patient safety challenges in introducing ML in complex healthcare settings and in assuring the technology to the satisfaction of the different regulators. The work presented in this paper tackles the urgent problem of proactively assuring ML in its clinical context as a step towards enabling the safe introduction of ML into clinical practice. In particular, the paper considers the use of deep reinforcement learning, a type of ML, for sepsis treatment. The methodology starts with the modelling of a clinical workflow that integrates the ML model for sepsis treatment recommendations. Safety analysis is then carried out on the clinical workflow, identifying hazards and safety requirements for the ML model. In this paper, the design of the ML model is enhanced to satisfy the safety requirements for mitigating a major clinical hazard: a sudden change of vasopressor dose. A rigorous evaluation is conducted to show how these requirements are met. A safety case is presented, providing a basis for regulators to make a judgement on the acceptability of introducing the ML model into sepsis treatment in a healthcare setting. The overall argument is broad in considering the wider patient safety considerations, but the detailed rationale and supporting evidence presented relate to this specific hazard. Whilst there are no agreed regulatory approaches to introducing ML into healthcare, the work presented in this paper shows a possible direction for overcoming this barrier and exploiting the benefits of ML without compromising safety.
Journal of Biomedical Informatics, Volume 117, pp 103760-103760; doi:10.1016/j.jbi.2021.103760

Since the first reported case in Wuhan in late 2019, COVID-19 has rapidly spread worldwide, dramatically impacting the lives of millions of citizens. To deal with the severe crisis resulting from the pandemic, worldwide institutions have been forced to make decisions that profoundly affect the socio-economic realm. Meanwhile, researchers from diverse knowledge areas are investigating the behavior of the disease in a race against time. In both cases, the lack of reliable data has been an obstacle to carrying out such tasks with accuracy. To tackle this challenge, COnVIDa has been designed and developed as a user-friendly tool that easily gathers rigorous multidisciplinary data related to the COVID-19 pandemic from different data sources. In particular, the pandemic expansion is analyzed not only through health variables but also through social variables, mobility, and more. Besides, COnVIDa permits users to smoothly join such data, compare them, and download them for further analysis. Due to the open-science nature of the project, COnVIDa is easily extensible to any other region of the planet. In this way, COnVIDa becomes a data facilitator for decision-making processes, as well as a catalyst for new scientific research related to this pandemic.
Hao Liu, Yuan Chi, Alex Butler, Yingcheng Sun,
Journal of Biomedical Informatics, Volume 117; doi:10.1016/j.jbi.2021.103771

We present the Clinical Trial Knowledge Base (CTKB), a regularly updated knowledge base of discrete clinical trial eligibility criteria equipped with a web-based user interface for querying and aggregate analysis of common eligibility criteria. We used a natural language processing (NLP) tool named Criteria2Query (Yuan et al., 2019) to transform free-text clinical trial eligibility criteria into discrete criteria concepts and attributes encoded using the widely adopted Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) and stored in a relational SQL database. A web application accessible via RESTful APIs was implemented to enable queries and visual aggregate analyses. We demonstrate CTKB's potential role in EHR phenotype knowledge engineering using ten validated phenotyping algorithms. At the time of writing, CTKB contained 87,504 distinctive OMOP CDM standard concepts, including Condition (47.82%), Drug (23.01%), Procedure (13.73%), Measurement (24.70%) and Observation (5.28%) concepts, with 34.78% appearing in inclusion criteria and 65.22% in exclusion criteria, extracted from 352,110 clinical trials. The average hit rate of criteria concepts in eMERGE phenotype algorithms is 77.56%. CTKB is a novel comprehensive knowledge base of discrete eligibility criteria concepts with the potential to enable knowledge engineering for clinical trial cohort definition, clinical trial population representativeness assessment, electronic phenotyping, and data gap analyses for using electronic health records to support clinical trial recruitment.
Journal of Biomedical Informatics, Volume 117; doi:10.1016/j.jbi.2021.103770

COVID-19 has pushed healthcare organizations to speak a common language to tackle the challenges of this pandemic. Using traditional health information systems (HISs) built on different technologies in hospitals leads to usability and incompatibility issues because of islands of information. By reshaping data sharing frameworks, healthcare professionals will have the tools allowing them to exchange important patient health information in real time. This information is needed to tackle the current crisis, and any we may face in the future.
Gang Yu, ZhongZhi Yu, Yemin Shi, Yingshuo Wang, Xiaoqing Liu, Zheming Li, Yonggen Zhao, Fenglei Sun, ,
Journal of Biomedical Informatics, Volume 117; doi:10.1016/j.jbi.2021.103754

Respiratory diseases, including asthma, bronchitis, pneumonia, and upper respiratory tract infection (RTI), are among the most common diseases in clinics. The similarities among the symptoms of these diseases preclude prompt diagnosis upon the patients' arrival. In pediatrics, the patients' limited ability to express their situation makes precise diagnosis even harder. This becomes worse in primary hospitals, where the lack of medical imaging devices and the doctors' limited experience further increase the difficulty of distinguishing among similar diseases. In this paper, a pediatric fine-grained diagnosis-assistant system is proposed to provide prompt and precise diagnosis using solely clinical notes upon admission, which would assist clinicians without changing the diagnostic process. The proposed system consists of two stages: a test result structuralization stage and a disease identification stage. The first stage structuralizes test results by extracting relevant numerical values from clinical notes, and the disease identification stage provides a diagnosis based on text-form clinical notes and the structured data obtained from the first stage. A novel deep learning algorithm was developed for the disease identification stage, in which techniques including adaptive feature infusion and multi-modal attentive fusion were introduced to fuse structured and text data together. Clinical notes from over 12,000 patients with respiratory diseases were used to train a deep learning model, and clinical notes from a non-overlapping set of about 1,800 patients were used to evaluate the performance of the trained model. The average precisions (AP) for pneumonia, RTI, bronchitis, and asthma are 0.878, 0.857, 0.714, and 0.825, respectively, achieving a mean AP (mAP) of 0.819. These results demonstrate that our proposed fine-grained diagnosis-assistant system provides precise identification of the diseases.
, Mari Ostendorf, Matthew Thompson, Meliha Yetisgen
Journal of Biomedical Informatics, Volume 117, pp 103761-103761; doi:10.1016/j.jbi.2021.103761

Coronavirus disease 2019 (COVID-19) is a global pandemic. Although much has been learned about the novel coronavirus since its emergence, there are many open questions related to tracking its spread, describing symptomology, predicting the severity of infection, and forecasting healthcare utilization. Free-text clinical notes contain critical information for resolving these questions. Data-driven, automatic information extraction models are needed to use this text-encoded information in large-scale studies. This work presents a new clinical corpus, referred to as the COVID-19 Annotated Clinical Text (CACT) Corpus, which comprises 1,472 notes with detailed annotations characterizing COVID-19 diagnoses, testing, and clinical presentation. We introduce a span-based event extraction model that jointly extracts all annotated phenomena, achieving high performance in identifying COVID-19 and symptom events with associated assertion values (0.83–0.97 F1 for events and 0.73–0.79 F1 for assertions). Our span-based event extraction model outperforms an extractor built on MetaMapLite for the identification of symptoms with assertion values. In a secondary use application, we predicted COVID-19 test results using structured patient data (e.g. vital signs and laboratory results) and automatically extracted symptom information, to explore the clinical presentation of COVID-19. Automatically extracted symptoms improve COVID-19 prediction performance, beyond structured data alone.
Eric Prud'Hommeaux, Josh Collins, David Booth, Kevin J. Peterson, Harold R. Solbrig,
Journal of Biomedical Informatics, Volume 117; doi:10.1016/j.jbi.2021.103755

Resource Description Framework (RDF) is one of the three standardized data formats in the HL7 Fast Healthcare Interoperability Resources (FHIR) specification and is being used by healthcare and research organizations to join FHIR and non-FHIR data. However, RDF had not previously been integrated into popular FHIR tooling packages, hindering the adoption of FHIR RDF in the semantic web and other communities. The objective of this study is to develop and evaluate a Java-based FHIR RDF data transformation toolkit to facilitate the use and validation of FHIR RDF data. We extended the popular HAPI FHIR tooling to add RDF support, thus enabling FHIR data in XML or JSON to be transformed to or from RDF. We also developed an RDF Shape Expression (ShEx)-based validation framework to verify conformance of FHIR RDF data to the ShEx schemas provided in the FHIR specification for FHIR versions R4 and R5. The effectiveness of ShEx validation was demonstrated by testing it against 2693 FHIR R4 examples and 2197 FHIR R5 examples that are included in the FHIR specification. A total of five types of errors (missing properties, unknown elements, missing resource types, invalid attribute values, and unknown resource names) were revealed in the R5 examples, demonstrating the value of ShEx in the quality assurance of the evolving R5 development. This FHIR RDF data transformation and validation framework, based on HAPI and ShEx, is robust and ready for community use in adopting FHIR RDF, improving FHIR data quality, and evolving the FHIR specification.
Journal of Biomedical Informatics, Volume 117; doi:10.1016/j.jbi.2021.103724

Causal inference is one of the most fundamental problems across all domains of science. We address the problem of inferring a causal direction from two observed discrete symbolic sequences X and Y. We present a framework which relies on lossless compressors for inferring context-free grammars (CFGs) from sequence pairs and quantifies the extent to which the grammar inferred from one sequence compresses the other sequence. We infer X causes Y if the grammar inferred from X better compresses Y than in the other direction. To put this notion to practice, we propose three models that use the Compression-Complexity Measures (CCMs) - Lempel-Ziv (LZ) complexity and Effort-To-Compress (ETC) to infer CFGs and discover causal directions without demanding temporal structures. We evaluate these models on synthetic and real-world benchmarks and empirically observe performances competitive with current state-of-the-art methods. Lastly, we present two unique applications of the proposed models for causal inference directly from pairs of genome sequences belonging to the SARS-CoV-2 virus. Using a large number of sequences, we show that our models capture directed causal information exchange between sequence pairs, presenting novel opportunities for addressing key issues such as contact-tracing, motif discovery, evolution of virulence and pathogenicity in future applications.
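A minimal sketch of compression-based direction inference, using zlib as a stand-in for the paper's grammar-inferring Lempel-Ziv and Effort-To-Compress measures:

```python
import zlib

def c(s: bytes) -> int:
    """Compressed length of s under a lossless compressor (zlib here;
    the paper infers context-free grammars instead)."""
    return len(zlib.compress(s, 9))

def causal_direction(x: bytes, y: bytes) -> str:
    """Infer a direction by conditional compression: compare the gain
    that knowing x gives when compressing y against the reverse gain.
    Larger gain in one direction suggests that sequence as the cause."""
    gain_xy = c(y) - (c(x + y) - c(x))  # how much x helps compress y
    gain_yx = c(x) - (c(y + x) - c(y))  # how much y helps compress x
    return "x->y" if gain_xy > gain_yx else "y->x"
```

This conveys only the asymmetry-of-compression idea; the paper's CCM-based models, evaluation protocol, and CFG inference are more elaborate.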
, Danilo M. Eler, Rogério E. Garcia, Ronaldo C.M. Correia, Rafael M.B. Rodrigues
Journal of Biomedical Informatics, Volume 117, pp 103753-103753; doi:10.1016/j.jbi.2021.103753

Visual analytics techniques are useful tools to support decision-making and cope with increasing data, particularly to monitor natural or artificial phenomena. When monitoring disease progression, visual analytics approaches help decision-makers to understand or even prevent dissemination paths. In this paper, we propose a new visual analytics tool for monitoring COVID-19 dissemination. We use k-nearest neighbors of cities to mimic neighboring cities and analyze COVID-19 dissemination based on comparing a city under consideration and its neighborhood. Moreover, such analysis is performed within periods, which facilitates the assessment of isolation policies. We validate our tool by analyzing the progression of COVID-19 in neighboring cities of São Paulo state, Brazil.
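The neighborhood construction can be sketched as k-nearest neighbors under great-circle distance. This is an illustrative reading; the tool's actual distance measure and data pipeline are not specified in the abstract:

```python
import math

def haversine(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * 6371 * math.asin(math.sqrt(h))

def k_nearest_cities(target, cities, k=3):
    """Neighborhood of a city = its k geographically nearest cities,
    against which its COVID-19 case curve can then be compared."""
    return sorted((c for c in cities if c != target),
                  key=lambda c: haversine(cities[target], cities[c]))[:k]

# Hypothetical coordinates (lat, lon)
cities = {"A": (0.0, 0.0), "B": (0.0, 1.0), "C": (0.0, 2.0), "D": (10.0, 10.0)}
nearest = k_nearest_cities("A", cities, k=2)
```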
, Sara C. Madeira, , Mamede de Carvalho, Alexandra M. Carvalho
Journal of Biomedical Informatics, Volume 117; doi:10.1016/j.jbi.2021.103730

Amyotrophic lateral sclerosis (ALS) is a neurodegenerative disease causing patients to quickly lose motor neurons. The disease is characterized by a fast functional impairment and ventilatory decline, leading most patients to die from respiratory failure. To estimate when patients should get ventilatory support, it is helpful to adequately profile the disease progression. For this purpose, we use dynamic Bayesian networks (DBNs), a machine learning model, that graphically represents the conditional dependencies among variables. However, the standard DBN framework only includes dynamic (time-dependent) variables, while most ALS datasets have dynamic and static (time-independent) observations. Therefore, we propose the sdtDBN framework, which learns optimal DBNs with static and dynamic variables. Besides learning DBNs from data, with polynomial-time complexity in the number of variables, the proposed framework enables the user to insert prior knowledge and to make inference in the learned DBNs. We use sdtDBNs to study the progression of 1214 patients from a Portuguese ALS dataset. First, we predict the values of every functional indicator in the patients’ consultations, achieving results competitive with state-of-the-art studies. Then, we determine the influence of each variable in patients’ decline before and after getting ventilatory support. This insightful information can lead clinicians to pay particular attention to specific variables when evaluating the patients, thus improving prognosis. The case study with ALS shows that sdtDBNs are a promising predictive and descriptive tool, which can also be applied to assess the progression of other diseases, given time-dependent and time-independent clinical observations.
Mehdi Mirzapour, Amine Abdaoui, Andon Tchechmedjiev, William Digan, Sandra Bringay,
Journal of Biomedical Informatics, Volume 117; doi:10.1016/j.jbi.2021.103733

The context of medical conditions is an important feature to consider when processing clinical narratives. NegEx and its extension ConText became the most well-known rule-based systems for determining whether a medical condition is negated, historical, or experienced by someone other than the patient in English clinical text. In this paper, we present a French adaptation and enrichment of FastContext, the most recent, n-trie engine-based implementation of the ConText algorithm. We compiled an extensive list of French lexical cues by automatic and manual translation and enrichment. To evaluate French FastContext, we manually annotated the context of medical conditions present in two types of clinical narratives: (i) death certificates and (ii) electronic health records. Results show good performance across different context values on both types of clinical notes (on average 0.93 and 0.86 F1, respectively). Furthermore, French FastContext outperforms previously reported French systems for negation detection when compared on the same datasets, and it is the first implementation of contextual temporality and experiencer identification reported for French. Finally, French FastContext has been implemented within the SIFR Annotator, a publicly accessible Web service to annotate French biomedical text data. To our knowledge, this is the first implementation of a Web-based ConText-like system in a publicly accessible platform allowing non-natural-language-processing experts to both annotate and contextualize medical conditions in clinical notes.
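A toy NegEx-style rule conveys the flavor of the trigger-and-window approach that ConText and FastContext formalize. The trigger list and fixed window below are illustrative; the real systems also handle termination terms, temporality, and experiencer:

```python
TRIGGERS = {"no", "not", "denies", "without", "absence", "pas", "sans", "aucun"}

def detect_negation(tokens, triggers=TRIGGERS, window=5):
    """Mark a token as negated when a negation trigger occurs within
    `window` tokens before it (a simplified NegEx-style scope rule)."""
    negated = [False] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok.lower() in triggers:
            # Everything in the forward window falls under the trigger's scope
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                negated[j] = True
    return negated
```

FastContext's contribution is doing this lookup efficiently (via an n-trie over multi-word cues) and extending it to historical and experiencer contexts, which this sketch omits.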
Mike Wong, Paul Previde, Jack Cole, Brook Thomas, Nayana Laxmeshwar, Emily Mallory, Jake Lever, Dragutin Petkovic, Russ B. Altman,
Journal of Biomedical Informatics, Volume 117; doi:10.1016/j.jbi.2021.103732

Understanding the relationships between genes, drugs, and disease states is at the core of pharmacogenomics. Two leading approaches for identifying these relationships in the medical literature are human-expert-led manual curation efforts and modern data-mining-based automated approaches. The former generates small amounts of high-quality data, while the latter offers large volumes of mixed-quality data. The algorithmically extracted relationships are often accompanied by supporting evidence, such as confidence scores, source articles, and surrounding contexts (excerpts) from the articles, that can be used as data quality indicators. Tools that can leverage these quality indicators to help the user gain access to larger and higher-quality data are needed. We introduce GeneDive, a web application for pharmacogenomics researchers and precision medicine practitioners that makes gene, disease, and drug interaction data easily accessible and usable. GeneDive is designed to meet three key objectives: (1) provide functionality to manage the information-overload problem and facilitate easy assimilation of supporting evidence, (2) support longitudinal and exploratory research investigations, and (3) offer integration of user-provided interaction data without requiring data sharing. GeneDive offers multiple search modalities, visualizations, and other features that guide the user efficiently to the information of their interest. To facilitate exploratory research, GeneDive makes the supporting evidence and context for each interaction readily available and allows the data quality threshold to be controlled by the user according to their risk tolerance level. The interactive search-visualization loop enables relationship discoveries between diseases, genes, and drugs that might not be explicitly described in the literature but are emergent from the source medical corpus and deductive reasoning. 
The ability to utilize the user's data, either in combination with the GeneDive native datasets or in isolation, promotes richer data-driven exploration and discovery. These functionalities, along with GeneDive's applicability to precision medicine (bringing the knowledge contained in the biomedical literature to bear on particular clinical situations and improving patient care), are illustrated through detailed use cases. GeneDive is a comprehensive, broad-use biological interactions browser. The GeneDive application and information about its underlying system architecture are available at. A GeneDive Docker image is also available for download at this URL, allowing users to (1) import their own interaction data securely and privately; and (2) generate and test hypotheses across their own and other datasets.
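GeneDive's user-adjustable data-quality threshold amounts to filtering interaction records by their extraction confidence score. A minimal sketch of that idea, with hypothetical record fields and scores (not GeneDive's actual data model):

```python
# A minimal sketch of confidence-threshold filtering over interaction
# records, in the spirit of GeneDive's user-controlled quality threshold.
# Field names and scores here are illustrative, not GeneDive's schema.

def filter_interactions(interactions, min_confidence):
    """Keep only interactions whose confidence meets the user's threshold."""
    return [ix for ix in interactions if ix["confidence"] >= min_confidence]

interactions = [
    {"gene": "CYP2D6", "drug": "codeine",  "confidence": 0.92},
    {"gene": "BRCA1",  "drug": "olaparib", "confidence": 0.45},
    {"gene": "VKORC1", "drug": "warfarin", "confidence": 0.88},
]

# A stricter threshold trades recall for precision.
high_quality = filter_interactions(interactions, 0.8)
```

Raising the threshold reflects a lower risk tolerance: fewer, but more reliable, interactions survive.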
, Peng Wei, Elmer V. Bernstam, Richard D. Boyce, Trevor Cohen
Journal of Biomedical Informatics, Volume 117; doi:10.1016/j.jbi.2021.103719

Drug safety research asks causal questions but relies on observational data. Confounding bias threatens the reliability of studies using such data. The successful control of confounding requires knowledge of variables called confounders, which affect both the exposure and the outcome of interest. Causal knowledge of dynamic biological systems is complex and challenging to acquire. Fortunately, computable knowledge mined from the literature may hold clues about confounders. In this paper, we tested the hypothesis that incorporating literature-derived confounders can improve causal inference from observational data. We introduce two methods (semantic vector-based and string-based confounder search) that query literature-derived information for confounder candidates to control, using SemMedDB, a database of computable knowledge mined from the biomedical literature. These methods search SemMedDB for confounders by applying a semantic constraint search for indications treated by the drug (exposure) that are also known to cause the adverse event (outcome). We then include the literature-derived confounder candidates in statistical and causal models derived from free-text clinical notes. For evaluation, we use a reference dataset widely used in drug safety containing labeled pairwise relationships between drugs and adverse events and attempt to rediscover these relationships from a corpus of 2.2M NLP-processed free-text clinical notes. We employ standard adjustment and causal inference procedures to predict and estimate causal effects by informing the models with varying numbers of literature-derived confounders and instantiating the exposure, outcome, and confounder variables in the models with dichotomous EHR-derived data. Finally, we compare the results from applying these procedures with naive measures of association (χ2 and reporting odds ratio) and with each other. We found semantic vector-based search to be superior to string-based search at reducing confounding bias.
However, the effect of including more rather than fewer literature-derived confounders was inconclusive. We recommend using targeted learning estimation methods that can address treatment-confounder feedback, in which confounders also behave as intermediate variables, and engaging subject-matter experts to adjudicate the handling of problematic confounders.
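The confounder-search constraint described above (indications the drug treats that are also known to cause the adverse event) can be sketched as a set intersection over subject-predicate-object triples. The toy triples below are illustrative stand-ins for SemMedDB predications, not real extractions:

```python
# A minimal sketch of the string-based confounder search idea: candidate
# confounders are conditions the drug TREATS that are also known to
# CAUSE the adverse event. Toy triples, not real SemMedDB content.

knowledge = [
    ("drugX", "TREATS", "hypertension"),
    ("drugX", "TREATS", "migraine"),
    ("hypertension", "CAUSES", "stroke"),
    ("diabetes", "CAUSES", "stroke"),
]

def confounder_candidates(drug, outcome, triples):
    """Conditions linked to both the exposure and the outcome."""
    treated = {o for s, p, o in triples if s == drug and p == "TREATS"}
    causes = {s for s, p, o in triples if p == "CAUSES" and o == outcome}
    return treated & causes

candidates = confounder_candidates("drugX", "stroke", knowledge)
```

The semantic vector-based variant replaces this exact string matching with similarity search in a distributional embedding space, which the authors found more effective at reducing bias.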
Thomas Ferté, , Thierry Schaeverbeke, Thomas Barnetche, Vianney Jouhet, Boris P. Hejblum
Journal of Biomedical Informatics, Volume 117; doi:10.1016/j.jbi.2021.103746

Electronic Health Records (EHRs) often lack reliable annotation of patient medical conditions. PheNorm, an automated unsupervised algorithm to identify patient medical conditions from EHR data, has previously been developed. PheVis extends PheNorm to the visit resolution. PheVis combines diagnosis codes with medical concepts extracted from medical notes, incorporating past history in a machine learning approach to provide an interpretable parametric predictor of the occurrence probability of a given medical condition at each visit. PheVis is applied to two real-world use cases using the data warehouse of the University Hospital of Bordeaux: (i) rheumatoid arthritis, a chronic condition; (ii) tuberculosis, an acute condition. Cross-validated AUROCs were 0.943 [0.940; 0.945] and 0.987 [0.983; 0.990], respectively; cross-validated AUPRCs were 0.754 [0.744; 0.763] and 0.299 [0.198; 0.403]. PheVis performs well for chronic conditions, though the inability of French natural language processing tools to exclude past medical history limits its performance for acute conditions. It achieves significantly better performance than state-of-the-art unsupervised methods, especially for chronic diseases.
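"Incorporating past history" at the visit level is commonly done by carrying forward decayed counts of codes and concepts from earlier visits. The sketch below illustrates that general idea with an exponential decay; the decay constant and feature construction are illustrative, not PheVis's exact specification:

```python
# A sketch of folding past history into per-visit features via an
# exponentially decayed cumulative count of a code across visits.
# The decay constant here is illustrative.

def decayed_history(counts, decay=0.5):
    """Per-visit feature: current count plus decayed carry-over of past counts."""
    features, carry = [], 0.0
    for c in counts:
        carry = c + decay * carry
        features.append(carry)
    return features

# Diagnosis-code counts over four consecutive visits for one patient.
feats = decayed_history([1, 0, 2, 0], decay=0.5)
```

A chronic condition keeps contributing signal through the carry-over term even at visits where it is not re-coded, which is one reason visit-level phenotyping is easier for chronic than for acute conditions.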
Journal of Biomedical Informatics, Volume 117; doi:10.1016/s1532-0464(21)00134-9

Journal of Biomedical Informatics, Volume 117; doi:10.1016/s1532-0464(21)00135-0

Journal of Biomedical Informatics, Volume 117, pp 103751-103751; doi:10.1016/j.jbi.2021.103751

COVID-19 was first discovered in December 2019 and has continued to spread rapidly across countries worldwide, infecting millions of people. The virus is deadly, and people who suffer from prior illnesses or are older than 60 are at a higher risk of mortality. The medicine and healthcare industries have raced to find a cure, and various policies have been enacted to mitigate the spread of the virus. While Machine Learning (ML) methods have been widely used in other domains, there is now a high demand for ML-aided diagnosis systems for screening, tracking, and predicting the spread of COVID-19, as well as for finding a cure against it. In this paper, we chart the role ML has played so far in combating the virus, mainly from the screening, forecasting, and vaccine perspectives. We present a comprehensive survey of the ML algorithms and models that can be used on this expedition to aid in battling the virus.
, Dörthe Arndt, Jos De Roo, Erik Mannens
Journal of Biomedical Informatics, Volume 117; doi:10.1016/j.jbi.2021.103750

Clinical decision support systems assist physicians in providing care to patients. However, in the context of clinical pathway management such systems are rather limited, as they only take the current state of the patient into account and ignore the possible evolution of that state in the future. In the past decade, the availability of big data in the healthcare domain opened a new era for clinical decision support. Machine learning technologies are now widely used in the clinical domain, but mostly as tools for disease prediction. A tool that not only predicts future states but also enables adaptive clinical pathway management based on these predictions is still needed. This paper introduces weighted state transition logic, a logic for modeling state changes based on actions planned in clinical pathways. Weighted state transition logic extends linear logic by taking weights (numerical values indicating the quality of an action or an entire clinical pathway) into account. It allows us to predict the future states of a patient and enables adaptive clinical pathway management based on these predictions. We provide an implementation of weighted state transition logic using semantic web technologies, which makes it easy to integrate semantic data and rules as background knowledge. Executed by a semantic reasoner, it is possible to generate a clinical pathway towards a target state, as well as to detect potential conflicts between multiple coexisting pathways. The transitions from the current state to the predicted future state are traceable, which builds human users' trust in the generated pathway.
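Generating a pathway towards a target state under action weights can be pictured as a best-first search over patient states, where each applicable action transforms the state and accumulates its weight. The sketch below is a plain-Python analogue of that idea, not the paper's semantic-web implementation; the actions, preconditions, and weights are hypothetical:

```python
# A minimal sketch of weighted state transitions: each action maps a set
# of patient-state facts to a new set and carries a weight (here a cost,
# lower is better). Best-first search finds the cheapest pathway whose
# final state contains the goal. Actions and weights are hypothetical.
import heapq
import itertools

ACTIONS = [
    # (name, preconditions, effects to add, weight/cost)
    ("prescribe", {"diagnosed"}, {"treated"}, 1),
    ("monitor",   {"treated"},   {"stable"},  2),
    ("refer",     {"diagnosed"}, {"stable"},  5),
]

def best_pathway(start, goal):
    """Lowest-cost action sequence reaching a state that contains `goal`."""
    tie = itertools.count()  # tie-breaker so unorderable frozensets never compare
    heap = [(0, next(tie), [], frozenset(start))]
    seen = set()
    while heap:
        cost, _, path, state = heapq.heappop(heap)
        if goal <= state:
            return cost, path
        if state in seen:
            continue
        seen.add(state)
        for name, pre, add, w in ACTIONS:
            if pre <= state:  # action applicable in this state
                heapq.heappush(heap, (cost + w, next(tie),
                                      path + [name], frozenset(state | add)))
    return None

result = best_pathway({"diagnosed"}, {"stable"})
```

Note that the direct "refer" action (cost 5) is rejected in favor of the two-step "prescribe" then "monitor" pathway (total cost 3), illustrating how weights select among competing pathways to the same target state.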
Juan Zhao, Monika E. Grabowska, Vern Eric Kerchberger, Joshua C. Smith, H. Nur Eken, , , S. Trent Rosenbloom, Kevin B. Johnson,
Journal of Biomedical Informatics, Volume 117, pp 103748-103748; doi:10.1016/j.jbi.2021.103748

Identifying symptoms and characteristics highly specific to coronavirus disease 2019 (COVID-19) would improve the clinical and public health response to this pandemic challenge. Here, we describe a high-throughput approach – Concept-Wide Association Study (ConceptWAS) – that systematically scans a disease's clinical manifestations from clinical notes. We used this method to identify symptoms specific to COVID-19 early in the course of the pandemic. We created a natural language processing pipeline to extract concepts from clinical notes in a local EHR corresponding to the PCR testing date for patients who had a COVID-19 test, and evaluated these concepts as predictors of developing COVID-19. We identified predictors using Firth's logistic regression adjusted for age, gender, and race. We also performed ConceptWAS using cumulative data every two weeks to identify the timeline for recognition of early COVID-19-specific symptoms. We processed 87,753 notes from 19,692 patients subjected to COVID-19 PCR testing between March 8, 2020, and May 27, 2020 (1,483 COVID-19-positive). We found 68 concepts significantly associated with a positive COVID-19 test. We identified symptoms associated with increased risk of COVID-19, including "anosmia" (odds ratio [OR] = 4.97, 95% confidence interval [CI] = 3.21–7.50), "fever" (OR = 1.43, 95% CI = 1.28–1.59), "cough with fever" (OR = 2.29, 95% CI = 1.75–2.96), and "ageusia" (OR = 5.18, 95% CI = 3.02–8.58). Using ConceptWAS, we were able to detect loss of smell and loss of taste three weeks prior to their inclusion as symptoms of the disease by the Centers for Disease Control and Prevention (CDC). ConceptWAS, a high-throughput approach for exploring symptoms and characteristics specific to a disease like COVID-19, offers promise for EHR-powered early identification of disease manifestations.
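At its core, the scan tests each extracted concept for association with test positivity. The study used Firth's logistic regression with covariate adjustment; the sketch below is a deliberately simplified, unadjusted stand-in using a 2x2 odds ratio with a Woolf 95% confidence interval, on toy counts:

```python
# A simplified, unadjusted version of a per-concept association test:
# odds ratio with a Woolf 95% CI from a 2x2 table. The actual study
# used Firth's logistic regression adjusted for age, gender, and race.
# Counts below are toy numbers, not the study's data.
import math

def odds_ratio_ci(a, b, c, d):
    """OR and 95% CI for a 2x2 table:
    a = concept+/test+, b = concept+/test-,
    c = concept-/test+, d = concept-/test-."""
    or_ = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of log(OR)
    lo = math.exp(math.log(or_) - 1.96 * se)
    hi = math.exp(math.log(or_) + 1.96 * se)
    return or_, lo, hi

# Hypothetical counts for one concept, e.g. "anosmia".
or_, lo, hi = odds_ratio_ci(30, 20, 150, 500)
```

Repeating this over every extracted concept, with multiple-testing correction, yields the concept-wide scan; rerunning it on cumulative data every two weeks gives the recognition timeline described above.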
Yan Huang, Xiaojin Li,
Journal of Biomedical Informatics, Volume 117; doi:10.1016/j.jbi.2021.103744

Fast temporal query on large EHR-derived data sources presents an emerging big data challenge, as this query modality is intractable using conventional strategies that have not focused on addressing Covid-19-related research needs at scale. We introduce a novel approach called Event-level Inverted Index (ELII) to optimize time trade-offs between one-time batch preprocessing and subsequent open-ended, user-specified temporal queries. An experimental temporal query engine has been implemented in a NoSQL database using our new ELII strategy. Near-real-time performance was achieved on a large Covid-19 EHR dataset, with 1.3 million unique patients and 3.76 billion records. We evaluated the performance of ELII on several types of queries: classical (non-temporal), absolute temporal, and relative temporal. Our experimental results indicate that ELII accomplished these queries in seconds, achieving average speed accelerations of 26.8 times on relative temporal query, 88.6 times on absolute temporal query, and 1037.6 times on classical query compared to a baseline approach without using ELII. Our study suggests that ELII is a promising approach supporting fast temporal query, an important mode of cohort development for Covid-19 studies.
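The inverted-index idea can be sketched as mapping each clinical concept to a time-sorted list of events, so an absolute temporal query becomes a binary-search range scan rather than a full table scan. The class and data below are a toy illustration of that principle, not the paper's NoSQL implementation:

```python
# A minimal sketch of an event-level inverted index: each concept maps
# to a time-sorted list of (timestamp, patient) events, so a temporal
# range query is a binary search plus a slice. Data is toy.
import bisect
from collections import defaultdict

class EventIndex:
    def __init__(self):
        self.index = defaultdict(list)  # concept -> sorted [(ts, patient)]

    def add(self, concept, ts, patient):
        bisect.insort(self.index[concept], (ts, patient))

    def patients_between(self, concept, t0, t1):
        """Patients with `concept` recorded in the window [t0, t1]."""
        events = self.index[concept]
        lo = bisect.bisect_left(events, (t0, ""))
        hi = bisect.bisect_right(events, (t1, chr(0x10FFFF)))
        return {p for _, p in events[lo:hi]}

idx = EventIndex()
idx.add("covid_pcr_pos", 5, "p1")
idx.add("covid_pcr_pos", 9, "p2")
idx.add("covid_pcr_pos", 20, "p3")
hits = idx.patients_between("covid_pcr_pos", 1, 10)
```

The one-time cost of building and sorting the index is what the paper trades against subsequent open-ended queries: once built, each temporal lookup touches only the matching slice of events.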
Journal of Biomedical Informatics, Volume 117; doi:10.1016/j.jbi.2021.103763

Machine learning methodologies are gaining popularity for developing medical prediction models for datasets with a large number of predictors, particularly in the setting of clustered and longitudinal data. Binary Mixed Model (BiMM) forest is a promising machine learning algorithm which may be applied to develop prediction models for clustered and longitudinal binary outcomes. Although machine learning methods for clustered and longitudinal data such as BiMM forest exist, feature selection for them has not been analyzed via data simulations. Feature selection improves the practicality and ease of use of prediction models for clinicians by reducing the burden of data collection. Thus, feature selection procedures are not only beneficial, but are often necessary for development of medical prediction models. In this study, we aim to assess feature selection within the BiMM forest setting for modeling clustered and longitudinal binary outcomes. We conducted a simulation study to compare BiMM forest with feature selection (backward elimination or stepwise selection) to standard generalized linear mixed model feature selection methods (shrinkage and backward elimination). We also evaluated feature selection methods to develop models predicting mobility disability in older adults using the Health, Aging and Body Composition Study dataset as an example utilization of the proposed methodology. BiMM forest with backward elimination generally offered high computational efficiency, similar or higher predictive performance (accuracy and area under the receiver operating characteristic curve), and similar or higher ability to identify correct features compared to linear methods for the different simulated scenarios. For predicting mobility disability in older adults, methods generally performed similarly in terms of accuracy, area under the receiver operating characteristic curve, and specificity; however, BiMM forest with backward elimination had the highest sensitivity.
This study is novel because it is the first investigation of feature selection for developing random forest prediction models for clustered and longitudinal binary outcomes. Results from the simulation study reveal that BiMM forest with backward elimination has the highest accuracy (performance and identification of correct features) and lowest computation time compared to other feature selection methods in some scenarios, and similar performance in others. Many informatics datasets have clustered and longitudinal outcomes, and results from this study suggest that BiMM forest with backward elimination may be beneficial for developing medical prediction models.
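Backward elimination itself is model-agnostic: fit on all features, drop the feature whose removal least harms (or most improves) a validation score, and repeat until any further removal degrades it. The sketch below shows that greedy loop with a toy scoring function standing in for refitting a BiMM forest:

```python
# Schematic backward elimination. `score(feats)` stands in for fitting
# the model (e.g. a BiMM forest) on a feature subset and returning a
# validation metric (higher is better). The toy score is illustrative.

def backward_eliminate(features, score):
    """Greedily drop features while the score does not degrade."""
    current = list(features)
    best = score(current)
    while len(current) > 1:
        trials = [(score([f for f in current if f != drop]), drop)
                  for drop in current]
        trial_best, drop = max(trials)
        if trial_best < best:   # removing anything now hurts: stop
            break
        best = trial_best
        current = [f for f in current if f != drop]
    return current, best

# Toy score: features "a" and "b" carry signal, the rest add noise.
def toy_score(feats):
    return sum({"a": 2.0, "b": 1.5}.get(f, -0.1) for f in feats)

selected, final = backward_eliminate(["a", "b", "c", "d"], toy_score)
```

The data-collection benefit noted above comes from the end state: only the surviving features need to be gathered when the model is deployed clinically.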
, Jane S. Kim, Chen Yeh, Jungwha Lee, Kevin J. O'Leary
Journal of Biomedical Informatics, Volume 117; doi:10.1016/j.jbi.2021.103749

Secure mobile communication technologies are being implemented at an increasing rate across health care organizations, though providers’ use of these tools can remain limited by a perceived lack of other users to communicate with. Enabling acceptance and driving provider utilization of these tools throughout an organization requires attention to the interplay between perceived peer usage (i.e. perceived critical mass) and local user needs within the social context of the care team (e.g. inpatient nursing access to the mobile app). To explain these influences, we developed and tested a consolidated model that explains how mobile health care communication technology acceptance and utilization are influenced by the moderating effects of social context on perceptions about the technology. The theoretical model and questionnaire were derived from selected technology acceptance models and frameworks. Survey respondents (n = 1,254) completed items measuring perceived critical mass, perceived usefulness, perceived ease of use, personal innovativeness in information technology, behavioral intent, and actual use of the Vocera communication platform. Actual use was additionally measured by logged usage data. Use group was defined as whether a hospital’s nurses had access to the tool (expanded use group) or not (limited use group). The model accounted for 61% and 72% of the variance in intent to use the communication tool in the limited and expanded use groups, respectively, which in turn accounted for 53% and 33% of actual use. The total effects coefficient of perceived critical mass on behavioral intent was 0.57 in the limited use group (95% CI 0.51 – 0.63) and 0.70 in the expanded use group (95% CI 0.61 – 0.80). Our model fit the data well and explained the majority of variance in acceptance of the tool amongst participants. The overall influence of perceived critical mass on intent to use the tool was similarly large in both groups. 
However, the strength of multiple model pathways varied unexpectedly by use group, suggesting that combining sociotechnical moderators with traditional technology acceptance models may produce greater insights than traditional models alone. Practically, our results suggest that healthcare institutions can drive acceptance by promoting the recruitment of early adopters through liberal access policies and by making these users and the technology highly visible to others.