Aggregating Twitter Text through Generalized Linear Regression Models for Tweet Popularity Prediction and Automatic Topic Classification

Open Access

26 November 2021

journal article
research article
Published by MDPI AG in European Journal of Investigation in Health, Psychology and Education

Vol. 11 (4), 1537-1554
https://doi.org/10.3390/ejihpe11040109

Abstract

Social media platforms have become accessible resources for health data analysis. However, the advanced computational techniques involved in big data text mining and analysis are challenging for public health data analysts to apply. This study proposes, and explores the feasibility of, a novel yet straightforward method that regresses the outcome of interest on aggregated influence scores, using generalized linear models for association and/or classification analyses. The method reduces the document-term matrix by transforming text data into a continuous summary score, which substantially lowers the data dimension and eases the sparsity of the term matrix. To illustrate the proposed method step by step, we used three Twitter datasets on different topics: autism spectrum disorder, influenza, and violence against women. We found that our results were generally consistent with the critical factors that the existing literature associates with each public health topic. The proposed method could also classify tweets into topic groups appropriately, with performance consistent with that of existing text mining methods for automatic classification based on tweet content.
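The general idea of collapsing a document-term matrix into one continuous score and then fitting a generalized linear model on that score can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the abstract does not define how the influence scores are computed, so the smoothed log-odds weighting in `term_influence`, the toy tweets, and every function name below are hypothetical stand-ins rather than the authors' procedure.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase whitespace tokenizer; keeps purely alphabetic tokens only.
    return [w for w in text.lower().split() if w.isalpha()]


def term_influence(docs, labels, smoothing=1.0):
    # ASSUMPTION: per-term "influence" is the smoothed log-odds of a term
    # appearing in class-1 tweets versus class-0 tweets. The abstract does
    # not specify the weighting; this is a hypothetical stand-in.
    pos, neg = Counter(), Counter()
    for doc, y in zip(docs, labels):
        for term in set(tokenize(doc)):
            (pos if y == 1 else neg)[term] += 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return {
        t: math.log((pos[t] + smoothing) / (n_pos + 2 * smoothing))
        - math.log((neg[t] + smoothing) / (n_neg + 2 * smoothing))
        for t in set(pos) | set(neg)
    }


def aggregate_score(doc, influence):
    # One continuous summary score per tweet: the sum of its term influences.
    return sum(influence.get(t, 0.0) for t in tokenize(doc))


def fit_logistic(x, y, lr=0.1, epochs=2000):
    # Binomial GLM with a logit link and a single predictor, fitted by
    # plain gradient ascent on the mean log-likelihood.
    b0 = b1 = 0.0
    n = len(x)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            g0 += yi - p
            g1 += (yi - p) * xi
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1


# Hypothetical toy tweets: label 1 = influenza topic, label 0 = autism topic.
docs = [
    "flu shot clinic open today",
    "got my flu vaccine",
    "fever and cough flu season",
    "autism awareness walk",
    "autism support group meeting",
    "new autism research study",
]
labels = [1, 1, 1, 0, 0, 0]

influence = term_influence(docs, labels)
scores = [aggregate_score(d, influence) for d in docs]
b0, b1 = fit_logistic(scores, labels)
predicted = [
    1 if 1.0 / (1.0 + math.exp(-(b0 + b1 * s))) > 0.5 else 0 for s in scores
]
```

The model sees a single predictor per tweet regardless of vocabulary size, which is the dimension reduction the abstract describes. Because this sketch estimates the term influences and fits the regression on the same six tweets, its in-sample fit is optimistic; a real analysis would estimate the weights and evaluate the classifier on separate data.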
