Using Application Programming Interfaces to Access Google Data for Health Research: Protocol for a Methodological Framework

Abstract
Journal of Medical Internet Research. Preprint: this document has not been peer-reviewed and may change before final publication.

Background: Individuals are increasingly turning to search engines like Google to obtain health information and access resources. Analysis of Google search queries offers a novel approach to understanding, in real or near-real time, the sexual and reproductive health concerns and needs of populations. While searches have been examined predominantly with the Google Trends tool, newer application programming interfaces (APIs) are now available to academics and allow them to build a richer, more systematic picture of searches. These APIs allow users to write code in languages like Python to retrieve sample data directly from Google servers.

Objective: The purpose of this paper is to describe a protocol for the analysis of Google searches obtained from three Google APIs. We empirically tested the protocol and verified its usefulness by comparing search traffic on abortion and birth control in 2017 in the United States (US) and Mississippi (MS).

Methods: We used the Google Trends API, the Google Health Trends (also referred to as Flu Trends) API, and the Google Custom Search API to obtain search data from Google using Python version 2.7.13.
Our simulation protocol consisted of four steps: i) developing a master list of top search queries for abortion and for birth control using the publicly available Google Trends API; ii) gathering information on relative search volume using the private Health Trends API; iii) determining the most popular sites using the publicly available Custom Search API; and iv) calculating estimated total search volume for abortion and for birth control. Two programmers working independently achieved similar results, with insignificant variation attributable to sampling variability.

Results: The simulation successfully obtained the top search queries, relative search volume, and estimated total search volume for both locations during 2017. We were able to overcome the inherent limitations of the datasets by adding Planned Parenthood Federation of America website data from 2017 as a baseline for the estimated search volume calculations. Nonetheless, we were only able to gain access to the most popular national websites associated with the top queries, and we propose the use of Google Consumer Surveys to supplement API-generated data at the state level.

Conclusions: The methodology proposed in this paper combines data from multiple Google APIs and provides the thorough documentation required to systematically identify top search queries and websites, as well as to estimate the relative and total search volume of queries in real or near-real time in specific locations. This allows other researchers to replicate the methods used and to advance our understanding of population-level reproductive health concerns.
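
The abstract does not state the formula used in step iv. One plausible reading, offered here only as a minimal sketch, is proportional scaling: if the absolute count for one baseline query is known from an outside source, every relative search volume can be multiplied by the same factor. The function name, its parameters, and all numbers below are hypothetical and are not taken from the study.

```python
def estimate_total_volume(relative_volume, baseline_term, baseline_count):
    """Scale relative search volumes to estimated absolute counts.

    Assumption (not stated in the abstract): estimates are proportional
    to relative volume, anchored on one query with a known count.

    relative_volume: dict mapping query -> relative search volume,
        all expressed in the same unit.
    baseline_term: query whose absolute count is known externally.
    baseline_count: that known absolute count.
    """
    baseline_relative = relative_volume[baseline_term]
    if baseline_relative <= 0:
        raise ValueError("baseline term needs a positive relative volume")
    # One scale factor converts every relative value to an estimated count.
    scale = baseline_count / baseline_relative
    return {query: value * scale for query, value in relative_volume.items()}


# Placeholder numbers for illustration only; they are not study results.
example_relative = {"baseline query": 50.0, "query a": 100.0, "query b": 25.0}
example_estimates = estimate_total_volume(example_relative, "baseline query", 1000)
```

Under this assumption, a query with twice the baseline's relative volume receives twice the baseline's estimated count, so the quality of every estimate depends on how well the baseline count is known.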