Welcome to the inaugural edition of On the Dot: The Global.health Newsletter!
Following our public launch, this is the first in a series of periodic updates on our product and dataset, occasional in-depth overviews of our platform and processes, perspectives on the current state of the epi-data landscape, and previews of what the G.h team is working on next.
G.h in the News
Global.health was featured by The Rockefeller Foundation in its announcement of key strategic investments and collaborations in support of the new Pandemic Prevention Institute. Our team is proud to join forces with so many distinguished organizations to improve pandemic preparedness and prevention through trusted data-sharing. We look forward to deepening these partnerships and sharing more news in the months ahead.
G.h co-founders Sam Scarpino and Moritz Kraemer spoke with Bloomberg News about emerging COVID-19 variants and genomic sequencing sampling rates.
ICYMI, you can find a collection of news articles and interviews from our launch and an archive of past coverage on our new Press page.
Database and Product Updates
The G.h COVID-19 line-list database now includes more than 30 million records from over 130 countries: an increase of 20M cases since our launch. This includes an additional 2.2M cases from Argentina, 1.3M cases from Peru, 2.3M cases from Colombia, and ~15M cases from the United States. (An updated list of data sources is available here.)
To accompany these new data, the G.h team has prepared two Data Deep Dives to provide additional information about the provenance of the data, level of detail, completeness of different metadata fields, and how these compare across countries. Our first deep dives focus on Colombia and Peru, with more to come soon.
Major investments in our backend have enabled us to ingest and process these significant additions to our database. We’ve reconfigured several parsers and migrated to a new infrastructure to scale and accelerate our data ingestion workflows. This new infrastructure allows us to import large datasets with millions of records. We now have the capacity to ingest over 8 million records in a day!
At the front (end) of the house, we’re excited to introduce enhancements to both our G.h Data and G.h Map applications, which you can use to explore our growing database.
On the Data app, you can now Filter using a redesigned UI module, and Sort key categories in ascending or descending order. We’ve also updated our Data Guide with helpful details on how best to utilize our improved Filter, Sort, and Search capabilities, which you can combine to further refine your queries.
The G.h Map application also features a streamlined design. Across the board, we’ve refreshed our choropleth color schemes to increase contrast and readability. On Regional View, we’ve replaced 3D extrusions with color-coded and size-relative circles to indicate the availability of line-list data at the admin level (larger, darker circles indicate more cases).
“Under the Hood” of the G.h Data Engine | Gal Wachtel, Anya Lindström Battle, and Felix Jackson on behalf of the Global.health team
In this post, we share our data journey, from selection by our team of researchers to the front-end visualizations you can see on the platform. We will discuss how our global data is collected, vetted, cleaned, and shared securely.
Data collection begins with our global team of researchers, who identify the most reliable source of reporting per country. We only include line-list data, meaning data where every row represents a single case rather than an aggregate count. We prioritize using official government counts where possible and fall back on other academic or independent reporting where official reporting is not available. Data are only included from sources that permit their dissemination through third parties. To keep our platform as transparent as possible, every row of data is linked back to its original source so that its origin can always be traced.
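For readers who prefer code, here is a minimal sketch of the difference between line-list data and aggregate counts. The `CaseRow` class, its field names, and the example.org URLs are hypothetical and are not taken from the G.h codebase; the sketch only assumes that each row describes one case and carries a link to its source.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseRow:
    """One row of a line list: a single case (hypothetical fields)."""
    country: str
    confirmation_date: str  # ISO 8601 date string
    source_url: str         # link back to the original source


# A tiny illustrative line list: three rows, three individual cases.
line_list = [
    CaseRow("Colombia", "2021-05-01", "https://example.org/source-a"),
    CaseRow("Colombia", "2021-05-01", "https://example.org/source-a"),
    CaseRow("Peru", "2021-05-02", "https://example.org/source-b"),
]


def aggregate(rows):
    """Collapse a line list into counts per (country, date)."""
    return Counter((row.country, row.confirmation_date) for row in rows)


counts = aggregate(line_list)
```

An aggregate count can always be derived from a line list this way, but the reverse is not possible: once rows are collapsed, the per-case details and source links are gone.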
Once a source has been vetted by at least two members of the team, our engineers work to clean the data. We know data collection during an emerging crisis is error-prone, so all data undergo rigorous cleaning and validation. We look out for anomalous dates or data types (we’ve even found a few patients born in the 1800s!). At this stage, we remove all personally identifiable information, such as a patient’s home address or patient ID, to minimize the risk of reidentification. Through this process, we ensure that all data sources are transformed into a universal schema so that analysis is as seamless as possible. This schema includes fields for location, age, sex, and symptoms, amongst others. (Feel free to look at our code on GitHub!)
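As a rough illustration of the cleaning steps described above, here is a minimal sketch and not our production parser code. The function name `clean_record`, the raw field names, and the 1900 cutoff for plausible birth dates are all assumptions made for this example; only the four output fields (location, age, sex, symptoms) come from the schema mentioned above.

```python
from datetime import date
from typing import Optional

# Assumed cutoff for this sketch: earlier birth dates are treated as anomalous.
EARLIEST_PLAUSIBLE_BIRTH = date(1900, 1, 1)


def clean_record(raw: dict, reference_date: date) -> Optional[dict]:
    """Validate one raw record and map it onto a small common schema.

    Returns None when the birth date is implausible, so the record can be
    set aside for review instead of being silently ingested.
    """
    birth = raw.get("date_of_birth")
    age = None
    if birth is not None:
        if not (EARLIEST_PLAUSIBLE_BIRTH <= birth <= reference_date):
            return None
        # Age in whole years: subtract one if the birthday has not yet
        # occurred in the reference year.
        had_birthday = (reference_date.month, reference_date.day) >= (
            birth.month,
            birth.day,
        )
        age = reference_date.year - birth.year - (0 if had_birthday else 1)

    # Only schema fields are copied, so identifying fields such as a home
    # address or patient ID never reach the cleaned record.
    return {
        "location": raw.get("location"),
        "age": age,
        "sex": raw.get("sex"),
        "symptoms": list(raw.get("symptoms", [])),
    }


raw_example = {
    "patient_id": "X1",
    "home_address": "1 Main St",
    "location": "Lima, Peru",
    "date_of_birth": date(1980, 6, 15),
    "sex": "female",
    "symptoms": ["cough"],
}
cleaned = clean_record(raw_example, date(2021, 6, 1))
```

Copying only the known schema fields, rather than deleting a list of sensitive ones, means that an unexpected identifying column in a new source is dropped by default.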
Our data are then hosted in a secure environment on AWS. From there, they are fed into our publicly available platform. The Global.health interactive map is open to all, and anyone can register to access and download the underlying row-level data.
A few important notes about the data itself: data are updated daily, and we only include cases that are listed as confirmed. However, different countries (and sometimes even sub-regions within a country) have varying policies on when to register a case as confirmed. When working with cases from different regions, we recommend looking at the local guidelines for case reporting. Similarly, we recognize that testing varies immensely between locations and recommend considering this when conducting downstream analyses.
We are currently working on building out our platform’s capabilities, including deduplication of data, additional privacy measures, and new visualizations. We are always keen to hear about new data sources to add to our platform, to answer questions, and to receive feedback, so please let us know your thoughts as you explore Global.health!
News and Notes
Get Involved! Are you interested in leading or collaborating on a research paper? Do you have unique data to contribute to our platform, or want to participate in user interviews to inform our product development roadmap? We’d love to hear from you! Fill out this brief Google Form and we’ll be in touch.
Welcome aboard to Jim Sheldon, our newest Senior Software Engineer based at Northeastern University! A proof point of both his technical acumen and impeccable music taste, his favorite Daft Punk song is “Robot Rock”. Get to know more about Jim and the G.h team on our About page.
What We’re Reading
A quick round-up of articles, research, and resources our team has been reading, tweeting, and discussing:
Why some researchers oppose unrestricted sharing of coronavirus genome data
Data Visualization for the Understanding of COVID-19 | IEEE Journals & Magazine
Main Report - The Independent Panel for Pandemic Preparedness and Response
The Forever Virus: A Strategy for the Long Fight Against COVID-19
Early epidemiological signatures of novel SARS-CoV-2 variants: establishment of B.1.617.2 in England
Coming Soon
Our team is already hard at work on the next phase of feature development and additions to our database. Look for exciting updates on filtered data downloads, an improved login experience, collaborations and integrations, and new data sources in future issues of On the Dot.
We hope you find these updates useful and informative. Our team welcomes your feedback, questions, and ideas. Don’t hesitate to get in touch via email, Twitter, or LinkedIn.
Take care and thank you for supporting our mission to advance the state of open public health data!
Until the next edition,
The G.h Team