Tracking the 2022 Monkeypox Outbreak: Challenges in Data Curation

Jun 24, 2022

“All data are created; data never simply exist.”

We create data. Lots of data. The Global.health COVID-19 dataset now contains detailed information on over 100 million individual, de-identified cases from over 100 countries, including 60+ fields of metadata, comprising over 1 billion unique data points. COVID-19 provided a proof of concept for the system design, infrastructure, and standards needed to build and maintain a global infectious disease repository and visualization platform with a pathogen-agnostic model. Monkeypox (MPX) is a live test of that system in support of global surveillance efforts. This is an opportunity to demonstrate the value and utility of the G.h system and to showcase our ability to quickly shift focus to an emerging or re-emerging pathogen.

Our international network of curators is dedicated to tracking the 2022 MPX outbreak around the clock, creating, organizing, and maintaining line-list data that decision makers, researchers, and the public can use in real time through an open-access database (under a CC BY 4.0 license), a daily situation report, and a map visualization. As of June 24, 2022, G.h reports at least 3,503 confirmed and 115 suspected MPX cases in 59 countries (see figure below). Our data are trusted, granular, and global, and have become the de facto standard for open-source MPX data. G.h data are being used and referenced by other organizations for visualization, forecasting, and analysis [e.g. Our World in Data, Metaculus, Monkeypox Meter, Nature News].

Figure: Cumulative number of confirmed cases and number of countries that have reported confirmed cases. Monkeypox.global.health

This blog post highlights some of the ongoing curation challenges associated with building an emerging disease outbreak dataset in real time.

We aim for timeliness in reporting and accuracy of data. Curation, especially early in the outbreak, is a manual, labor-intensive process. Data are often less structured and move at differing speeds with unpredictable frequency. The G.h sourcing strategy helps us to identify, vet, and ingest a wide range of MPX data in a timely manner. Data are de-identified to protect privacy, discourage stigma, and support equitable health research. Timeliness can come at the cost of accuracy and completeness. We aim to maximize metadata, but can be limited by inconsistent, aggregated, or missing case information (e.g. demographics, location, exposure histories, symptoms, travel, testing); case details evolve, and our team updates them as new information becomes available.

Let’s highlight some reporting challenges with Spain, the country with the second-largest case count (reporting 736 confirmed cases as of 2022-06-23). Currently, we are observing major reporting delays and data gaps in confirmed and suspected cases between Spain’s official health alerts and the news media, especially for the Madrid region; these delays and discrepancies make it a challenge to track the current scope of the outbreak in real time.

Also, media reports may provide case counts only at the country level, without more granular detail. This makes it a challenge to disentangle suspected, confirmed, and discarded cases, specifically when large sums or increases are presented at the country level, without the geographic granularity needed for integral tracking and meaningful epidemiological analyses. Our team is sometimes able to gather more detail from local reporting, and to deduce through experience and media familiarity which cases are being referenced and what updates should be made to line-list data, but these updates are time- and resource-intensive and subject to human error. The assumptions we make along the way may compromise the accuracy of the data.

Additionally, we often see media reports that give the total number of samples tested but do not consistently include details on when the samples were tested, how many tested positive or negative, how many suspected cases remain, or where the cases are located. To account for these gaps, our team makes assumptions (for location, status, and other data) so that the case counts in our database match official reports. While we have used Spain as an example, reporting challenges are a universal problem, observed for almost every country.
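To make the idea of matching case counts to official reports concrete, here is a minimal sketch in Python. It is an illustration only, not the G.h pipeline: the field names (`country`, `status`), the sample rows, the official totals, and the `count_gaps` helper are all hypothetical. It counts line-list cases by status for one country and reports how far each count is from the official total.

```python
from collections import Counter

# Hypothetical line-list rows; field names are illustrative, not the G.h schema.
line_list = [
    {"id": "ES-1", "country": "Spain", "status": "confirmed"},
    {"id": "ES-2", "country": "Spain", "status": "confirmed"},
    {"id": "ES-3", "country": "Spain", "status": "suspected"},
    {"id": "ES-4", "country": "Spain", "status": "discarded"},
]

# Hypothetical totals, as they might appear in an official report.
official = {"confirmed": 3, "suspected": 1}


def count_gaps(rows, country, official_totals):
    """Return (official total - line-list count) for each reported status."""
    counts = Counter(r["status"] for r in rows if r["country"] == country)
    return {
        status: total - counts[status]
        for status, total in official_totals.items()
    }


gaps = count_gaps(line_list, "Spain", official)
print(gaps)  # {'confirmed': 1, 'suspected': 0}
```

A positive gap means the official report lists more cases than the line list holds, so a curator would need to add rows (often with assumed location or status); a negative gap would suggest duplicates or cases that were later discarded.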

As you can see, this process can be quite messy and complicated, which is why each curator becomes familiar with the reporting intricacies of a particular country or media source. Further, there are limitations to workload, and humans make mistakes. Our accuracy is tied to our transparency: we link back to the original media source for every single case. We support a transparent, crowdsourced approach to sharing epidemiological information across borders and welcome meaningful collaboration and fact checks through GitHub “to put the public back in public health.” We thank our user community for the many helpful contributions we have received through our GitHub and email [info@global.health].

We aim to create standardized data. As case definitions were developed and refined, the WHO provided more detail for country reporting requirements. We can now adapt our schema to capture details from the WHO’s case investigation form and minimum dataset case reporting form. The G.h system is flexible and scalable to capture minimum case data plus relevant pathogen- or event-specific context.

We aim to complement existing systems. The G.h dataset is sustainable and complementary to other public health datasets. Both the CDC and WHO, internationally recognized public health leaders, present aggregate data with time delays; G.h line-list data helps to fill the reporting gap.

“Our work attempts to harmonize information across countries and provide additional data to support the epidemiological understanding of the origins and transmission dynamics of this outbreak.” - G.h co-founder Moritz Kraemer, Lancet Infectious Diseases

Until the next post,

The G.h Team

On the Dot: The Global.health Newsletter
