DREAM completed: Scientists use massive data from March of Dimes repository for discovery

July 24, 2023

Scientists at the March of Dimes Prematurity Research Center (PRC) at the University of California San Francisco (UCSF) have submitted a paper detailing their DREAM Challenge, in which computational scientists from all over the world used publicly available microbiome data from across the PRC network and beyond to build models that predict preterm birth. The effort is a testament to the organization’s ardent belief in open science.

The challenge, which took place last summer and culminated in a trip to the DREAM Conference in Las Vegas for the winning teams in the fall of 2022, entailed using vaginal microbiome data from March of Dimes PRCs—Stanford, UCSF, Imperial College London, and University of Pennsylvania—as well as several other datasets that are part of the March of Dimes Database for Preterm Birth Research.  

The public-facing database has been amassing data since it was established in 2015 by Dr. Marina Sirota, an associate professor in the Bakar Computational Health Sciences Institute and the Department of Pediatrics at UCSF, together with several colleagues. Dr. Sirota is also the principal investigator at the March of Dimes PRC at UCSF.

A vault of every piece of molecular data that has come out of a PRC, the database comprises 68 studies with more than 40,000 experimental samples from nearly 30,000 participants and more than 30 types of measurements.

It includes genomic, transcriptomic, immunological, microbiome, and other data that’s available to the scientific community at large and accessible with a click of a mouse.  

To date, the data has been downloaded more than 5,400 times and has enabled a number of research studies, including those for identifying new biomarkers and therapeutic strategies for preterm birth.  

The database, co-directed by Dr. Sirota and Dr. Tomiko Oskotsky, a senior research scientist in Dr. Sirota’s lab at UCSF, is the only public multi-omic data repository for preterm birth, and underscores March of Dimes’ commitment to open science, collaboration, and accelerating the pace of research progress in the field of preterm birth.  

“What drives March of Dimes research is not copyright, patent, or financial gain,” said March of Dimes Chief Scientific Officer Dr. Emre Seli. “It’s the ability to innovate, discover, and bring life-changing therapies to moms and babies in a smart and efficient manner.”  

“And the best way to do that is to open our data bank to doctors, scientists, and machine learning experts all over the world and tap into their collective brilliance to solve problems.”  

“That’s why we’ve been doing this for nearly a decade already—because together, we believe we can go far.”  

The repository, which lives online perpetually, also ensures that the benefits of a research project are not limited to its immediate conclusions or a specific time frame. For example, data (often in the form of analytic findings from samples like blood, urine, or tissue) collected during a study period is accessible for years to come in the database, whereas it would be hard to access after a study concluded in a more traditional setting.  

The database also makes it easier to do validation studies, which replicate a scientific study in a larger and more diverse population to ensure that its conclusions can be generalized to the wider population. The March of Dimes data repository makes validation easier because it contains data from a wide array of samples collected at different time periods. These can be used to test and validate a novel diagnostic approach quickly, without the time and resources needed to organize a large scientific study for sample collection.

But the most exciting and powerful aspect of the biorepository is collaboration: making data available to experts outside March of Dimes greatly increases the potential for discovery.

To this end, the PRC at UCSF has been involved in organizing several DREAM Challenges, which are open science competitions that seek to advance understanding of biology and disease. The competitions are open to biomedical researchers from academia and industry all over the world in hopes of crowdsourcing answers to some of science’s most puzzling questions and spurring innovation through collaboration.  

There have been numerous DREAM Challenges across the biomedical spectrum over the years, and the UCSF and Stanford PRCs were involved in organizing one other challenge in the past.

In 2019, members of the two PRCs were involved in a challenge that sought to create models predicting two things from a pregnant woman’s blood sample: the gestational age of her baby at the time of the blood draw and the mom’s risk of spontaneous preterm birth later in pregnancy. The challenge, which focused on gene and protein expression, was led by a Wayne State University team.

In the latest challenge, held during the summer of 2022 and based on vaginal microbiome data, machine learning experts all over the world submitted models that predicted risk for early preterm birth and preterm birth. The winning models were validated in silico, and the PRC team at UCSF has shared its findings and learnings in a recent preprint (the manuscript is currently under review).

The paper, which is a collaborative effort with over 50 authors, details how investigators used vaginal microbiome data to develop two predictive models: one for preterm birth (births before 37 weeks of pregnancy) and another for early preterm birth (births before 32 weeks of pregnancy).

The winning teams (one from Italy, one from Korea, and two from the U.S.) created fairly accurate predictive models. The early preterm birth model had 87% accuracy and the preterm birth model had 69% accuracy, both scientifically significant scores.

The model creation is only part of the impressive work. Aggregating the data in a way that makes it useful for machine learning experts is another. UCSF scientists first had to gather data from nine smaller studies on the topic and combine them so that the resulting dataset, comprising more than 3,500 samples from more than 1,200 pregnant individuals, was usable. This was difficult because the studies investigated different regions of the same gene. (The gene, known as 16S, is commonly used to identify different species of bacteria in the microbiome.) As a result, the data were coded differently.

With so much variability in the data, PRC scientists had to figure out how to harmonize it, that is, make it comparable across studies; otherwise they’d be comparing apples to oranges, and the data would be unworkable.

To get around this, the PRC team leveraged an open-source tool called MaLiAmPi, developed by a close collaborator, Dr. Jonathan Golob, an assistant professor of Internal Medicine in the Division of Infectious Diseases at the University of Michigan. His tool harmonized the data, which has been placed in a visual atlas, the topic of a separate paper that’s also close to publication (a preprint is available).

“By bringing together disparate pieces of data and unifying it all into one comprehensive dataset, which represents one of the largest and most geographically diverse encyclopedias of the vaginal microbiome in pregnancy, we were able to set the stage for machine learning experts to make reliable predictive models on the vaginal microbiome and preterm birth,” said Dr. Golob.

Dr. Oskotsky said the effort has resulted in vital learnings for those studying preterm birth.  

“This work has allowed us to make some definitive statements about the vaginal microbiome and preterm birth: first, the makeup of the vaginal microbiome is a risk factor for preterm birth; and second, we now have a predictive tool that can potentially identify women at risk.”  

Dr. Sirota said that discussions are already underway to gauge whether the models could be ready for clinical introduction in the future.  

She said her team, in collaboration with Dr. Golob, Stanford PRC researcher Dr. Nima Aghaeepour, and others, is continuing to strengthen and improve the models by integrating more data, such as clinical and genetic risk factors and markers of inflammation. In this way, the models may be able to predict risk for many more kinds of preterm birth, not just those related to the microbiome.

She also said she’s passionate about tapping into the repository data to make novel diagnostic and therapeutic predictions about preterm birth to improve the lives of moms and babies globally, as well as enabling experts around the world to do the same with the goal of speeding the pace of discovery in the field—together.