KI continues its Where Are They Now? Series by checking in with the University of South Dakota’s Paula Mabee nearly ten years after her first KI event and eight months after her last. In February 2020, Mabee will become the Chief Scientist and Observatory Director of the National Ecological Observatory Network (NEON), which collects long-term ecological data to better understand how ecosystems are changing across the United States.
Scientists have long used physical traits to shed light on evolutionary relationships. Sometimes this has worked, like using lactation to group mammals. And, sometimes, it hasn’t: people once thought bats were featherless birds! Today, evolutionary biologists rely more heavily on molecular-based phylogenies to resolve relatedness. What they haven’t been able to do is address big-picture questions about how observable traits (phenotypes) evolve across higher taxa.
It’s not for lack of data. Millions of papers have been published that describe the phenotypic characteristics of known species. The problem is the data itself. For example, ten years ago if you wanted to ask “How did the loss of fins in bony fishes evolve over time?” you might spend the rest of your career searching the published literature describing fins — or their absence — so that you could do the analysis.
“There was no way to get that information because it had to come from many different studies published in different places at different times and represented in different ways,” says Paula Mabee, Ph.D., who has been a biology professor at the University of South Dakota since 1997.
Mabee and her colleagues have worked for the last decade on changing that. They began by creating a semantic knowledgebase of phenotypes and developed new tools to analyze the data. Now, they are on the verge of realizing a game-changing innovation: the use of AI to automate phenotype data acquisition from biodiverse species images.
Bringing Order to Data Chaos
In 2018, Mabee and her colleagues published a paper in Systematic Biology on the loss of paired fins in bony fishes. She calls it a test case for combining data integration and high-performance computing. The publication was the culmination of years of work that began when they created the Phenoscape Knowledgebase (KB), a project supported by an award from the National Science Foundation (NSF). In 2011, they introduced the KB, the first-of-its-kind semantic resource of computable phenotypes for studies of evolution and genetics. “We created a manually curated knowledgebase with standardized terms,” Mabee explains. The data in the knowledgebase could be analyzed alongside genetic data from model organisms. But there was one problem in the years after the KB was introduced: only a few researchers were using it. That’s when Mabee decided to turn to KI.
Mabee had already served as a mentor at two KI events: in 2010 at the Innovations in Biological Imaging and Visualization (IBIV) Ideas Lab and in 2011 at the Assembling, Visualizing and Analyzing the Tree of Life (AVAToL) Ideas Lab. Mabee recalls that she was impressed by the interdisciplinary nature of the events, the impact the group activities had on the quality of ideas produced and the communities that were built around the problems. These experiences changed her idea of what was possible to accomplish at a workshop, Mabee says. “It was like having gone to high school plays all of your life and then going to one on Broadway.”
In 2013, when Mabee wanted to challenge her colleagues to use the KB data in new ways, she hired KI to facilitate the Phenoscape Innovation Lab held at the California Academy of Sciences in San Francisco. “We needed to build a community, and we needed creative demonstration cases that would spur innovative uses of newly integrated gene to phenotype data.” The workshop resulted in about half a dozen projects, including one that motivated the development of a software tool called OntoTrace. It allows analysis of the evolution of traits in the knowledgebase. Six years later, Mabee says OntoTrace is “at the foundation of what I’m doing now.”
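To make the idea of analyzing traits through standardized terms more concrete, here is a minimal Python sketch of ontology-based presence/absence inference. It is not the OntoTrace implementation. The part-of hierarchy, the taxon names and the function names are all hypothetical, chosen only to show how a statement in one study ("pelvic fin ray present") can imply a fact that study never stated ("pelvic fin present").

```python
# Hypothetical part-of hierarchy (child structure -> parent structure).
# A real anatomy ontology is far larger and has more relation types.
PART_OF = {
    "pelvic fin ray": "pelvic fin",
    "pelvic fin radial": "pelvic fin",
    "pectoral fin ray": "pectoral fin",
}


def ancestors(term):
    """Return every structure that `term` is part of, nearest first."""
    chain = []
    while term in PART_OF:
        term = PART_OF[term]
        chain.append(term)
    return chain


def descendants(term):
    """Return every structure that is (transitively) part of `term`."""
    found = []
    for child, parent in PART_OF.items():
        if parent == term:
            found.append(child)
            found.extend(descendants(child))
    return found


def infer_presence(annotations):
    """Build a taxon -> {structure: state} matrix from curated statements.

    `annotations` is a list of (taxon, structure, state) tuples, where
    state is "present" or "absent". A present part implies its wholes are
    present; an absent whole implies its parts are absent. Inferred values
    never overwrite values already in the matrix, and conflicting
    statements are not reconciled in this sketch.
    """
    matrix = {}
    for taxon, structure, state in annotations:
        row = matrix.setdefault(taxon, {})
        row[structure] = state
        if state == "present":
            implied = ancestors(structure)
        else:
            implied = descendants(structure)
        for other in implied:
            row.setdefault(other, state)
    return matrix


if __name__ == "__main__":
    # Hypothetical statements, as if curated from two different papers.
    statements = [
        ("Taxon A", "pelvic fin ray", "present"),
        ("Taxon B", "pelvic fin", "absent"),
    ]
    for taxon, row in infer_presence(statements).items():
        print(taxon, row)
```

The design point is that the inference lives in the shared vocabulary rather than in any one paper, which is what lets statements from many studies be combined into a single matrix.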
Using the KB, Mabee and her colleagues combined trait and genetic data from different systems in a demonstration project involving data from commonly used model organisms, like zebrafish and mice. “For the first time, our work validated that you can predict in silico, the candidate genes that are involved in the development of biodiverse traits.” Mabee also points out that researchers can begin to experimentally validate the genes in non-model organisms. For example, they can look for concordance in gene expression patterns or, these days, even use gene editing techniques. “Essentially, the Phenoscape KB is a giant hypothesis generator of candidate genes for the amazing and novel traits of animals.”
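One way to picture a "hypothesis generator" of this kind is as a ranking problem: compare the phenotype seen in a species of interest with the phenotypes recorded for model-organism genes, and list the genes whose phenotypes are most similar. The Python sketch below uses a simple Jaccard overlap on ontology terms. The article does not say how Phenoscape scores similarity, so the scoring method, the term hierarchy and the gene names here are all illustrative assumptions.

```python
# Hypothetical is-a hierarchy of phenotype terms (child -> parent).
IS_A = {
    "pelvic fin absent": "paired fin reduced",
    "pectoral fin reduced": "paired fin reduced",
    "paired fin reduced": "fin phenotype",
    "eye reduced": "eye phenotype",
}


def closure(terms):
    """Expand a set of terms with all of their ancestors."""
    expanded = set()
    for term in terms:
        # Stop early if the term is already present: its ancestors are too.
        while term is not None and term not in expanded:
            expanded.add(term)
            term = IS_A.get(term)
    return expanded


def jaccard(terms_a, terms_b):
    """Jaccard similarity of two phenotypes after ancestor expansion."""
    a, b = closure(terms_a), closure(terms_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_candidate_genes(query_phenotype, gene_phenotypes):
    """Return (gene, score) pairs, most similar phenotype first."""
    scores = {
        gene: jaccard(query_phenotype, phenotype)
        for gene, phenotype in gene_phenotypes.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical model-organism annotations; gene names are placeholders.
    genes = {
        "gene_x": {"pectoral fin reduced"},
        "gene_y": {"eye reduced"},
    }
    print(rank_candidate_genes({"pelvic fin absent"}, genes))
```

Expanding each term to its ancestors is what lets two differently worded observations count as partly matching: "pelvic fin absent" and "pectoral fin reduced" share no exact term, but both fall under the same broader category in this toy hierarchy.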
Mabee and her colleagues saw the chance to apply these methods to datasets containing human trait and genetic data when the National Institutes of Health put out a call for proposals to help create a cell-level anatomical atlas of the human body. Mabee was lead PI on one of the nine grants awarded and spent a year working with KI and others to establish the foundation for the Human BioMolecular Atlas Program (HuBMAP). She was one of nearly 150 members of the HuBMAP consortium who published a paper in Nature in October 2019 describing the project. (HuBMAP is an ongoing project.)
Harnessing Big Data
In October 2019, Mabee and an interdisciplinary team of researchers were awarded a two-year collaborative NSF award to address the biggest obstacle keeping the KB from being more useful to more researchers: manual data annotation. There are likely thousands of published papers that include some anatomical information about fish fins, but data from only about 50 are available through the KB. “It’s highly limited by manual curation. It’s a current problem across knowledgebases,” says Mabee, “including those for model organisms.”
Now, the solution to this problem may be within reach. In May 2019, Mabee was a participant at a KI event, her first time as a participant since her introduction to KI ten years ago. The goal of NSF’s Harnessing the Data Revolution Ideas Lab was the creation of interdisciplinary teams that would form around innovative solutions to Big Data problems facing modern science. Data integration, Mabee’s specialty, is one of those challenges.
At the Ideas Lab event, Mabee’s team proposed to develop artificial neural networks that can use structured knowledge from biology (e.g., ontologies, phylogenies) to automatically detect and predict traits from images of organisms. The hope is, for example, that structures such as fins in the images of fishes (images available from museums and other sources such as iDigBio) can be recognized on the basis of this structured knowledge, without training the neural networks specifically on images of fins.
Image-based trait data derived from this work will feed into knowledgebases such as the KB, allowing a better understanding of patterns of evolution, as well as improving the matching of traits, genes and environment. “We need fuller data to understand how traits and genes change in space over changing geographic climates so we can predict changes that will occur as our climate shifts.”
According to Mabee, there are an estimated 20 million species on the planet; only 1.5 million have been described. In the short run, most of them are very unlikely to ever be described in the traditional way, i.e., via published description. Instead, specimens are likely to be imaged as they are vouchered into collections. The idea, Mabee says, is that semantic trait data can be automatically extracted from these images. “This work would also allow us to document vanishing species before it’s too late.”
A New Scientific Method
Sometimes working with KI impacts the science — and the scientist.
After a decade of attending KI events, Mabee says using the language and tools she has learned working with KI has become second nature. She credits the KI method with transforming her approach to research, teaching and collaboration. She has found the practice of phrasing problems as “How might we…?” questions particularly transformative, regardless of the collaborative setting. She also finds using the PPCo method a refreshing and supportive way of evaluating the ideas of others. (The method involves listing the pluses, potentials and concerns with overcomes for each idea.)
Scientific training forces a skepticism that is sometimes at odds with creativity, Mabee says. “Insisting on talking about the positives and the potentials before the concerns has transformed the way I review articles, the way I communicate on panels and the way I communicate with my students. It’s become an intuitive language for me and I don’t remember myself without it.”