Georgetown University biologist Colin Carlson is concerned about mousepox.
The virus, discovered in 1930, spreads among mice and kills them with ruthless efficiency. But scientists have never considered it a potential threat to humans. Now Dr. Carlson, his colleagues and their computers are not so sure.
Using a technique called machine learning, the researchers have spent the last few years programming computers to teach themselves about viruses that can infect human cells. The computers have combed through vast amounts of information about the biology and ecology of the viruses’ animal hosts, as well as the genomes and other features of the viruses themselves. Over time, the computers learned to recognize certain factors that predict whether a virus has the potential to spill over into humans.
Once the computers had proved themselves on viruses that scientists had already studied intensively, Dr. Carlson and his colleagues deployed them on unknown ones, eventually compiling a short list of animal viruses with the potential to jump the species barrier and cause a human outbreak.
In the latest runs, the algorithms unexpectedly placed the mousepox virus in the top ranks of dangerous pathogens.
“Whenever we run this model, it comes up very high,” Dr. Carlson said.
Puzzled, Dr. Carlson and his colleagues dug into the scientific literature. They came across documentation of a long-forgotten outbreak in rural China in 1987. Schoolchildren had come down with an infection that caused sore throats and inflammation in their hands and feet.
Years later, a team of scientists performed tests on throat swabs that were collected during the outbreak and placed in storage. These samples, as the group reported in 2012, contained mousepox DNA. But their study received little attention, and a decade later mousepox is still not considered dangerous to humans.
If the computer programmed by Dr. Carlson and his colleagues is right, the virus deserves a new look.
“It’s just crazy that this got lost in the huge pile of stuff that public health has to sift through,” he said. “This actually changes the way we think about this virus.”
Scientists have identified about 250 human diseases that arose when an animal virus jumped the species barrier. HIV, for example, jumped from chimpanzees, and the new coronavirus originated in bats.
Ideally, scientists would like to identify the next spillover virus before it starts infecting people. But there are far too many animal viruses for virologists to study. Scientists have identified more than 1,000 viruses in mammals, but that is probably a small fraction of the true number. Some researchers suspect that mammals carry thousands of viruses, while others put the number in the hundreds of thousands.
To identify potential new spillovers, researchers such as Dr. Carlson are using computers to find patterns hidden in scientific data. The machines can zero in on viruses that are especially likely to give rise to a human disease, for example, and can also predict which animals are most likely to harbor dangerous viruses that we do not yet know about.
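“You just can’t see in as many dimensions as the model can,” said Barbara Han, a disease ecologist at the Cary Institute of Ecosystem Studies who collaborates with Dr. Carlson.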
Dr. Han first came across machine learning in 2010. Computer scientists had been developing the technique for decades and were beginning to build powerful tools with it. These days, machine learning enables computers to detect fraudulent credit charges and recognize people’s faces.
But few researchers had applied machine learning to diseases. Dr. Han wondered if she could use it to answer open questions, such as why fewer than 10 percent of rodent species harbor pathogens known to infect humans.
She fed a computer information about different rodent species from online databases – everything from their age at weaning to their population density. The computer then looked for the characteristics of rodents that are known to harbor a large number of species-jumping pathogens.
Once the computer had created a model, she tested it against another group of rodent species, seeing how well it could guess which species were laden with disease-causing agents. Eventually, the computer model reached 90 percent accuracy.
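The train-then-test procedure described above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the lifespan values, the reservoir labels and the one-feature threshold rule (a "decision stump") are all invented for the example. They are not Dr. Han's data or her method, which the article does not specify.

```python
# Toy hold-out evaluation. All numbers below are made up for illustration.
# Each record: (hypothetical lifespan in years, 1 if a known reservoir else 0)
train = [(1.0, 1), (1.5, 1), (2.0, 1), (2.5, 0),
         (4.0, 0), (5.0, 0), (1.2, 1), (6.0, 0)]
test = [(1.1, 1), (1.8, 1), (3.0, 0), (4.5, 0), (2.2, 1)]


def predict(lifespan, threshold):
    """Predict 'reservoir' (1) when lifespan is at or below the threshold."""
    return 1 if lifespan <= threshold else 0


def accuracy(records, threshold):
    """Fraction of records whose label matches the stump's prediction."""
    hits = sum(predict(x, threshold) == y for x, y in records)
    return hits / len(records)


def fit_threshold(records):
    """Pick the candidate threshold with the best accuracy on the given records."""
    candidates = sorted(x for x, _ in records)
    return max(candidates, key=lambda t: accuracy(records, t))


# Learn the rule on one group of species, then score it on a separate group.
threshold = fit_threshold(train)
test_accuracy = accuracy(test, threshold)
print(f"learned threshold: {threshold} years")
print(f"held-out accuracy: {test_accuracy:.0%}")
```

The point of the second group is that the model never saw it during training, so its score there is a fairer estimate of how the model would do on species nobody has examined yet.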
Then Dr. Han turned to rodents that had yet to be examined for spillover pathogens and compiled a list of high-priority species. She and her colleagues predicted that species such as the montane vole and the northern grasshopper mouse of western North America would be particularly likely to carry worrisome pathogens.
Of all the features Dr. Han and her colleagues provided to their computer, the most important was the life span of the rodents. Species that die young carry more pathogens, probably because evolution has put more of their resources into reproduction than into building a strong immune system.
These results took years of painstaking research in which Dr. Han and her colleagues combed through ecological databases and scientific studies in search of useful data. More recently, researchers have sped up this work by creating databases designed specifically to teach computers about viruses and their hosts.
In March, for example, Dr. Carlson and colleagues unveiled an open-access database called VIRION, which has collected half a million pieces of information about 9,521 viruses and their 3,692 animal hosts – and is still growing.
Databases like VIRION now make it possible to ask more focused questions about new epidemics. When the Covid pandemic struck, it soon became clear that it was caused by a new virus called SARS-CoV-2. Dr. Carlson, Dr. Han and their colleagues created programs to identify the animals most likely to harbor relatives of the new coronavirus.
SARS-CoV-2 belongs to a group of species called betacoronaviruses, which also includes the viruses that caused the SARS and MERS epidemics in humans. For the most part, betacoronaviruses infect bats. When SARS-CoV-2 was discovered in January 2020, 79 species of bats were known to carry them.
But scientists have not systematically searched all 1,447 species of bats for betacoronaviruses, and such a project would take many years to complete.
By feeding biological data about different types of bats – their diet, the length of their wings, etc. – into their computers, Dr. Carlson, Dr. Han and their colleagues developed a model that could predict which bats are most likely to harbor betacoronaviruses. They found more than 300 species that fit the bill.
Since that prediction in 2020, researchers have indeed detected betacoronaviruses in 47 species of bats – all of which were on the prediction lists produced by some of the computer models they created for their study.
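One way to read that result is to ask how often it would happen by luck alone. The sketch below is a back-of-the-envelope calculation, not an analysis from the study. It assumes a single list of exactly 300 predicted species (the article says "more than 300" and refers to lists from several models), drawn from the 1,368 bat species not already known as hosts in 2020 (1,447 minus 79). It also assumes the 47 newly confirmed hosts are a random draw from those species, which real field surveys are not.

```python
from math import comb

# Figures from the article, plus simplifying assumptions labeled below.
total_bat_species = 1447
known_hosts_2020 = 79
candidates = total_bat_species - known_hosts_2020  # species not yet known as hosts
predicted = 300  # ASSUMPTION: the article says only "more than 300"
new_hosts = 47   # newly confirmed hosts, all of them on a prediction list

# Hypergeometric probability that all 47 new hosts land on the predicted list
# if hosts were scattered at random among the candidate species.
p_all_by_chance = comb(predicted, new_hosts) / comb(candidates, new_hosts)

# How many of the 47 would be expected on the list under the same assumption.
expected_hits = new_hosts * predicted / candidates

print(f"expected hits by chance: {expected_hits:.1f} of {new_hosts}")
print(f"probability all {new_hosts} are on the list by chance: {p_all_by_chance:.1e}")
```

Under these assumptions, chance would put only about 10 of the 47 species on the list, so a clean sweep suggests the traits carry real signal. Because surveys tend to target species that researchers already suspect, the figure is a rough guide rather than a formal test.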
Daniel Becker, a disease ecologist at the University of Oklahoma who also worked on the betacoronavirus study, said it was striking how simple features such as body size could lead to powerful predictions about viruses. “A lot of it is the low-hanging fruit of comparative biology,” he said.
Dr. Becker is now following up on the list of potential betacoronavirus hosts from his own backyard. It turns out that some bats in Oklahoma are predicted to harbor them.
If Dr. Becker does find a backyard betacoronavirus, he will not be in a position to say immediately that it is an imminent threat to humans. Scientists would first have to carry out painstaking experiments to determine the risk.
Dr. Pranav Pandit, an epidemiologist at the University of California at Davis, cautions that these models are very much a work in progress. When tested on well-studied viruses, they do significantly better than random chance, but they could do better.
“It’s not at a stage where we can just take those results and start warning the world, saying, ‘This is a zoonotic virus,’” he said.
Nardus Mollentze, a computational virologist at the University of Glasgow, and his colleagues have pioneered a method that could significantly increase the accuracy of the models. Instead of looking at a virus’s hosts, their model looks at its genes. A computer can be taught to recognize subtle features in the genes of viruses that can infect humans.
In their first report on this technique, Dr. Mollentze and his colleagues developed a model that could correctly identify human-infecting viruses more than 70 percent of the time. Dr. Mollentze cannot yet say why his gene-based model works, but he has some ideas. Our cells can recognize foreign genes and send an alarm to the immune system. Viruses that can infect our cells may have the ability to mimic our own DNA as a kind of viral camouflage.
When they applied this model to animal viruses, they came up with a list of 272 species at high risk of spilling over. That is too many for virologists to study in any depth.
“You can only work on so many viruses,” said Emmie de Wit, a virologist at Rocky Mountain Laboratories in Hamilton, Mont., who oversees research on the new coronavirus, influenza and other viruses. “On our end, we would really need to narrow it down.”
Dr. Mollentze acknowledged that he and his colleagues need to find a way to pinpoint the worst of the worst among animal viruses. “This is just the beginning,” he said.
To follow up on his initial study, Dr. Mollentze is working with Dr. Carlson and his colleagues to merge data about the genes of viruses with data on the biology and ecology of their hosts. The researchers are seeing some promising results from this approach, including the tantalizing mousepox lead.
Other types of data may make the predictions even better. One of the most important characteristics of a virus, for example, is the coating of sugar molecules on its surface. Different viruses end up with different patterns of sugar molecules, and that arrangement can have a huge impact on their success. Some viruses may use this molecular frosting to hide from their host’s immune system. In other cases, a virus may use its sugar molecules to latch on to new cells, triggering a new infection.
This month, Dr. Carlson and his colleagues posted a commentary online arguing that machine learning could gain many insights from the sugar coatings of viruses and their hosts. Scientists have already accumulated a lot of that knowledge, but it has yet to be put into a form from which computers can learn.
“My gut sense is that we know a lot more than we think,” Dr. Carlson said.
Dr. de Wit said that machine learning models could one day guide virologists like herself to study certain animal viruses. “There’s definitely a big advantage that’s coming from this,” she said.
But she noted that the models so far have focused primarily on the potential for pathogens to infect human cells. Before it can cause a new human disease, a virus also has to spread from person to person and cause serious symptoms along the way. She is waiting for a new generation of machine learning models that can make those predictions, too.
“What we really want to know is not which viruses can infect humans, but which viruses can cause outbreaks,” she said. “So that’s really the next step we need to figure out.”