AI Explores RNA Dark Matter and Finds 70,000 New Viruses
People often associate artificial intelligence with chatbots like ChatGPT, which generates human-like text by predicting the next word based on large datasets. However, a quiet revolution is underway, where scientists are leveraging AI for groundbreaking research and discoveries.
In a recent study, researchers used AI to uncover more than 160,000 potential RNA virus species, including about 70,000 potentially novel species that had never previously been identified.
As the name implies, RNA viruses have genomes made of single-stranded or double-stranded RNA rather than DNA. Well-known diseases caused by RNA viruses include Ebola, Severe Acute Respiratory Syndrome (SARS), influenza, the common cold, and hepatitis C.
While viruses are the most abundant biological entity on our planet, we have a limited understanding of these infectious agents and the role they play in our world.
A study published in Cell by an international team of researchers marks a major milestone in the discovery of virus species. It is the largest virus species discovery paper ever published, and it significantly advances our understanding of viral diversity.
"This is the largest number of new virus species discovered in a single study, massively expanding our knowledge of the viruses that live among us," said senior author of the study, Professor Edward Holmes from the School of Medical Sciences in the Faculty of Medicine and Health at the University of Sydney.
"To find this many new viruses in one fell swoop is mind-blowing, and it just scratches the surface, opening up a world of discovery. There are millions more to be discovered, and we can apply this same approach to identifying bacteria and parasites."
Previous research has used machine learning to discover new viruses within sequencing data. The latest study goes a step further by also drawing on predicted protein structure, which is crucial for understanding viral mechanisms.
The researchers developed a deep learning algorithm called LucaProt to analyze extensive genetic sequence data, including lengthy virus genomes of up to 47,250 nucleotides. LucaProt was trained to process this data and identify viruses by examining their genetic sequences and the secondary structures of proteins essential for RNA virus replication.
Once potential viral sequences were identified, the data was fed into a protein-prediction tool called ESMFold, developed by researchers at Meta. ESMFold predicts the three-dimensional structures of proteins from their amino acid sequences, which helps reveal how the proteins function. The creators of a similar AI system, AlphaFold, developed by Google DeepMind, were awarded a Nobel Prize in Chemistry last week.
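Structure-prediction tools such as ESMFold take amino acid sequences as input, so nucleotide data has to be translated into protein sequence first. The Python sketch below illustrates that general step using the standard genetic code. It is not the study's pipeline: the function names `translate` and `longest_open_stretch` are hypothetical, and the sketch reads only the three forward frames of a single strand.

```python
from itertools import product

# Standard genetic code, codons ordered U, C, A, G at each position.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(sequence, frame=0):
    """Translate a nucleotide sequence into amino acids ('*' marks a stop codon)."""
    rna = sequence.upper().replace("T", "U")  # accept DNA-style input as well
    codons = (rna[i:i + 3] for i in range(frame, len(rna) - 2, 3))
    # Codons with ambiguous bases (for example N) become 'X'.
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def longest_open_stretch(sequence):
    """Return the longest stop-free protein stretch across the three forward frames."""
    best = ""
    for frame in range(3):
        for stretch in translate(sequence, frame).split("*"):
            if len(stretch) > len(best):
                best = stretch
    return best


if __name__ == "__main__":
    print(translate("AUGGCCUAA"))
    print(longest_open_stretch("AUGGCCUAA"))
```

A real workflow would also scan the reverse complement and would pass the resulting protein candidates to a classifier or structure predictor; both steps are left out here.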
Many of the viruses identified in the study had already been sequenced and were available in public databases. However, they were so genetically divergent that they did not fit into any known category or classification. Such unclassified sequences are referred to as genetic ‘dark matter’.
The researchers trained their AI model to analyze this dark matter and identify viruses by examining both their sequences and the secondary structures of proteins utilized in RNA virus replication. The AI tool was instrumental in helping fast-track virus discovery, which would be time-consuming using traditional methods, such as manual sequencing and analysis.
Co-author from Sun Yat-sen University, the study's institutional lead, Professor Mang Shi said: "We used to rely on tedious bioinformatics pipelines for virus discovery, which limited the diversity we could explore. Now, we have a much more effective AI-based model that offers exceptional sensitivity and specificity, and at the same time allows us to delve much deeper into viral diversity. We plan to apply this model across various applications."
According to Professor Holmes, the next phase involves further training the AI method to discover additional viruses and gain insights into their ecological roles. Holmes, Shi, and their team have released LucaProt for public access, enabling fellow researchers to leverage the tool in their efforts to identify new RNA viruses within their datasets.