
AI is helping scientists navigate thousands of COVID-19 research papers

April 6, 2020

AI researchers are joining forces to create programs that can help medical scientists navigate thousands of research papers written about COVID-19 (Image credit: Depositphotos)

This article is part of our ongoing coverage of the fight against coronavirus.

As the world unites in the fight against COVID-19, scientists and researchers around the world are studying the novel coronavirus and publishing their findings in peer-reviewed journals and pre-print servers.

Scattered across these research papers might be the pieces of the puzzle that will unlock the cure or vaccine for COVID-19 or new ways to treat patients and prevent the spread of the virus. Unfortunately, no single person can go through tens of thousands of documents, and the thousands more that are being added every week.

This is where the artificial intelligence community enters the scene. Among other efforts to help fight the coronavirus pandemic, AI researchers are busy developing tools that will help medical scientists navigate the fast-growing corpus of literature surrounding coronavirus.

The concerted effort to process COVID-19 papers, which has brought together government agencies, tech giants, universities, and research labs, will be a measure of how useful our state-of-the-art AI algorithms have become.

The CORD-19 dataset

In March, the U.S. government teamed up with tech giants Microsoft and Google to gather research papers about COVID-19. The corpus was compiled into a dataset named COVID-19 Open Research Dataset (CORD-19) by the Allen Institute for AI (AI2) in partnership with the Chan Zuckerberg Initiative, Georgetown University’s Center for Security and Emerging Technology, Microsoft Research, and the National Library of Medicine at the National Institutes of Health, in coordination with the White House Office of Science and Technology Policy.

CORD-19 was released in mid-March and made available to AI researchers, who can use it to create machine learning models that help scientists find the information they need.

The initial dataset included over 24,000 research papers from peer-reviewed publications as well as pre-print servers such as bioRxiv and medRxiv. It has since grown to more than 47,000 documents.

CORD-19 is available on AI2’s Semantic Scholar website, a search engine for peer-reviewed research. Machine learning researchers can download the database from Semantic Scholar. The corpus has also been integrated into the search engine and can be queried through Semantic Scholar.

AI2 has also launched the CORD-19 Explorer, a full-text search engine specialized for the COVID-19 research corpus. The Explorer also links to other relevant tools. Some of them have been built on CORD-19, such as a search engine that uses Microsoft Azure’s Cognitive Search. Others are based on different data sources, such as the Elsevier Coronavirus Research Repository. You’ll also find a link to COVID-19 Cognitive City, a social network focused on stopping the spread of coronavirus.

The Kaggle challenge

Image credit: Depositphotos

Semantic Scholar and Google Scholar, which also consolidates relevant research papers, are already powerful tools for searching the corpus of knowledge generated on COVID-19. Semantic Scholar uses transformers, the state of the art in natural language processing (NLP). Google has also added BERT, a transformer-based language model, in a recent update to its search engine.

The community, however, wants to know whether it can push the limits of current AI algorithms and use them to further help scientists in their fight against COVID-19.

Following the release of CORD-19, Kaggle, the Google-owned hub for data science and machine learning competitions, launched the COVID-19 Open Research Dataset Challenge. “We are issuing a call to action to the world’s artificial intelligence experts to develop text and data mining tools that can help the medical community develop answers to high priority scientific questions,” the challenge’s description reads.

To be able to measure progress and success, the challenge has been broken down into a list of 10 tasks that can help better understand new information about COVID-19, patient care, and cure development.

For instance, one task involves non-pharmaceutical interventions (NPIs). The AI that tackles this task should be able to peruse the dataset and find papers that discuss NPIs and their effectiveness, such as how travel bans and school closures are helping to flatten the COVID-19 curve. Another task involves gathering the latest findings on COVID-19 risk factors.

Results should include complementary information such as the strength of the evidence found in the studies, which can help in the decision-making process.

“Findings should be focused, concise, extract quotes and numbers out of papers and also provide a link to the underlying source,” Anthony Goldbloom, Kaggle’s CEO, has written in an advisory on the CORD-19 challenge.
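To make the requirement concrete, here is a minimal sketch of what "extract quotes and numbers and link to the source" could look like in code. It is not taken from any challenge submission. The `Finding` record, the `extract_findings` helper, the sample text, and the example URL are all hypothetical, and the sentence splitter is deliberately naive.

```python
import re
from dataclasses import dataclass


@dataclass
class Finding:
    quote: str     # sentence copied verbatim from the paper
    numbers: list  # numeric strings found in that sentence
    source: str    # link back to the underlying paper


# Split after ., ! or ? followed by whitespace (naive, but enough for a sketch).
SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")
# Integers, decimals, and percentages such as 14, 2.5, or 30%.
NUMBER = re.compile(r"\d+(?:\.\d+)?%?")


def extract_findings(text, source, keyword):
    """Keep sentences that mention the keyword and contain at least one number."""
    findings = []
    for sentence in SENTENCE_SPLIT.split(text.strip()):
        if keyword.lower() not in sentence.lower():
            continue
        numbers = NUMBER.findall(sentence)
        if numbers:
            findings.append(Finding(sentence, numbers, source))
    return findings


# Hypothetical paper text and URL, invented for illustration only.
paper_text = (
    "We studied school closures in three regions. "
    "School closures reduced contacts by 30% over 14 days. "
    "Further work is needed."
)
findings = extract_findings(
    paper_text, "https://example.org/paper-1", "school closures"
)
for finding in findings:
    print(finding.quote, finding.numbers, finding.source)
```

Each result carries the verbatim quote, the figures it mentions, and a link to the paper, so a reader can check the claim against the original study.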

As of this writing, there have been more than 730 contributors to the CORD-19 Challenge.

Where does AI technology stand today

The tasks included in the CORD-19 Challenge are very practical tasks, and the results will directly affect our response to the coronavirus pandemic. But one thing to note is that we can’t expect miracles from contemporary artificial intelligence technologies.

Language processing is perhaps the most challenging subfield of AI and one of the most complex functions of the human brain, the one thing that sets us apart from other living beings. According to many experts, the problem of language processing will remain unsolved until we create artificial general intelligence, the kind of AI that has human-level abstraction, reasoning, and problem-solving capabilities. And by many accounts, we are at least decades away from general AI.

For the moment, our most advanced NLP models rely on deep learning and artificial neural networks. Neural networks are very efficient statistical models that can find recurring patterns in large sequences of data. Deep learning models like transformers, now used in most advanced language models, can operate on very large corpora of text and answer queries in ways that were beyond the capabilities of previous artificial intelligence algorithms.

However, when it comes to extracting the implied meanings that are often omitted in written and spoken language, even the most sophisticated AI algorithms struggle. We still don’t have AI that can understand and process human language as efficiently as a seven-year-old child.

But the silver lining is that this particular challenge involves a very narrow field of research. As opposed to general natural language understanding, the CORD-19 Challenge has a very specific requirement: Searching for information about one virus and one disease.

While current AI systems fall short at general problem-solving, they’re very good at dealing with narrow domains, often performing even better than humans. In fact, according to Goldbloom, “Some of the most impactful work so far have involved simple methods like string matching and regular expressions.” String matching and regular expressions are not even considered AI today.
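As a rough illustration of how far such simple methods can go, the sketch below ranks papers by how often their title and abstract mention terms related to non-pharmaceutical interventions. The term list, the `rank_by_matches` helper, and the three sample records are assumptions made up for this example. They are not drawn from CORD-19 or from any Kaggle notebook.

```python
import re

# Hypothetical term list for the non-pharmaceutical interventions task.
NPI_PATTERN = re.compile(
    r"\b(travel bans?|school closures?|social distancing|quarantine)\b",
    re.IGNORECASE,
)


def rank_by_matches(papers):
    """Return (title, match_count) pairs for matching papers, most matches first."""
    scored = []
    for paper in papers:
        text = f"{paper['title']} {paper['abstract']}"
        count = len(NPI_PATTERN.findall(text))
        if count:
            scored.append((paper["title"], count))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Invented records standing in for entries of a paper corpus.
papers = [
    {
        "title": "Effect of travel bans",
        "abstract": "Travel bans and school closures slowed spread.",
    },
    {"title": "Viral genome analysis", "abstract": "We sequence the genome."},
    {"title": "Quarantine outcomes", "abstract": "Outcomes in one city."},
]
ranked = rank_by_matches(papers)
for title, count in ranked:
    print(count, title)
```

A pattern like this has no notion of meaning, so it misses paraphrases and cannot judge the strength of the evidence, but in a narrow domain with a well-chosen vocabulary it can quickly narrow thousands of papers down to a short reading list.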

Another factor that provides hope is the quality of the information. One of the challenges of machine learning is gathering and cleaning the data used in training the models. In this case, there’s a concerted effort by the entire community and a lot of manual and automated effort is going into making sure that we have a consolidated body of reliable documents for research.

So we probably can’t expect the emergence of an AI system that can read and understand every document like a human scientist would. Past efforts at creating such AI systems have failed, and there hasn’t been any fundamental breakthrough to show hope for a change in this regard.

But what we can expect is the development of very specialized AI-powered search tools that will help our scientists find relevant bits in the growing sea of information published on COVID-19. As long as you know which questions to ask (and the people using these systems certainly do), you’ll be able to obtain high-quality information.

As AI2 CEO Oren Etzioni wrote in Wired last week, “While the jury is still out on AI’s contributions in the coming weeks, it’s clear that the AI community has enlisted to fight Covid-19. It is ironic that the AI which has caused such consternation with facial recognition, deepfakes, and such is now at the front lines of helping scientists confront Covid-19 and future pandemics… Our use of AI to fight Covid-19 reminds us that AI is a tool, not a being, and it’s up to us to employ this tool for the common good.”

