On February 14, a researcher frustrated by a failed attempt to reproduce the results of a machine learning research paper opened a Reddit account under the username ContributionSecure14 and posted on the r/MachineLearning subreddit: “I just spent a week implementing a paper as a baseline and failed to reproduce the results. I realized today after googling for a bit that a few others were also unable to reproduce the results. Is there a list of such papers? It will save people a lot of time and effort.”
The post struck a nerve with other users on r/MachineLearning, which is the largest Reddit community for machine learning.
“Easier to compile a list of reproducible ones…,” one user responded.
“Probably 50%-75% of all papers are unreproducible. It’s sad, but it’s true,” another user wrote. “Think about it, most papers are ‘optimized’ to get into a conference. More often than not the authors know that a paper they’re trying to get into a conference isn’t very good! So they don’t have to worry about reproducibility because nobody will try to reproduce them.”
A few other users posted links to machine learning papers they had failed to implement and voiced their frustration with code implementation not being a requirement in ML conferences.
The next day, ContributionSecure14 created “Papers Without Code,” a website that aims to create a centralized list of machine learning papers that are not implementable.
“I’m not sure if this is the best or worst idea ever but I figured it would be useful to collect a list of papers which people have tried to reproduce and failed,” ContributionSecure14 wrote on r/MachineLearning. “This will give the authors a chance to either release their code, provide pointers or rescind the paper. My hope is that this incentivizes a healthier ML research culture around not publishing unreproducible work.”
Reproducing the results of machine learning papers
Machine learning researchers regularly publish papers on online platforms such as arXiv and OpenReview. These papers describe concepts and techniques that highlight new challenges in machine learning systems or introduce new ways to solve known problems. Many of these papers find their way into mainstream artificial intelligence conferences such as NeurIPS, ICML, ICLR, and CVPR.
Having source code to go along with a research paper helps a lot in verifying the validity of a machine learning technique and building on top of it. But this is not a requirement for machine learning conferences. As a result, many students and researchers who read these papers struggle with reproducing their results.
“Unreproducible work wastes the time and effort of well-meaning researchers, and authors should strive to ensure at least one public implementation of their work exists,” ContributionSecure14, who preferred to remain anonymous, told TechTalks in written comments. “Publishing a paper with empirical results in the public domain is pointless if others cannot build off of the paper or use it as a baseline.”
But ContributionSecure14 also acknowledges that there are sometimes legitimate reasons for machine learning researchers not to release their code. For example, some authors may train their models on internal infrastructure or use large internal datasets for pretraining. In such cases, the researchers are not at liberty to publish the code or data along with their paper because of company policy.
“If the authors publish a paper without code due to such circumstances, I personally believe that they have the academic responsibility to work closely with other researchers trying to reproduce their paper,” ContributionSecure14 says. “There is no point in publishing the paper in the public domain if others cannot build off of it. There should be at least one publicly available reference implementation for others to build off of or use as a baseline.”
In some cases, even when the authors release both the source code and the data for their paper, other machine learning researchers still struggle to reproduce the results. This can happen for various reasons. For instance, the authors might cherry-pick the best results from several experiments and present them as state-of-the-art achievements. In other cases, the researchers might have used tricks such as tuning the parameters of their machine learning model on the test dataset to boost the results. In such cases, even if the results are reproducible, they are not relevant, because the machine learning model has been overfitted to specific conditions and won’t perform well on previously unseen data.
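The effect of picking the best run by its test score can be illustrated with a small simulation. The sketch below is a hypothetical example, not code from any paper discussed here: the function names, the number of configurations, and the split sizes are all assumptions. Every “model” is a coin-flip classifier with no real skill, so any score meaningfully above 0.5 is an artifact of how the model was selected.

```python
import random


def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def simulate(n_configs=50, n_examples=200, seed=0):
    """Score n_configs skill-free 'models' on three independent splits."""
    rng = random.Random(seed)
    # Hypothetical binary labels for a validation, a test, and a fresh split.
    splits = {
        name: [rng.randint(0, 1) for _ in range(n_examples)]
        for name in ("validation", "test", "fresh")
    }
    results = []
    for config_id in range(n_configs):
        scores = {"config": config_id}
        for name, labels in splits.items():
            # A coin-flip classifier: its true accuracy is 0.5 on every split.
            guesses = [rng.randint(0, 1) for _ in labels]
            scores[name] = accuracy(guesses, labels)
        results.append(scores)
    return results


results = simulate()

# Leaky protocol: choose the configuration by its test score, then report it.
leaky = max(results, key=lambda r: r["test"])
# Sounder protocol: choose on validation, use the test split only to report.
sound = max(results, key=lambda r: r["validation"])

print(f"selected on test:       test={leaky['test']:.3f}  fresh={leaky['fresh']:.3f}")
print(f"selected on validation: test={sound['test']:.3f}  fresh={sound['fresh']:.3f}")
```

Under these assumptions, the configuration chosen by its test score reports the highest test accuracy of all runs, while its accuracy on the fresh split is just another coin-flip score with no tie to that selection. Choosing on a separate validation split keeps the reported test score an honest estimate.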
“I think it is necessary to have reproducible code as a prerequisite in order to independently verify the validity of the results claimed in the paper, but [code alone is] not sufficient,” ContributionSecure14 said.
Efforts for machine learning reproducibility
The reproducibility problem is not limited to small machine learning research teams. Even big tech companies that spend millions of dollars on AI research every year often fail to validate the results of their papers. In October 2020, a group of 31 scientists wrote a joint article in Nature, criticizing the lack of transparency and reproducibility in a paper on the use of AI in medical imaging, published by a group of AI researchers at Google. “[The] absence of sufficiently documented methods and computer code underlying the study effectively undermines its scientific value. This shortcoming limits the evidence required for others to prospectively validate and clinically implement such technologies,” the authors wrote. “Scientific progress depends on the ability of independent researchers to scrutinize the results of a research study, to reproduce the study’s main results using its materials, and to build on them in future studies.”
Recent years have seen growing focus on AI’s reproducibility crisis. Notable work in this regard includes the efforts of Joelle Pineau, machine learning scientist at Montreal’s McGill University and Facebook AI, who has been pushing for transparency and reproducibility of machine learning research at conferences such as NeurIPS.
“Better reproducibility means it’s much easier to build on a paper. Often, the review process is short and limited, and the true impact of a paper is something we see much later. The paper lives on, and as a community we have a chance to build on the work, examine the code, have a critical eye to what are the contributions,” Pineau told Nature in an interview in 2019.
At NeurIPS, Pineau has helped develop standards and processes that can help researchers and reviewers evaluate the reproducibility of machine learning papers. Her efforts have resulted in an increase in code and data submission at NeurIPS.
Another interesting project is Papers With Code (from which Papers Without Code gets its name), a website that provides implementations for scientific research papers published and presented at different venues. Papers With Code currently hosts implementations of more than 40,000 machine learning research papers.
“PapersWithCode plays an important role in highlighting papers that are reproducible. However, it does not address the problem of unreproducible papers,” ContributionSecure14 said.
When a machine learning research paper doesn’t include the implementation code, other researchers who read it must try to implement it by themselves, a non-trivial process that can take several weeks and ultimately result in failure.
“If they fail to implement it successfully, they might reach out to the authors (who may not respond) or simply give up,” ContributionSecure14 said. “This can happen to multiple researchers who are not aware of prior or ongoing attempts to reproduce the paper, resulting in many weeks of productivity wasted collectively.”
Papers Without Code
Papers Without Code includes a submission page, where researchers can submit unreproducible machine learning papers along with the details of their efforts, such as how much time they spent trying to reproduce the results. If a submission is valid, Papers Without Code will contact the paper’s original authors and request clarification or publication of implementation details. If the authors do not reply in a timely fashion, the paper will be added to the list of unreproducible machine learning papers.
“PapersWithoutCode solves the problem of centralizing information about prior or ongoing attempts to reproduce a paper and allows researchers (including the original author) to come together and implement a public implementation,” ContributionSecure14 said. “Once the paper has been successfully reproduced, it can be published on PapersWithCode or GitHub where other researchers can use it. In that sense, I would say the goals of PapersWithoutCode are synergistic with that of PapersWithCode and the ML community at large.”
The hope is that Papers Without Code will help establish a culture that incentivizes reproducibility in machine learning research. So far, the website has received more than 10 requests and one author has already pledged to upload their code.
“I realize that this can be a controversial subject in academia and the top priority is to protect the authors’ reputation while serving the broader ML community,” ContributionSecure14 said.
Papers Without Code can become a hub for creating dialogue between the original authors of machine learning papers and researchers who are trying to reproduce their work.
“Instead of being a static list of unreproducible work, the hope is to create an environment where researchers can collaborate to reproduce a paper,” ContributionSecure14 said.
Reproducible machine learning research
Creating a culture of reproducible machine learning work will require the concerted efforts of all researchers and scientists.
For instance, if you’re working on research that builds on the work done in another paper, you should try out the code or the machine learning model yourself.
“Don’t build off of claims or ‘insights’ that could potentially be unfounded just because the paper says so,” ContributionSecure14 says, adding that this includes papers from large labs or work that has been accepted in a reputable conference.
Another good resource is Professor Pineau’s “Machine Learning Reproducibility Checklist.” The checklist provides clear guidelines on how to make the description, code, and data of a machine learning paper clear and reproducible for other researchers.
ContributionSecure14 believes that machine learning researchers can play a crucial role in promoting a culture of reproducibility.
“There is a lot of pressure to publish at the expense of academic depth and reproducibility and there are not many checks and balances to prevent this behavior,” ContributionSecure14 said. “The only way this will change is if the current and future generation of ML researchers prioritize quality over quantity in their own research.”