This article is part of our reviews of AI research papers, a series of posts that explore the latest findings in artificial intelligence.
With the past decade seeing renewed interest in artificial intelligence, research and publication in the field have grown immensely. And while publishing an AI paper online is not very difficult, it is acceptance and presentation at conferences such as NeurIPS, ICLR, CVPR, and ICML that give researchers' work the credit and exposure it needs.
Organizers of AI conferences face the mounting challenge of receiving many more submissions than they have space for. As a result, only a small percentage of submitted papers make it into mainstream AI conferences.
The question is, are AI conferences choosing the best research for presentation? A new paper published on OpenReview and submitted to ICLR 2021 investigates the quality of the AI conference review process. Titled “An Open Review of OpenReview,” the document reveals some of the flaws of the review process for machine learning conferences, including inconsistent scoring, institutional bias, and lower acceptance rates for female researchers.
The AI conference review process
The paper focuses on the ICLR review process, though other conferences use similar methodologies. Each AI paper submitted to the conference goes through several steps before being accepted or rejected. The process involves reviewers, area chairs, and program chairs.
Area chairs are experts with experience in specific domains, such as computer vision and natural language processing. They do not review individual papers but oversee the process. They guide reviewers, moderate discussions, and make recommendations based on the feedback they get from authors and reviewers.
Reviewers are the people who work on individual papers. Each paper is assigned to several reviewers, who read it in full and verify the code and data that come with it to make sure the findings are valid and reproducible. They correspond with the authors and the area chairs to clarify open questions, and finally, they give their recommendations on whether a paper should be accepted or rejected.
Program chairs are senior scientists and experts who make high-level decisions, including the final decision on which papers are accepted or rejected. They can also intervene in the review process if needed.
Experts can apply to become reviewers and area chairs at ICLR, but most are invited by the organizers.
The entire process is registered on the OpenReview platform, where everyone can see the feedback given by reviewers, authors, area chairs, and program chairs. OpenReview also accepts public comments from people who are not involved in the review process.
Investigating the AI research paper review process
The authors of the paper, who remain anonymous as of this writing, were keen on answering the question: Do conference reviews consistently identify high-quality work? Or has review degenerated into censorship?
To answer these questions, they used public information registered on OpenReview, including titles, abstracts, author lists, emails, scores, and reviews for more than 5,000 papers submitted to ICLR conferences from 2017 to 2020. The authors obtained additional information about withdrawn papers after communicating with the administrators of OpenReview.
The researchers used different methods to find additional information, such as the impact of papers and authors on other AI research work. They also analyzed the paper data to determine the topic of each paper.
Is the review process reproducible?
The first question the researchers tackled was how much randomness is involved in the acceptance or rejection of AI research papers. In other words, if an accepted paper went through the review process again, would it be accepted a second time?
To perform this test, the researchers created a regression model from paper scores and the final decisions made by the area chairs. Then they simulated the process by generating scores that fit the model and checked how many times it reproduced the same decision on the paper. They ran the simulation separately for the papers submitted each year, and the results were disappointing.
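The paper's exact regression model isn't reproduced here, but the idea behind the simulation can be sketched in a few lines of Python. In this illustrative sketch, each paper's decision is reduced to a simple threshold on its mean reviewer score, and we resample noisy scores to see how often the same accept/reject decision comes back. The function name, the threshold rule, and the noise level are all my own assumptions, not the authors' actual model.

```python
import random
import statistics

def simulate_reproducibility(papers, threshold, noise_sd=1.0, trials=1000, seed=0):
    """Estimate how often a noisy re-review reproduces each paper's decision.

    `papers` is a list of (reviewer_scores, accepted) pairs. The decision
    rule here is a plain threshold on the mean resampled score -- a crude
    stand-in for the paper's fitted regression model.
    """
    rng = random.Random(seed)
    agree = total = 0
    for scores, accepted in papers:
        mean = statistics.mean(scores)
        for _ in range(trials):
            # Draw a fresh simulated "review" per reviewer with Gaussian noise
            resampled = [rng.gauss(mean, noise_sd) for _ in scores]
            decision = statistics.mean(resampled) >= threshold
            agree += (decision == accepted)
            total += 1
    return agree / total
```

Under this toy rule, papers whose mean score sits far from the threshold reproduce almost every time, while borderline papers flip roughly half the time, which is the intuition behind the low reproducibility figures the authors report.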
“We observe a downward trend in reproducibility, with scores decreasing from 75% in 2017, to 70% in 2018 and 2019, to 66% in 2020,” the authors write.
Even when they increased the number of reviewers in their simulation, the gains were not significant. “While reproducibility scores increase with more reviewers, gains are marginal; increasing the number of reviewers from 2 to 5 boosts reproducibility by just 3%,” the authors write. “As more reviewers are added, the high level of disagreement among area chairs remains constant, while the standard error in mean scores falls slowly. The result is paltry gains in reproducibility.”
To improve the process, the authors recommend using a small number of reviewers and adding ad-hoc reviewers where first-round reviews are uninformative.
What is the impact of accepted AI research?
Papers accepted at major AI conferences should help advance future artificial intelligence research. In their work, the authors of the OpenReview paper analyzed the impact of both accepted and rejected AI research papers. “We measure impact using citation rate, calculated by dividing citation count by the number of days since the paper was first published online,” they write.
In other words, impactful papers should receive more citations from other AI researchers. Naturally, papers accepted and presented at AI conferences gain more exposure, and by extension more citations, than rejected work. But when comparing accepted papers among themselves, the researchers found only a very small correlation between citation rates and review scores.
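The citation-rate metric the authors describe is simple to express in code. The sketch below computes citations per day and a plain Pearson correlation between review scores and citation rates; the function names are my own, and the choice of Pearson correlation is an assumption, since the article does not specify which statistic the authors used.

```python
from datetime import date
from math import sqrt

def citation_rate(citation_count, first_online, as_of):
    """Citations per day since the paper first appeared online,
    as defined in the quoted passage above."""
    return citation_count / (as_of - first_online).days

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences,
    e.g. mean review scores vs. citation rates of accepted papers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

# Example: 100 citations over the 100 days since Jan 1, 2020
rate = citation_rate(100, date(2020, 1, 1), date(2020, 4, 10))
```

A correlation near zero between the two lists would mirror the authors' finding that high review scores barely predict later impact.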
“Our dataset shows that reproducibility scores, correlations with impact, and reviewer agreement have all gone down over the years,” the authors write.
Institution and reputation bias in the AI review process
The authors of the paper were also interested in knowing whether there was a preference for AI research done at prestigious academic institutions or tech companies. “We found that 85% of papers across all years (87% in 2020) had at least one academic author,” the authors write, adding that this finding per se does not imply bias and can fairly be attributed to the quality of research done at these institutions.
To further investigate, they “controlled for paper score,” which means they verified whether area chairs showed any bias when deciding on papers that had similar scores.
“We found that, even after controlling for reviewer scores, being a top ten institution leads to a boost in the likelihood of getting accepted,” the authors write. This bias remained even when the researchers only considered the affiliation of the last author of the AI papers (the last author is usually the senior person who oversees the research or heads the lab where it takes place).
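One simple way to "control for paper score" is to stratify papers by their (rounded) mean reviewer score and compare acceptance rates of top-ten-institution papers against the rest within each stratum. The sketch below does exactly that; it is a crude illustrative stand-in, not the regression-based control the authors actually used, and the function name and binning rule are assumptions.

```python
from collections import defaultdict

def acceptance_gap_by_score(papers):
    """Within each rounded-score stratum, compare the acceptance rate of
    top-10-institution papers against all other papers.

    `papers` is a list of (mean_score, is_top10, accepted) tuples.
    Returns {rounded_score: rate_top10 - rate_others} for strata that
    contain papers from both groups.
    """
    strata = defaultdict(lambda: {True: [0, 0], False: [0, 0]})
    for score, top10, accepted in papers:
        bucket = strata[round(score)][top10]
        bucket[0] += accepted   # accepted papers in this group
        bucket[1] += 1          # total papers in this group
    gaps = {}
    for score, groups in strata.items():
        if groups[True][1] and groups[False][1]:
            rate_top = groups[True][0] / groups[True][1]
            rate_rest = groups[False][0] / groups[False][1]
            gaps[score] = rate_top - rate_rest
    return gaps
```

A consistently positive gap across strata would correspond to the boost the authors report: papers with the same reviewer scores getting accepted more often when they come from a top-ten institution.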
There was a significant preference for Carnegie Mellon, MIT, and Cornell universities, the authors found. Another point that supports this finding is that a paper's review scores improve when its authors' identities are visible. The authors of an AI paper submitted to OpenReview remain anonymous to reviewers while it is going through the review process. But at the same time, the researchers can publish their paper in other venues, such as the arXiv preprint server. The researchers found that AI papers appearing on arXiv before the review process tended to do better, especially if the authors were associated with CMU, MIT, or Cornell.
The authors of the OpenReview analysis also found that, all things equal, area chairs generally preferred submissions authored by highly reputed individuals.
Interestingly, their research did not find a significant bias toward large tech companies such as Google, Facebook, and Microsoft, which house reputable AI researchers. At first glance, this is a positive finding, because big tech already has a vast influence over commercial AI and, by extension, on AI research.
But as other authors have pointed out, the same academic institutions that are very well represented at AI conferences serve as talent pools for big tech companies and receive much of their funding from those same organizations. So this just creates a feedback loop of a narrow group of people promoting each other’s work and hiring each other at the expense of others.
Also concerning is the under-representation of women in AI conferences. The gender gap is a well-known problem in the AI community (and the tech community in general), and it carries over to the paper review process. “We observe a gender gap in the review process, with women first authors achieving a lower acceptance rate than men (23.5% vs 27.1%),” the authors of the OpenReview analysis observe in their paper.
The researchers add that while in 2019, women made up 23.2 percent of all computer science PhD students, only 10.6 percent of publications at ICLR 2020 had a female first author (the first author is the lead contributor of the AI research). Meanwhile, women constitute 22.6 percent of computer science faculty in the U.S. but only 9.7 percent of senior authors at ICLR.
Why is this important?
“When the number of papers with merit is greater than the number of papers that will be accepted, it is inevitable that decisions become highly subjective and that biases towards certain kinds of papers, topics, institution, and authors become endemic,” the authors conclude in their paper.
As artificial intelligence continues to gain prominence in different areas of life, it is important that it becomes inclusive and represents all different demographics. AI conferences serve as hubs to draw attention to influential work that can have a great impact on future applications and research. When the review process favors one group of people, the effects can ripple to other areas that benefit from AI research, including health care, finance, hiring, and the justice system.
AI conference organizers should certainly reconsider the review process and take measures to ensure all researchers get a fair shot at having their work presented at prestigious venues. The work also has a message for the rest of us, the people who follow and cover AI research. We should all be aware that while AI conferences are a good indicator of where the community is headed, there is plenty of good research that never finds its way into the spotlight. I've had great experience unearthing some of these gems while perusing the machine learning subreddit and the AI and deep learning Facebook group, and by creating my own list of AI researchers and computer scientists to follow on Twitter. I'm proud to have been able to give some of these researchers the exposure their work deserves. I'm sure others can do even better than me.