
Researchers Use GPT-4 To Generate Feedback on Scientific Manuscripts

Combining a large language model and open-source peer-reviewed scientific papers, researchers at Stanford built a tool they hope can help other researchers polish and strengthen their drafts.


Scientific research has a peer problem. There simply aren’t enough qualified peer reviewers to review all the studies. This is a particular challenge for young researchers and those at less well-known institutions who often lack access to experienced mentors who can provide timely feedback. Moreover, many scientific studies get “desk rejected” — summarily denied without peer review.

Sensing a growing crisis in an era of increasing scientific study, AI researchers at Stanford University have used the large language model GPT-4 and a dataset of thousands of previously published papers — replete with their reviewer comments — to create a tool that can “pre-review” draft manuscripts.

“Our hope is that researchers can use this pipeline to improve their drafts prior to official submission to conferences and journals,” said James Zou, an assistant professor of biomedical data science at Stanford and a member of the Stanford Institute for Human-Centered AI (HAI). Zou is the senior author of the study, recently published on preprint service arXiv.

Numbers Don’t Lie

The researchers began by comparing comments made by a large language model against those of human peer reviewers. Fortunately, Nature, one of the foremost scientific journals, and its 15 sub-journals (Nature Medicine, etc.) not only publish hundreds of studies a year but also include reviewer comments for some of those papers. And Nature is not alone. The International Conference on Learning Representations (ICLR) does the same with all papers, both accepted and rejected, for its annual machine learning conference.

“Between the two, we curated almost 5,000 peer-reviewed studies and comments to compare with GPT-4’s generated feedback,” Zou says. “The model did surprisingly well.”

The numbers resemble a Venn diagram of overlapping comments. Among the 3,000 or so Nature-family papers in the study, the overlap between GPT-4 and human comments was almost 31 percent. For ICLR, the numbers were even higher: almost 40 percent of comments by GPT-4 and humans overlapped. What’s more, when looking only at ICLR’s rejected papers (i.e., less mature papers), the overlap in comments between GPT-4 and humans grew to almost 44 percent, meaning nearly half of all GPT-4 and human comments overlapped.

The significance of these numbers comes into sharper focus given that, even among humans, comments from a paper’s multiple reviewers vary considerably. Human-to-human overlap was 28 percent for Nature journals and about 35 percent for ICLR. By these metrics, GPT-4 performed comparably to humans.
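
As a rough illustration of how an overlap figure like those above can be computed, here is a minimal Python sketch. It assumes each reviewer's comments have already been reduced to short topic labels and that two comments "overlap" when their labels match exactly. The function name, the labels, and the exact-match rule are all hypothetical simplifications for illustration, not the matching procedure the Stanford team used.

```python
def overlap_rate(comments_a, comments_b):
    """Share of the distinct points in comments_a that also appear in comments_b.

    Assumption: comments are pre-labeled topics, and overlap means an
    exact label match. Real feedback would need a fuzzier comparison.
    """
    a = set(comments_a)
    b = set(comments_b)
    if not a:
        return 0.0
    return len(a & b) / len(a)


# Hypothetical topic labels for one paper (invented for this example).
gpt4_points = ["more datasets", "ablation study", "clarify notation", "related work"]
human_points = ["ablation study", "statistical tests", "clarify notation"]

rate = overlap_rate(gpt4_points, human_points)
print(f"{rate:.0%} of the model's points were also raised by the human reviewer")
```

The same function applied to two human reviewers' labels would give a human-to-human baseline, which is the kind of comparison the 28 and 35 percent figures represent.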

But while computer-to-human comparisons are instructive, the real test is whether the reviewed paper’s authors valued the comments provided by either review method. Zou’s team conducted a user study where researchers from over 100 institutions submitted their papers, including many preprints, and received GPT-4’s comments. More than half of the participating researchers found GPT-4 feedback “helpful/very helpful” and 82 percent found it “more beneficial” than certain feedback from some human reviewers.

Limits and Horizons

There are caveats to the approach, as Zou is quick to highlight in the paper. Notably, GPT-4’s feedback can sometimes be more “generic” and may not pinpoint the deeper technical challenges in the paper. GPT-4 also tends to focus on only limited aspects of scientific feedback (e.g., “add experiments on more datasets”) and comes up short on in-depth insights into the authors’ methods.

Zou was further careful to emphasize that the team is not suggesting that GPT-4 take the “peer” out of peer review and replace human review. Human expert review “is and should continue to be” the basis of rigorous science, he asserts.

“But we believe AI feedback can benefit researchers in early stages of their paper writing, particularly when considering the growing challenges of getting timely expert feedback on drafts,” Zou concludes. “In that light, we think GPT-4 and human feedback complement one another quite well.”
