Researchers Use GPT-4 To Generate Feedback on Scientific Manuscripts | Stanford HAI
Researchers Use GPT-4 To Generate Feedback on Scientific Manuscripts

Date
October 26, 2023
Topics
Machine Learning

Combining a large language model and open-source peer-reviewed scientific papers, researchers at Stanford built a tool they hope can help other researchers polish and strengthen their drafts.

Scientific research has a peer problem. There simply aren’t enough qualified peer reviewers to review all the studies. This is a particular challenge for young researchers and those at less well-known institutions who often lack access to experienced mentors who can provide timely feedback. Moreover, many scientific studies get “desk rejected” — summarily denied without peer review.

Sensing a growing crisis in an era of increasing scientific study, AI researchers at Stanford University have used the large language model GPT-4 and a dataset of thousands of previously published papers — replete with their reviewer comments — to create a tool that can “pre-review” draft manuscripts.

“Our hope is that researchers can use this pipeline to improve their drafts prior to official submission to conferences and journals,” said James Zou, an assistant professor of biomedical data science at Stanford and a member of the Stanford Institute for Human-Centered AI (HAI). Zou is the senior author of the study, recently posted to the preprint server arXiv.

Numbers Don’t Lie

The researchers began by comparing comments made by a large language model against those of human peer reviewers. Fortunately, Nature, one of the foremost scientific journals, and its fifteen sub-journals (Nature Medicine, etc.) not only publish hundreds of studies a year but also include reviewer comments for some of those papers. And Nature is not alone. The International Conference on Learning Representations (ICLR) does the same with all papers, both accepted and rejected, for its annual machine learning conference.

“Between the two, we curated almost 5,000 peer-reviewed studies and comments to compare with GPT-4’s generated feedback,” Zou says. “The model did surprisingly well.”

The numbers resemble a Venn diagram of overlapping comments. Among the 3,000 or so Nature-family papers in the study, GPT-4 and human comments overlapped by almost 31 percent. For ICLR, the numbers were even higher: almost 40 percent of comments by GPT-4 and humans overlapped. What’s more, when looking only at ICLR’s rejected papers (i.e., less mature papers), the overlap in comments between GPT-4 and humans grew to almost 44 percent, so that nearly half of all GPT-4 and human comments overlapped.

The significance of these numbers comes into sharper focus given that, even among humans, comments from a paper’s multiple reviewers vary considerably. Human-to-human overlap was 28 percent for Nature journals and about 35 percent for ICLR. By these metrics, GPT-4 performed comparably to humans.
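The article does not spell out how overlap was computed, so the following is only a rough sketch of one plausible reading: the share of one reviewer's distinct comments that also appear in another reviewer's comments. It assumes each comment has already been reduced to a short topic label; the function name `overlap_rate` and the sample labels are hypothetical and are not data or code from the study.

```python
def overlap_rate(source_comments, reference_comments):
    """Return the fraction of distinct source comments also found in the reference comments."""
    source = set(source_comments)
    if not source:
        # No comments to match: define the overlap as zero rather than dividing by zero.
        return 0.0
    return len(source & set(reference_comments)) / len(source)


# Hypothetical, hand-labeled comment topics for a single paper (illustrative only).
gpt4 = ["more datasets", "clarify novelty", "ablation study", "related work", "statistical tests"]
human = ["ablation study", "clarify novelty", "typos in section 3", "runtime analysis"]

rate = overlap_rate(gpt4, human)
print(f"{rate:.0%} of GPT-4 comments overlap with the human review")  # prints 40%
```

Under this reading, the same function applied to two human reviewers' comments would give the human-to-human figure, which is the baseline the GPT-4 figure is compared against.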

But while computer-to-human comparisons are instructive, the real test is whether the reviewed paper’s authors valued the comments provided by either review method. Zou’s team conducted a user study where researchers from over 100 institutions submitted their papers, including many preprints, and received GPT-4’s comments. More than half of the participating researchers found GPT-4 feedback “helpful/very helpful” and 82 percent found it “more beneficial” than certain feedback from some human reviewers.

Limits and Horizons

Zou is quick to highlight caveats to the approach in the paper. Notably, GPT-4’s feedback can sometimes be more “generic” and may not pinpoint the deeper technical challenges in the paper. GPT-4 also tends to focus on only limited aspects of scientific feedback (e.g., “add experiments on more datasets”) and comes up short on in-depth insights into the authors’ methods.

Zou was further careful to emphasize that the team is not suggesting that GPT-4 take the “peer” out of peer review and replace human review. Human expert review “is and should continue to be” the basis of rigorous science, he asserts.

“But we believe AI feedback can benefit researchers in early stages of their paper writing, particularly when considering the growing challenges of getting timely expert feedback on drafts,” Zou concludes. “In that light, we think GPT-4 and human feedback complement one another quite well.”

Contributor(s)
Andrew Myers