Researchers Use GPT-4 To Generate Feedback on Scientific Manuscripts

Date
October 26, 2023
Topics
Machine Learning

Combining a large language model with openly available peer-reviewed scientific papers, researchers at Stanford built a tool they hope can help other researchers polish and strengthen their drafts.

Scientific research has a peer problem. There simply aren’t enough qualified peer reviewers to review all the studies. This is a particular challenge for young researchers and those at less well-known institutions who often lack access to experienced mentors who can provide timely feedback. Moreover, many scientific studies get “desk rejected” — summarily denied without peer review.

Sensing a growing crisis in an era of increasing scientific study, AI researchers at Stanford University have used the large language model GPT-4 and a dataset of thousands of previously published papers — replete with their reviewer comments — to create a tool that can “pre-review” draft manuscripts.

“Our hope is that researchers can use this pipeline to improve their drafts prior to official submission to conferences and journals,” said James Zou, an assistant professor of biomedical data science at Stanford and a member of the Stanford Institute for Human-Centered AI (HAI). Zou is the senior author of the study, recently published on the preprint service arXiv.
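
To make the idea concrete, here is a minimal sketch of such a “pre-review” pipeline in Python using the OpenAI API. It is illustrative only, not the authors’ actual implementation: the prompt wording, the truncation cutoff, and the file name are assumptions.

```python
# Minimal sketch of an LLM "pre-review" pipeline (illustrative only; not the
# study authors' actual code). Assumes the `openai` Python package is
# installed and OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()

# Reviewer-style prompt; the wording here is an assumption, loosely modeled
# on the structured feedback categories described in the article.
REVIEW_PROMPT = (
    "You are an experienced peer reviewer. Read the draft below and give "
    "structured feedback: (1) significance and novelty, (2) potential "
    "reasons for acceptance, (3) potential reasons for rejection, and "
    "(4) concrete suggestions for improvement."
)

def pre_review(manuscript_text: str, model: str = "gpt-4") -> str:
    """Return reviewer-style feedback on a draft manuscript."""
    # Truncate very long drafts so the request fits in the context window;
    # the cutoff is an arbitrary placeholder.
    draft = manuscript_text[:30000]
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": REVIEW_PROMPT},
            {"role": "user", "content": draft},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    with open("draft.txt") as f:  # hypothetical manuscript file
        print(pre_review(f.read()))
```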

Numbers Don’t Lie

The researchers began by comparing comments made by a large language model against those of human peer reviewers. Fortunately, Nature, one of the foremost scientific journals, and its fifteen sub-journals (Nature Medicine, among others) not only publish hundreds of studies a year but also include reviewer comments for some of those papers. And Nature is not alone: the International Conference on Learning Representations (ICLR) does the same with all papers, both accepted and rejected, for its annual machine learning conference.

“Between the two, we curated almost 5,000 peer-reviewed studies and comments to compare with GPT-4’s generated feedback,” Zou says. “The model did surprisingly well.”

The numbers resemble a Venn diagram of overlapping comments. Among the roughly 3,000 Nature-family papers in the study, almost 31 percent of GPT-4’s comments overlapped with points raised by human reviewers. For ICLR, the overlap was even higher: almost 40 percent. What’s more, when looking only at ICLR’s rejected papers (i.e., less mature papers), the overlap between GPT-4 and human comments grew to almost 44 percent.

The significance of these numbers comes into sharper focus given that even human reviewers of the same paper vary considerably in the comments they make. Human-to-human overlap was 28 percent for Nature journals and about 35 percent for ICLR. By these metrics, GPT-4 performed comparably to humans.
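
These overlap figures are essentially hit rates: the fraction of one reviewer’s comments that are also raised by another reviewer (or by GPT-4). Here is a toy sketch of that calculation; it assumes comments have already been extracted and matched, and stubs out the semantic-matching step (which the study performs with an LLM) with exact string matching.

```python
# Toy overlap ("hit rate") calculation between two sets of review comments.
# Semantic matching is stubbed out with exact string matching, so this is a
# sketch of the metric, not the study's actual matching pipeline.

def overlap_rate(comments_a: list[str], comments_b: list[str]) -> float:
    """Fraction of comments in A that also appear in B."""
    pool = set(comments_b)
    matched = sum(1 for c in comments_a if c in pool)
    return matched / len(comments_a) if comments_a else 0.0

# Hypothetical example comments.
gpt4_comments = ["add experiments on more datasets",
                 "clarify notation in section 3"]
human_comments = ["add experiments on more datasets",
                  "discuss limitations"]

print(f"GPT-4 vs. human overlap: {overlap_rate(gpt4_comments, human_comments):.0%}")
# -> GPT-4 vs. human overlap: 50%
```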

But while computer-to-human comparisons are instructive, the real test is whether the reviewed paper’s authors valued the comments provided by either review method. Zou’s team conducted a user study in which researchers from over 100 institutions submitted their papers, including many preprints, and received GPT-4’s comments. More than half of the participating researchers found GPT-4’s feedback “helpful/very helpful,” and 82 percent found it more beneficial than feedback from at least some human reviewers.

Limits and Horizons

There are caveats to the approach, as Zou is quick to highlight in the paper. Notably, GPT-4’s feedback can sometimes be generic and may not pinpoint the deeper technical challenges in a paper. GPT-4 also tends to focus on a limited range of scientific feedback (e.g., “add experiments on more datasets”) and comes up short on in-depth insights into the authors’ methods.

Zou was further careful to emphasize that the team is not suggesting that GPT-4 take the “peer” out of peer review and replace human review. Human expert review “is and should continue to be” the basis of rigorous science, he asserts.

“But we believe AI feedback can benefit researchers in early stages of their paper writing, particularly when considering the growing challenges of getting timely expert feedback on drafts,” Zou concludes. “In that light, we think GPT-4 and human feedback complement one another quite well.”

Stanford HAI’s mission is to advance AI research, education, policy and practice to improve the human condition.

Contributor(s)
Andrew Myers
