In 2023, the field of artificial intelligence witnessed a significant transformation: generative AI emerged as the most prominent and impactful story of the year. Driven by remarkable progress in large language models, the technology showed impressive abilities in domains ranging from healthcare and education to creative arts and political discourse. Models matched or surpassed human performance on complex tasks such as answering difficult medical exam questions, generating persuasive political messages, and even choreographing human dance animations to match diverse pieces of music.
However, with this power came concerns about transparency, bias, and the ethical implications of deploying such sophisticated models in real-world applications. As generative AI became woven into the fabric of daily life, policymakers, researchers, and the public grappled with the need for robust regulations and ethical guidelines to navigate the evolving landscape of artificial intelligence, ensuring responsible and accountable use in the years to come.
Both AI’s technical capabilities and growing concerns captured our readers’ attention this year. Here are the top research papers and thought leadership pieces of 2023.
AI Will Transform Teaching and Learning. Let’s Get It Right.
A major conversation of the generative AI year was how the technology would affect teaching and learning. The AI+Education Summit, hosted by the Stanford Accelerator for Learning and Stanford HAI, explored how AI can be effectively employed to enhance human learning. The summit addressed a range of possibilities: personalized support for teachers, changing priorities for learners, learning without fear of judgment through AI interfaces, and improved learning and assessment quality. Alongside potential benefits such as giving teachers feedback, aiding skill assessment, and building learners’ self-confidence, the summit also highlighted significant risks: a lack of cultural diversity in model output, responses not optimized for student learning, the generation of incorrect responses, and the potential to exacerbate a motivation crisis among students.
2023 State of AI in 14 Charts
The sixth annual AI Index arrived in early 2023, just as generative tools were taking off, industry was spinning into an AI arms race, and the long-slumbering policy world was waking up. This year’s AI Index captured much of this shakeup, and this story offered a snapshot of what was happening in AI research, education, policy, hiring, and more.
Human Writer or AI? Scholars Build a Detection Tool
Early in 2023, we were just beginning to understand what tools like ChatGPT were capable of, and many of us already saw the need for oversight and tools to identify machine-generated content. Stanford scholars developed DetectGPT, a tool designed to distinguish between human- and large language model-generated text. DetectGPT, in its early stages, demonstrated impressive accuracy in differentiating between human- and AI-generated text across various models. The tool aims to provide transparency in an era where discerning the source of information has become increasingly important, offering potential applications in education, journalism, and society at large.
AI’s Ostensible Emergent Abilities Are a Mirage
This story ricocheted around academic Twitter (er, X). In recent years, observers have raised concerns about the unpredictable and potentially harmful nature of large language models. New research challenged those concerns, arguing that the perceived emergent abilities of AI models are an artifact of the specific metrics used in evaluation. When smoother, more equitable metrics are employed, the authors found, the evidence for surprising capability jumps disappears. The findings suggest that fears of accidentally stumbling upon unpredictable AI, particularly in the context of artificial general intelligence (AGI), may be unfounded.
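The paper’s core argument can be sketched in a few lines: if a model’s per-token accuracy improves smoothly with scale, an all-or-nothing metric like exact match over a long sequence will still appear to jump discontinuously. The capability curve and sequence length below are invented for illustration and are not taken from the paper.

```python
import math

# Hypothetical smooth capability curve: per-token accuracy rises gently
# with (log) model scale. The functional form is made up for illustration.
def per_token_accuracy(scale):
    return 1 - 0.5 * math.exp(-scale / 4)

# Exact match over a 20-token answer: every token must be correct, so
# small per-token gains compound into what looks like a sudden jump.
def exact_match(scale, seq_len=20):
    return per_token_accuracy(scale) ** seq_len

for scale in range(0, 17, 4):
    p = per_token_accuracy(scale)
    em = exact_match(scale)
    print(f"scale={scale:2d}  per-token={p:.3f}  exact-match={em:.3f}")
```

Under the smooth metric, accuracy inches from 0.5 to 0.99; under exact match, the same model appears to leap from roughly zero to above 0.8, which is the kind of "emergence" the paper attributes to the choice of metric rather than to the model.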
Introducing The Foundation Model Transparency Index
This year every major tech company seemed to release its own foundation model. But how much do we actually know about these tools? Scholars at the Center for Research on Foundation Models developed the Foundation Model Transparency Index (FMTI), which evaluates companies and their models on 100 aspects of transparency, ranging from model construction to downstream use. Of the 10 companies assessed, Meta achieved the highest score at 54 out of 100, indicating substantial room for improvement across the industry.
How Well Do Large Language Models Support Clinician Information Needs?
Could language models offer curbside consultations to aid physicians? Scholars at Stanford Health Care found mixed results. In a new study, they examined GPT-3.5 and GPT-4 responses to clinical questions. Preliminary results indicated that responses were generally safe (91-93% of the time) but agreed with known answers only 21-41% of the time. Harmful responses often stemmed from hallucinated citations. The study underscores the need for refinement, thorough evaluation, and possibly uncertainty estimates before relying on AI language models in healthcare.
Analyzing the European Union AI Act: What Works, What Needs Improvement
The biggest policy news of the year may have been the EU AI Act, a sweeping law positioned to become the world's first comprehensive legal framework for artificial intelligence. While the European Parliament adopted its position in June, negotiations among the European Parliament, the European Council, and the European Commission (the trilogue) are necessary before the policy becomes law. In this conversation, politicians and technologists detailed the major areas of negotiation and what’s at stake for the U.S.
AI-Detectors Biased Against Non-Native English Writers
With the rise of generative AI has come a parallel rise in detectors designed to identify machine-generated content, particularly in education and journalism, where they are used to catch cheating, plagiarism, and misinformation. But this study by Stanford scholars reveals a significant flaw: the detectors are unreliable, especially when assessing content written by non-native English speakers. They tend to misclassify essays by foreign students as AI-generated because of differences in language sophistication. The study raises ethical concerns about potential unfair accusations and penalties for foreign-born students and workers.
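To see how such misclassification can arise, consider a toy detector that flags text as AI-generated when its wording is highly predictable, a rough stand-in for the low-perplexity signal many real detectors rely on. Simpler, more common vocabulary, typical of non-native writing, scores as more predictable and gets misflagged. The word scores and threshold below are invented for illustration; this is not the code of any actual detector.

```python
# Hypothetical predictability scores (higher = more common word).
# These values are made up for illustration only.
COMMON_WORDS = {
    "the": 9, "is": 8, "good": 7, "very": 7, "and": 8,
    "study": 5, "result": 5, "important": 5, "findings": 4,
    "nuanced": 2, "underscore": 2, "salient": 1, "corroborate": 1,
}

def predictability(text, default=3):
    """Mean word-commonness, a crude proxy for low perplexity."""
    words = text.lower().split()
    return sum(COMMON_WORDS.get(w, default) for w in words) / len(words)

def naive_detector(text, threshold=5.5):
    """Flags text as 'AI-generated' when its wording is too predictable."""
    return "AI-generated" if predictability(text) >= threshold else "human"

native = "the nuanced findings underscore salient results"
non_native = "the study is very good and the result is important"
print(naive_detector(native))      # -> human
print(naive_detector(non_native))  # -> AI-generated (misflag)
```

The second sentence is perfectly ordinary human writing, yet its plainer vocabulary trips the threshold, which is the failure mode the study documents in real detectors.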
The Shaky Foundations of Foundation Models in Healthcare
If ChatGPT prescribed you a medication, would you take it? Stanford scholars open with this question as they offer their insights on large language models in healthcare. Their central question is whether the investment in these models is justified given the difficulty of ensuring their factual correctness and robustness. The authors review more than 80 clinical foundation models, including Clinical Language Models (CLaMs) and Foundation Models for Electronic Medical Records (FEMRs). While highlighting benefits such as improved predictive accuracy and a reduced need for labeled data, they point out significant limitations in how these models are evaluated. The authors propose a new evaluation paradigm to better align with clinical value, emphasizing the need for clearer metrics and datasets in healthcare applications of LLMs. They acknowledge the risks, including data privacy concerns and interpretability issues, but remain optimistic about the potential of foundation models to address healthcare challenges.
Was This Written by a Human or AI?
Could you tell the difference between a cover letter written by a human and one written by AI? Hm, probably not. In this study, participants distinguished between human- and AI-generated text with only 50-52% accuracy, no better than chance. The authors suggest solutions such as giving AI a recognizable accent or technical measures like AI watermarking, especially in high-stakes scenarios. The article emphasizes the need for collective efforts to address the growing volume of AI-generated content, which could erode trust among individuals.
ChatGPT Out-scores Medical Students on Complex Clinical Care Exam Questions
We’re just beginning to see how much generative AI will change education. Case in point: A study by Stanford researchers reveals that ChatGPT, particularly the latest version, GPT-4, outperformed first- and second-year medical students on challenging clinical care exam questions, especially those requiring open-ended, free-response answers. The AI system, trained on a vast corpus of internet text, scored more than four points higher than students on the case-report portion of the exam. While the results highlight the potential of AI in medical education and clinical practice, they also suggest the need for a new approach to teaching future doctors. The findings prompt discussions about incorporating AI tools into medical curricula while ensuring that doctors are trained to use AI effectively in modern practice.
AI’s Powers of Political Persuasion
Researchers at Stanford wanted to see if AI-generated arguments could change minds on controversial hot-button issues. In concerning news, it worked. The scholars used GPT-3 to craft messages on topics like an assault weapon ban, carbon tax, and paid parental leave, and they found that AI-generated persuasive appeals were as effective as those written by humans at changing readers’ minds. While the effect sizes were relatively small, the researchers expressed concerns about the potential misuse of AI, particularly in political contexts, raising questions about the need for regulations on AI use in political activities to prevent misinformation and manipulation.
AI-powered EDGE Dance Animator Applies Generative AI to Choreography
A dance revolution? Stanford researchers developed a generative AI model called Editable Dance Generation (EDGE), which can choreograph human dance animation to match any piece of music. The researchers believe EDGE can help choreographers design sequences and communicate their ideas to live dancers by visualizing 3D dance sequences. The program's key capability is editability, allowing animators to intuitively edit specific parts of dance motion, with EDGE auto-completing the body’s movements in a realistic, seamless, and physically plausible manner.
Stanford HAI’s mission is to advance AI research, education, policy and practice to improve the human condition.