Foundation Models | Stanford HAI

How Persuasive Is AI-generated Propaganda?

Josh A. Goldstein, Jason Chao, Shelby Grossman, Alex Stamos, Michael Tomz

Feb 20, 2024

Research

Can large language models, a form of artificial intelligence (AI), generate persuasive propaganda? We conducted a preregistered survey experiment of US respondents to investigate the persuasiveness of news articles written by foreign propagandists compared to content generated by GPT-3 davinci (a large language model). We found that GPT-3 can create highly persuasive text as measured by participants’ agreement with propaganda theses. We further investigated whether a person fluent in English could improve propaganda persuasiveness. Editing the prompt fed to GPT-3 and/or curating GPT-3’s output made GPT-3 even more persuasive, and, under certain conditions, as persuasive as the original propaganda. Our findings suggest that propagandists could use AI to create convincing content with limited effort.

Natural Language Processing

Foundation Models

Generative AI

AI Seeks Out Racist Language in Property Deeds for Termination

Bloomberg Law

Oct 17, 2024

Media Mention

Dan Ho, HAI Senior Fellow and director of the Stanford RegLab, discusses RegLab's AI model that analyzes decades of property records, helping to identify illegal racially restrictive language in housing documents.

Machine Learning

Regulation, Policy, Governance

Foundation Models

Law Enforcement and Justice

DSPy: Compiling Declarative Language Model Calls into State-of-the-Art Pipelines

Omar Khattab, Matei Zaharia, Christopher Potts

Jan 16, 2024

Research

The ML community is rapidly exploring techniques for prompting language models (LMs) and for stacking them into pipelines that solve complex tasks. Unfortunately, existing LM pipelines are typically implemented using hard-coded “prompt templates”, i.e. lengthy strings discovered via trial and error. Toward a more systematic approach for developing and optimizing LM pipelines, we introduce DSPy, a programming model that abstracts LM pipelines as text transformation graphs, or imperative computational graphs where LMs are invoked through declarative modules. DSPy modules are parameterized, meaning they can learn how to apply compositions of prompting, finetuning, augmentation, and reasoning techniques. We design a compiler that will optimize any DSPy pipeline to maximize a given metric, by creating and collecting demonstrations. We conduct two case studies, showing that succinct DSPy programs can express and optimize pipelines that reason about math word problems, tackle multi-hop retrieval, answer complex questions, and control agent loops. Within minutes of compiling, DSPy can automatically produce pipelines that outperform out-of-the-box few-shot prompting as well as expert-created demonstrations for GPT-3.5 and Llama2-13b-chat. On top of that, DSPy programs compiled for relatively small LMs like 770M parameter T5 and Llama2-13b-chat are competitive with many approaches that rely on large and proprietary LMs like GPT-3.5 and on expert-written prompt chains. DSPy is available at https://github.com/stanfordnlp/dspy
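The idea of a compiler that optimizes a parameterized module "by creating and collecting demonstrations" can be illustrated with a small, self-contained sketch. This is not DSPy's actual API: `Predict`, `compile_module`, `toy_lm`, and `exact_match` are hypothetical names invented for illustration, and the "language model" is a stub that adds integers. The sketch only shows the general pattern of running a module on training inputs and keeping the traces that pass a metric as few-shot demonstrations.

```python
from dataclasses import dataclass, field


@dataclass
class Predict:
    """Toy declarative module: a signature plus learnable demonstrations (hypothetical)."""
    signature: str
    demos: list = field(default_factory=list)  # the module's "parameters"

    def build_prompt(self, question: str) -> str:
        # The prompt is assembled from the signature and demos, not hand-written.
        lines = [f"Task: {self.signature}"]
        for demo_q, demo_a in self.demos:
            lines.append(f"Q: {demo_q}\nA: {demo_a}")
        lines.append(f"Q: {question}\nA:")
        return "\n".join(lines)

    def __call__(self, lm, question: str) -> str:
        return lm(self.build_prompt(question))


def toy_lm(prompt: str) -> str:
    """Stand-in for a real LM (assumption): sums the integers in the last question."""
    last_question = prompt.rsplit("Q: ", 1)[1].split("\nA:")[0]
    numbers = [int(tok) for tok in last_question.replace("?", "").split() if tok.isdigit()]
    return str(sum(numbers))


def exact_match(gold: str, prediction: str) -> bool:
    return gold.strip() == prediction.strip()


def compile_module(module, lm, trainset, metric, max_demos=2):
    """Run the module on training inputs and keep passing traces as demonstrations."""
    demos = []
    for question, gold in trainset:
        prediction = module(lm, question)
        if metric(gold, prediction):
            demos.append((question, prediction))
        if len(demos) >= max_demos:
            break
    return Predict(module.signature, demos)


trainset = [
    ("What is 2 plus 3?", "5"),
    ("What is 10 plus 4?", "14"),
    ("What is 1 plus 1?", "3"),  # deliberately mislabeled; never reached once two demos are kept
]
compiled = compile_module(Predict("question -> answer"), toy_lm, trainset, exact_match)
print(compiled.build_prompt("What is 7 plus 8?"))
```

In this toy version the only thing "learned" is the list of demonstrations stored on the module; the abstract describes modules that can also learn finetuning, augmentation, and reasoning strategies, which the sketch does not attempt.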

Foundation Models

Natural Language Processing

Machine Learning

The 12 Greatest Dangers Of AI

Forbes

Oct 09, 2024

Media Mention

AI expert Gary Marcus references HAI's study showing that LLM responses to medical questions vary widely and are often inaccurate.

Natural Language Processing

Foundation Models

Generative AI

Evaluating Human and Machine Understanding of Data Visualizations

Arnav Verma, Kushin Mukherjee, Christopher Potts, Elisa Kreiss, Judith Fan

Jan 01, 2024

Research

Although data visualizations are a relatively recent invention, most people are expected to know how to read them. How do current machine learning systems compare with people when performing tasks involving data visualizations? Prior work evaluating machine data visualization understanding has relied upon weak benchmarks that do not resemble the tests used to assess these abilities in humans. We evaluated several state-of-the-art algorithms on data visualization literacy assessments designed for humans, and compared their responses to multiple cohorts of human participants with varying levels of experience with high school-level math. We found that these models systematically underperform all human cohorts and are highly sensitive to small changes in how they are prompted. Among the models we tested, GPT-4V most closely approximates human error patterns, but gaps remain between all models and humans. Our findings highlight the need for stronger benchmarks for data visualization understanding to advance artificial systems towards human-like reasoning about data visualizations.

Foundation Models

WikiChat: Stopping the Hallucination of Large Language Model Chatbots by Few-Shot Grounding on Wikipedia

Sina Semnani, Violet Yao, Monica Lam, Heidi Zhang

Dec 01, 2023

Research

This paper presents the first few-shot LLM-based chatbot that almost never hallucinates and has high conversationality and low latency. WikiChat is grounded on the English Wikipedia, the largest curated free-text corpus. WikiChat generates a response from an LLM, retains only the grounded facts, and combines them with additional information it retrieves from the corpus to form factual and engaging responses. We distill WikiChat based on GPT-4 into a 7B-parameter LLaMA model with minimal loss of quality, to significantly improve its latency, cost and privacy, and facilitate research and deployment. Using a novel hybrid human-and-LLM evaluation methodology, we show that our best system achieves 97.3% factual accuracy in simulated conversations. It significantly outperforms all retrieval-based and LLM-based baselines, and by 3.9%, 38.6% and 51.0% on head, tail and recent knowledge compared to GPT-4. Compared to previous state-of-the-art retrieval-based chatbots, WikiChat is also significantly more informative and engaging, just like an LLM. WikiChat achieves 97.9% factual accuracy in conversations with human users about recent topics, 55.0% better than GPT-4, while receiving significantly higher user ratings and more favorable comments.
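The three steps named in the abstract (generate a response, retain only the grounded facts, combine them with retrieved information) can be sketched as a toy pipeline. This is a rough illustration under strong assumptions, not WikiChat's implementation: the corpus is three hard-coded sentences, `toy_llm` returns a fixed draft containing one wrong claim, and grounding is approximated by a crude check that every word of a claim appears in a single passage. All function names are hypothetical.

```python
import re

# Tiny stand-in corpus (assumption); the real system is grounded on English Wikipedia.
CORPUS = [
    "Stanford University is located in California.",
    "Stanford University was founded in 1885.",
    "Wikipedia is a free online encyclopedia.",
]


def tokens(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, corpus: list, k: int = 2) -> list:
    """Rank passages by word overlap with the query and return the top k that overlap."""
    ranked = sorted(corpus, key=lambda p: len(tokens(query) & tokens(p)), reverse=True)
    return [p for p in ranked[:k] if tokens(query) & tokens(p)]


def is_grounded(claim: str, corpus: list) -> bool:
    """Crude grounding check: every word of the claim occurs in one passage."""
    return any(tokens(claim) <= tokens(passage) for passage in corpus)


def toy_llm(query: str) -> str:
    """Stand-in for an LLM draft (assumption); the second sentence is deliberately wrong."""
    return ("Stanford University is located in California. "
            "Stanford University was founded in 1700.")


def respond(query: str, corpus: list, llm) -> str:
    draft = llm(query)                                        # 1. generate a response
    claims = [s.strip() for s in re.split(r"(?<=\.)\s+", draft) if s.strip()]
    kept = [c for c in claims if is_grounded(c, corpus)]      # 2. retain grounded facts
    evidence = retrieve(query, corpus)                        # 3. retrieve from the corpus
    merged = list(dict.fromkeys(kept + evidence))             # combine, dropping duplicates
    return " ".join(merged)


answer = respond("Where is Stanford University?", CORPUS, toy_llm)
print(answer)
```

Here the unsupported "1700" claim is dropped and the retrieved passage supplies the founding year instead. Word-overlap matching is only a placeholder for real claim verification and retrieval.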

Natural Language Processing

Foundation Models

Machine Learning

Generative AI
