Training Data is the collection of examples—such as text, images, audio, or other information—used to teach machine learning models how to perform specific tasks. The model learns by analyzing patterns, relationships, and features within this data, adjusting its internal parameters to make accurate predictions or decisions. The quality, quantity, and diversity of training data largely determine how well an AI system will perform, making it one of the most critical components of machine learning.
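The idea that a model "adjusts its internal parameters" to fit patterns in training data can be made concrete with a minimal sketch. The toy dataset, model (a single line y = wx + b), and learning rate below are illustrative assumptions, not drawn from any particular system:

```python
# Toy training data: (input, target) pairs following the pattern y = 2x + 1.
training_data = [(x, 2 * x + 1) for x in range(10)]

# Model parameters, initialized arbitrarily; "learning" means adjusting these.
w, b = 0.0, 0.0
learning_rate = 0.01

for epoch in range(1000):
    for x, y in training_data:
        pred = w * x + b   # the model's current prediction
        error = pred - y   # how far the prediction is from the example's target
        # Gradient-descent update: nudge each parameter to shrink the error.
        w -= learning_rate * error * x
        b -= learning_rate * error

print(round(w, 2), round(b, 2))  # parameters approach the pattern's true 2 and 1
```

Because this toy data is clean and plentiful for so simple a model, the parameters recover the underlying pattern; with noisy, scarce, or unrepresentative examples, the same procedure would learn a distorted one, which is why data quality, quantity, and diversity matter so much.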
Using “Pile of Law,” a dataset of legal materials, Stanford researchers explore filtering private or toxic content from training data for foundation models.

This brief highlights the lack of geographic representation in medical-imaging AI training data and calls for nationwide, diversity-focused data-sharing initiatives.

Computer scientists must identify sources of bias, de-bias training data and develop artificial-intelligence algorithms that are robust to skews in the data, argue James Zou and Londa Schiebinger in Nature.

In risk modeling, AI researchers take a more-is-better approach to training data, but a new study argues that a less-is-more approach may be preferable.

A new study reveals that AI models are not reported thoroughly enough, leaving users blind to potential sources of model error such as flawed training data and calibration drift.

The lessons learned from the fine-tuning and evaluation of Vietnamese LLMs could help broaden access to models beyond English speakers.

This brief introduces a quantitative framework that allows policymakers to evaluate the behavior of language models to assess what kinds of opinions they reflect.