Training Data is the collection of examples—such as text, images, audio, or other information—used to teach machine learning models how to perform specific tasks. The model learns by analyzing patterns, relationships, and features within this data, adjusting its internal parameters to make accurate predictions or decisions. The quality, quantity, and diversity of training data largely determine how well an AI system will perform, making it one of the most critical components of machine learning.
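The idea that a model "adjusts its internal parameters" to fit patterns in training data can be made concrete with a minimal sketch. The toy dataset, model (a single line y = wx + b), and learning rate below are illustrative assumptions, not drawn from any particular system:

```python
# Toy training data: (input, target) pairs following the pattern y = 2x + 1.
training_data = [(x, 2 * x + 1) for x in range(10)]

# Model parameters, initialized arbitrarily; "learning" means adjusting these.
w, b = 0.0, 0.0
learning_rate = 0.01

for epoch in range(1000):
    for x, y in training_data:
        pred = w * x + b   # the model's current prediction
        error = pred - y   # how far the prediction is from the example's target
        # Gradient-descent update: nudge each parameter to shrink the error.
        w -= learning_rate * error * x
        b -= learning_rate * error

print(round(w, 2), round(b, 2))  # parameters approach the pattern's true 2 and 1
```

Because this toy data is clean and plentiful for so simple a model, the parameters recover the underlying pattern; with noisy, scarce, or unrepresentative examples, the same procedure would learn a distorted one, which is why data quality, quantity, and diversity matter so much.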
Using “Pile of Law,” a dataset of legal materials, Stanford researchers explore filtering private or toxic content from training data for foundation models.

This brief highlights the lack of geographic representation in medical-imaging AI training data and calls for nationwide, diversity-focused data-sharing initiatives.

Computer scientists must identify sources of bias, de-bias training data and develop artificial-intelligence algorithms that are robust to skews in the data, argue James Zou and Londa Schiebinger in Nature.

In risk modeling, AI researchers take a more-is-better approach to training data, but a new study argues that a less-is-more approach may be preferable.

A new study reveals that AI models are not reported thoroughly enough, leaving users blind to potential sources of model error such as flawed training data and calibration drift.

The lessons learned from the fine-tuning and evaluation of Vietnamese LLMs could help broaden access to models beyond English speakers.

This brief introduces a quantitative framework that allows policymakers to evaluate the behavior of language models to assess what kinds of opinions they reflect.