What is Synthetic Data? | Stanford HAI
What is Synthetic Data?

Synthetic data is artificially generated information created by algorithms or simulations rather than collected from real-world events or observations. It is used to train AI models when real data is scarce, expensive, privacy-sensitive, or difficult to obtain, while mimicking the statistical properties and patterns of authentic data. Synthetic data is particularly valuable for filling data gaps, testing edge cases, and protecting privacy in fields such as healthcare, autonomous driving, and financial modeling. Critics note that synthetic data may introduce biases, fail to capture real-world complexity and edge cases, or lead to "model collapse" when AI systems are trained predominantly on AI-generated content.
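One way to make "mimicking the statistical properties of authentic data" concrete is the simplest possible generator: fit summary statistics to a real table and sample new rows from the fitted distribution. The sketch below (using NumPy, with invented stand-in data) fits a mean and covariance to a toy two-column dataset and draws synthetic rows that share its statistical structure without copying any real record; it is an illustration of the idea, not a production method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a small "real" dataset: 500 rows of two correlated columns
# (e.g., age and blood pressure in arbitrary units). Purely illustrative.
real = rng.multivariate_normal(
    [50.0, 120.0], [[100.0, 40.0], [40.0, 90.0]], size=500
)

# Fit the statistical properties of the real data...
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# ...and sample new, artificial rows with the same mean/covariance structure.
synthetic = rng.multivariate_normal(mean, cov, size=500)

# The synthetic rows are not copies of any real record, but their summary
# statistics closely track the originals.
print(np.round(mean, 1))
print(np.round(synthetic.mean(axis=0), 1))
```

Real generators (GANs, diffusion models, copulas) capture far richer structure than a single Gaussian, but the fit-then-sample loop is the same.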


Synthetic Data mentioned at Stanford HAI

Explore Similar Terms:

Data Augmentation | Training Data | GANs (Generative Adversarial Networks)

See Full List of Terms & Definitions

Conditional Generative Models for Synthetic Tabular Data: Applications for Precision Medicine and Diverse Representations
Kara Liu, Russ Altman
Deep Dive | Jan 14
Research

Tabular medical datasets, like electronic health records (EHRs), biobanks, and structured clinical trial data, are rich sources of information with the potential to advance precision medicine and optimize patient care. However, real-world medical datasets have limited patient diversity and cannot simulate hypothetical outcomes, both of which are necessary for equitable and effective medical research. Fueled by recent advancements in machine learning, generative models offer a promising solution to these data limitations by generating enhanced synthetic data. This review highlights the potential of conditional generative models (CGMs) to create patient-specific synthetic data for a variety of precision medicine applications. We survey CGM approaches that tackle two medical applications: correcting for data representation biases and simulating digital health twins. We additionally explore how the surveyed methods handle modeling tabular medical data and briefly discuss evaluation criteria. Finally, we summarize the technical, medical, and ethical challenges that must be addressed before CGMs can be effectively and safely deployed in the medical field.
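The review's central idea, generating patient-specific rows conditioned on attributes such as a diagnosis, can be sketched in miniature. The toy example below (invented data, not the paper's method or models) learns per-condition statistics from a labeled table and samples new values for a requested condition group, e.g., to oversample an underrepresented patient population.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "EHR" table: one numeric feature per row plus a binary condition label
# (e.g., 0 = control, 1 = diagnosed). All values are illustrative.
labels = rng.integers(0, 2, size=1000)
values = np.where(
    labels == 1,
    rng.normal(135.0, 8.0, 1000),
    rng.normal(118.0, 8.0, 1000),
)

# "Conditional" generation in miniature: learn per-condition statistics...
stats = {c: (values[labels == c].mean(), values[labels == c].std())
         for c in (0, 1)}

def sample_conditioned(condition, n):
    """Draw synthetic values for the requested condition group."""
    mu, sigma = stats[condition]
    return rng.normal(mu, sigma, size=n)

# ...then oversample the group of interest to balance the dataset.
extra = sample_conditioned(1, 500)
```

Actual conditional generative models (conditional GANs, VAEs, diffusion models) replace the per-group Gaussian with a learned network, but the conditioning mechanism, sampling given a label, is the same in spirit.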


Healthcare
Regulation, Policy, Governance
Representation Learning with Statistical Independence to Mitigate Bias
Ehsan Adeli, Qingyu Zhao, Adolf Pfefferbaum, Edith Sullivan, Fei-Fei Li, Juan Carlos Niebles, Kilian Pohl
Dec 03
Research

The presence of bias (in datasets or tasks) is inarguably one of the most critical challenges in machine learning applications and has given rise to pivotal debates in recent years. Such challenges range from spurious associations between variables in medical studies to racial bias in gender or face recognition systems. Controlling for all types of bias at the dataset curation stage is cumbersome and sometimes impossible. The alternative is to use the available data and build models that incorporate fair representation learning. In this paper, we propose such a model based on adversarial training with two competing objectives: to learn features that have (1) maximum discriminative power with respect to the task and (2) minimal statistical mean dependence on the protected (bias) variable(s). Our approach does so by incorporating a new adversarial loss function that encourages a vanishing correlation between the bias and the learned features. We apply our method to synthetic data, medical images (containing task bias), and a dataset for gender classification (containing dataset bias). Our results show that the features learned by our method not only yield superior prediction performance but are also unbiased.
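The core quantity in this line of work, a penalty that drives the correlation between learned features and a protected variable toward zero, can be illustrated without any deep learning machinery. The sketch below (invented data and weights, not the paper's architecture or loss) computes a squared Pearson correlation between a feature and a protected variable and adds it to a task loss, which is the kind of decorrelation objective the abstract describes.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative setup: a 1-D learned feature for 200 samples, a task target,
# and a protected (bias) variable the feature should be independent of.
protected = rng.normal(size=200)
target = rng.normal(size=200)
feature = 0.8 * protected + 0.2 * rng.normal(size=200)  # feature leaks bias

def squared_pearson(a, b):
    """Squared Pearson correlation: the quantity a decorrelation loss drives to zero."""
    a = a - a.mean()
    b = b - b.mean()
    return float((a @ b) ** 2 / ((a @ a) * (b @ b)))

lam = 10.0  # weight on the decorrelation penalty (hypothetical value)
task_loss = float(np.mean((feature - target) ** 2))
total_loss = task_loss + lam * squared_pearson(feature, protected)
```

In the adversarial setting, an optimizer would update the feature extractor to shrink this penalty while a task head keeps the features predictive; here the leaky feature simply shows the penalty firing.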


Machine Learning
Research
Using Satellite Imagery to Understand and Promote Sustainable Development
Marshall Burke, Anne Driscoll, David Lobell, Stefano Ermon
Quick Read | Apr 01
Policy Brief

This brief discusses the opportunities and limitations of AI models that can map satellite image inputs to sustainable development outcomes.


Energy, Environment
Policy Brief
An Open-Source AI Agent for Doing Tasks on the Web
Katharine Miller
Mar 27
News

NNetNav learns how to navigate websites by mimicking childhood learning through exploration.


Machine Learning
Natural Language Processing
News
A New Approach to the Data-Deletion Conundrum
Andrew Myers
Sep 24
News

A team of computer scientists devised a way to quickly remove traces of sensitive user information from machine learning models.


Machine Learning
News