Erik Altman | Synthetic Data Sets: Use Cases for the Financial Industry | Stanford HAI
Seminar

Status: Past
Date: Wednesday, May 07, 2025, 12:00 PM - 1:15 PM PST/PDT
Location: Gates Computer Science Building, Room 119, 353 Jane Stanford Way, Stanford, CA 94305
Topics: Finance, Business

IBM Synthetic Data Sets (SDS) were created for use cases in the financial industry.

One key focus is fraud and criminal activity, whose cost runs into the hundreds of billions of dollars per year or more. SDS labels many of these criminal activities, including money laundering, credit card fraud, check fraud, Authorized Push Payment (APP) fraud (scams), and insurance claims fraud. As such, SDS data provides an attractive foundation for training AI detection models.

Unlike much current activity around synthetic data generation, SDS is not built using large language models. Instead, SDS uses an agent-based, virtual-world approach. A key advantage of the SDS design is that all labels are correct: all fraud is labelled fraud, and only fraud is labelled fraud. By contrast, much criminal activity is missed in the real world, including, by a UN estimate, 95% of money laundering. Hence, even when real data is available, its labels are often too incomplete for it to serve well in training detection models or as a basis for generating synthetic data.
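
The point about labels being correct by construction can be illustrated with a toy sketch. This is not the SDS implementation, whose details the abstract does not give; every name, parameter, and behavioural rule below (the share of criminal agents, the amount ranges, the 50% chance of an illicit transfer) is a hypothetical assumption chosen only for illustration. The one figure taken from the abstract is the 5% detection rate, which mirrors the UN estimate that 95% of money laundering is missed. Because the simulator itself decides which transfers are illicit, the ground-truth label is known exactly, whereas the "observed" labels mimic a real-world data set in which most illicit activity is never flagged.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    sender: int
    receiver: int
    amount: float
    is_laundering: bool  # ground truth, known because the simulator created it


def simulate(num_agents=200, criminal_share=0.05, steps=50, seed=0):
    """Toy agent-based world: every agent sends one payment per step."""
    rng = random.Random(seed)
    num_criminals = max(1, int(num_agents * criminal_share))
    criminals = set(rng.sample(range(num_agents), num_criminals))
    transactions = []
    for _ in range(steps):
        for sender in range(num_agents):
            # Pick a receiver different from the sender.
            receiver = rng.randrange(num_agents - 1)
            if receiver >= sender:
                receiver += 1
            # Hypothetical rule: criminal agents launder on half of their payments.
            illicit = sender in criminals and rng.random() < 0.5
            low, high = (5_000, 9_900) if illicit else (5, 2_000)
            amount = round(rng.uniform(low, high), 2)
            transactions.append(Transaction(sender, receiver, amount, illicit))
    return transactions


def observed_labels(transactions, detection_rate=0.05, seed=1):
    """Mimic real-world labels, where most illicit activity is never flagged."""
    rng = random.Random(seed)
    return [t.is_laundering and rng.random() < detection_rate for t in transactions]


txs = simulate()
truth = [t.is_laundering for t in txs]
seen = observed_labels(txs)
print(f"illicit transactions in ground truth: {sum(truth)}")
print(f"illicit transactions flagged in observed labels: {sum(seen)}")
```

A detection model trained on the observed labels would be taught that most laundering transactions are legitimate, which is the label-quality problem the abstract describes; training on the ground-truth labels avoids it.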

In practice, access to real data is generally limited to a small number of people at the institution (e.g. a bank) that owns the data. As such, real data provides only a narrow view of activity at a single institution, as opposed to the global view provided by SDS data. The SDS approach also yields a broad set of synthetic personal information. This information is highly realistic despite using no information from real individuals.

Development of effective techniques for SDS has required deep expertise across diverse areas.  It has also required significant manual effort.  How to automate some of these efforts remains an open challenge, as do calibration, scaling, and other areas.

Speakers
Erik Altman
IBM Researcher

Event Contact
Annie Benisch
abenisch@stanford.edu
2099183302