Erik Altman | Synthetic Data Sets: Use Cases for the Financial Industry
IBM Synthetic Data Sets (SDS) have been created for use cases in the financial industry.
One key focus is fraud and criminal activity, whose cost runs into the hundreds of billions of dollars per year or more. SDS labels many of these criminal activities, including money laundering, credit card fraud, check fraud, Authorized Push Payment (APP) fraud (scams), and insurance claims fraud. As such, SDS data provides an attractive foundation for training AI detection models.
Unlike much current activity around synthetic data generation, SDS is not built using large language models. Instead, SDS uses an agent-based virtual world approach. A key advantage of the SDS design is that all labels are correct: all fraud is labelled fraud, and only fraud is labelled fraud. By contrast, much criminal activity is missed in the real world, including 95% of money laundering according to a UN estimate. Hence, even when real data is available, its labels are often too incomplete for training detection models or for generating synthetic data.
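To make the labelling argument concrete, here is a minimal Python sketch of a toy agent-based world. It is not the SDS implementation: the names `Transaction`, `simulate`, and `observed_labels`, the agent counts, and the amount ranges are all hypothetical choices made for illustration. The only figure taken from the text above is the 5% detection rate, which mirrors the estimate that 95% of money laundering is missed. The sketch shows why labels are correct by construction in a simulated world, and how real-world labelling would hide most of the same activity.

```python
import random
from dataclasses import dataclass


@dataclass
class Transaction:
    sender: int
    receiver: int
    amount: float
    is_laundering: bool  # ground-truth label, set by the simulator itself


def simulate(num_agents=50, num_steps=1000, launderer_share=0.1, seed=0):
    """Toy agent-based world (illustrative assumption, not the SDS design).

    At each step one agent pays another. Because the simulator decides which
    agents are launderers, every transaction's label is known by construction.
    """
    rng = random.Random(seed)
    agents = list(range(num_agents))
    launderers = set(rng.sample(agents, int(num_agents * launderer_share)))
    transactions = []
    for _ in range(num_steps):
        sender, receiver = rng.sample(agents, 2)  # two distinct agents
        illicit = sender in launderers
        # Amount ranges are arbitrary placeholders, not calibrated values.
        low, high = (9000, 9999) if illicit else (10, 5000)
        amount = round(rng.uniform(low, high), 2)
        transactions.append(Transaction(sender, receiver, amount, illicit))
    return transactions, launderers


def observed_labels(transactions, detection_rate=0.05, seed=1):
    """Mimic real-world labels, where only a fraction of illicit activity is caught."""
    rng = random.Random(seed)
    return [t.is_laundering and rng.random() < detection_rate for t in transactions]


if __name__ == "__main__":
    txns, _ = simulate()
    true_count = sum(t.is_laundering for t in txns)
    seen_count = sum(observed_labels(txns))
    print(f"illicit in simulation: {true_count}, flagged under 5% detection: {seen_count}")
```

A model trained on `observed_labels` would treat most illicit transactions as legitimate, whereas the simulator's own `is_laundering` field contains no such noise.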
In practice, access to real data is generally limited to a small number of people at the institution (e.g., a bank) that owns the data. As such, real data provides only a narrow view of activity at a single institution, as opposed to the global view provided by SDS data. The SDS approach also yields a broad set of synthetic personal information. This information is highly realistic despite using no information from real individuals.
Development of effective techniques for SDS has required deep expertise across diverse areas. It has also required significant manual effort. How to automate some of these efforts remains an open challenge, as do calibration, scaling, and other areas.
