How to Use Synthetic and Simulated Data Effectively

Our weekly selection of must-read Editors' Picks and original features

Using synthetic data isn’t exactly a new practice: it’s been a productive approach for several years now, providing practitioners with the data they need for their projects in situations where real-world datasets prove inaccessible, unavailable, or limited from a copyright or approved-use perspective.

The recent rise of LLMs and generative-AI tools has transformed the synthetic-data scene, however, just as it has numerous other workflows for machine learning and data science professionals. This week, we’re presenting a collection of recent articles that cover the latest trends and possibilities you should be aware of, as well as the questions and considerations you should keep in mind if you decide to create your own toy dataset from scratch. Let’s dive in!

  • How To Use Generative AI and Python to Create Designer Dummy Datasets
    If it’s been a while since you last needed synthetic data, don’t miss Mia Dwyer’s concise tutorial, which outlines a streamlined method for creating a dummy dataset with GPT-4 and a little bit of Python. Mia keeps things fairly simple, and you can adapt and build on this approach so it fits your specific needs.
  • Creating Synthetic User Research: Using Persona Prompting and Autonomous Agents
    For a more advanced use case that also relies on the power of generative-AI applications, we recommend catching up with Vincent Koc’s guide to synthetic user research. It leverages an architecture of autonomous agents to "create and interact with digital customer personas in simulated research scenarios," making user research both more accessible and less resource-heavy.
  • Synthetic Data: The Good, the Bad and the Unsorted
    Working with generated data solves some common problems, but can introduce a few others. Tea Mustać focuses on a promising use case (training AI products, which often requires massive amounts of data) and unpacks the legal and ethical concerns that synthetic data can help us bypass, as well as those it can’t.
Photo by Rachel Loughman on Unsplash
  • Simulated Data, Real Learnings: Scenario Analysis
    In his ongoing series, Jarom Hulet looks at the different ways that simulated data can empower us to make better business and policy decisions and draw powerful insights along the way. After covering model testing and power analysis in previous articles, the latest installment zooms in on simulating more complex scenarios to optimize outcomes.
  • Evaluating Synthetic Data – The Million Dollar Question
    The main assumption behind every process that relies on synthetic data is that it sufficiently resembles the statistical properties and patterns of the real data it emulates. Andrew Skabar, PhD, offers a detailed guide to help practitioners evaluate the quality of their generated datasets and the degree to which they meet that crucial threshold.
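
To make that last idea a bit more concrete, here is a minimal sketch of one way to check whether a generated numeric column resembles its real counterpart. It is not taken from any of the articles above: the helper names (ks_statistic, compare_column) and the Gaussian toy data are our own assumptions for illustration. The sketch compares means and standard deviations, and computes a two-sample Kolmogorov-Smirnov statistic by hand using only the Python standard library.

```python
import random
import statistics


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the two empirical cumulative distribution functions (0 = identical)."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    max_gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Step past every value equal to x in both samples.
        while i < len(a) and a[i] == x:
            i += 1
        while j < len(b) and b[j] == x:
            j += 1
        max_gap = max(max_gap, abs(i / len(a) - j / len(b)))
    return max_gap


def compare_column(real, synthetic):
    """Hypothetical helper: summarize how far a synthetic column drifts
    from the real one on a few basic statistics."""
    return {
        "mean_gap": abs(statistics.mean(real) - statistics.mean(synthetic)),
        "stdev_gap": abs(statistics.stdev(real) - statistics.stdev(synthetic)),
        "ks": ks_statistic(real, synthetic),
    }


# Toy data (an assumption for illustration): a "real" column plus two
# generated candidates, one faithful and one with a shifted mean.
rng = random.Random(0)
real = [rng.gauss(50, 10) for _ in range(1000)]
good_synth = [rng.gauss(50, 10) for _ in range(1000)]
bad_synth = [rng.gauss(60, 10) for _ in range(1000)]

report_good = compare_column(real, good_synth)
report_bad = compare_column(real, bad_synth)
print("faithful:", report_good)
print("shifted: ", report_bad)
```

A lower KS value means the two distributions sit closer together, so the shifted candidate should score noticeably worse than the faithful one. Per-column checks like this are only a starting point, since they say nothing about relationships between columns.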

For more thought-provoking articles on other topics—from data career moves to multi-armed pendulums—we invite you to explore these recent standouts:


Thank you for supporting the work of our authors! If you’re feeling inspired to join their ranks, why not write your first post? We’d love to read it.

Until the next Variable,

TDS Team

