Today, we are surrounded by AI hype. New AI-powered tools are announced almost every single day, claiming they’ll do almost anything for us: drive our cars, write our emails, make our art. Yet even for the biggest, splashiest tools, like ChatGPT, it’s unclear whether the AI approach improves on what it’s meant to replace. It’s difficult to separate what is genuinely useful from what is little more than noise. AI’s biggest problem is delivering on its promise.
There is an exception: synthetic data.
What is synthetic data?
Synthetic data is AI-generated data that mirrors the statistical properties of real-world data. By training AI models on real data, industries as varied as healthcare, manufacturing, finance, and software development can generate synthetic data to suit their every need: wherever and whenever they need it, at whatever scope and scale they desire.
Synthetic data solves several problems. For AI model development, it can mitigate the lack of affordable, high-quality training data. For software development and testing, synthetic datasets can help test edge cases, simulate complex data scenarios, and validate the quality of systems under likely real-world conditions. And while access to live production data is rightly restricted, those restrictions can hamper innovation across an organization. Synthetic data carries far fewer restrictions, freeing teams to build without unnecessary friction.
Companies like Amazon, Google, and American Express already rely on synthetic data, as do organizations like the UK’s National Health Service. Your company, whatever its sector, probably could too.
Synthetic, but not fake
Synthetic data is sometimes confused with fake data, and many use the two terms interchangeably. However, they are very different things. Fake data, or mock data, is cheap and easy to generate, and can be produced with open-source libraries such as Faker. However, fake data doesn’t have the same statistical properties as real data; it tends to be simple and uniform. For instance, if we generated a fake database of 100 transactions between $1 and $10,000, roughly 10 would fall between $1 and $1,000, roughly 10 between $1,001 and $2,000, and so on. Real-world purchase data is lumpy: some transactions cluster together, while others are outliers.
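To make the contrast concrete, here is a minimal sketch of mock data generation in Python with the open-source Faker library; the column names and value ranges are illustrative, not taken from any real schema.

import random
from faker import Faker

fake = Faker("en_US")
Faker.seed(0)
random.seed(0)

# 100 mock transactions: every state, date, and price band is equally likely
fake_transactions = [
    {
        "customer": fake.name(),
        "state": fake.state_abbr(),                     # no MA-heavy skew
        "date": fake.date_this_year(),                  # no pre-Thanksgiving spike
        "amount": round(random.uniform(1, 10_000), 2),  # uniform, not lumpy
    }
    for _ in range(100)
]

Roughly 10 of the 100 amounts land in each $1,000 band; there are no clusters and no outliers, because nothing about real purchasing behavior went into the generator.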
Fake data possesses few to none of the properties or characteristics of a real production dataset. Beyond simple parameters like range and data type, any resemblance to the real data is purely by chance. By contrast, synthetic data is constructed with statistical models and generative AI trained on real data. This synthetic data possesses the same statistical properties and internal relationships as the real-world dataset it’s meant to mimic.
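A minimal sketch of that workflow, assuming the open-source SDV library (its single-table API as of SDV 1.x) and a hypothetical CSV of real transactions, might look like this:

import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

real_data = pd.read_csv("transactions.csv")  # hypothetical table of real transactions

# Describe the table (column types, formats) so the synthesizer knows what to model
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_data)

# Fit a generative model to the real data, then sample new rows that preserve
# its distributions and the correlations between columns
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_data)
synthetic_data = synthesizer.sample(num_rows=1_000)

The sampled rows are newly generated rather than copies of real records, but statistically they behave like the original table.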
While both fake and synthetic data are useful, they are completely different tools. In real-world scenarios, these differences become very important. Let’s look at two examples: one in online retail and one in data science.
Synthetic data for testing software applications
Say an online sporting goods retailer has analyzed their data and noticed a few trends. They found that they get almost three times as many visitors from Massachusetts as from any other state, that a visitor from MA is most likely to buy snow boots in November, and that site traffic is expected to spike before Thanksgiving.
To take advantage of these findings, the retailer updates their website so that it shows snow boots to anyone coming to the website from MA during the three weeks before Thanksgiving. They also customize results for customers that have opted in to greater personalization, showing particular snow boot models based on each individual visitor’s purchase history and personal preferences.
Before the retailer rolls out these changes in their application, they want to test them. They want to be ready for a spike: even if tens of thousands of visits happen during this three-week window, the website should respond in less than a millisecond. They also want to make sure the right boots are shown to the right person at the right time to maximize the chance of a purchase. To run these tests, they need data.
What will happen if they use fake data? Because fake data is randomly generated, it will contain visitors from every state with equal frequency, spread evenly across every date in the year. Even if the team generates millions of fake visits and throws away everything outside MA and the date range, the fake data will not contain customers’ purchase histories, so it cannot exercise the code that customizes which snow boots to show. In testing and development environments, the application’s performance looks fine, but when real customers visit the website, performance is slow because of the clustering that was missing from the fake data.
What if the retailer used synthetic data instead? Synthetic data generated by an AI model trained on the retailer’s real data can emulate real customers. It can create entire customer journeys, from initial account creation through purchases made over the past two years: a realistic, synthetic customer.
If real customers bought product A and then bought product B six months later, the synthetic customers will follow this pattern. If there was a spike in traffic from MA in November, the synthetic dataset will emulate that. With synthetic data, the retailer can create data that reflects the real visits they expect, taking into account visitor locations, traffic spikes, and complex purchase histories. By testing with this data, they get a more accurate idea of what to expect, and can properly prepare their application.
Modern software applications are increasingly dynamic, adapting their output based on the data they see in real time. Their logic is frequently updated and new versions are deployed rapidly, sometimes multiple times a day. Before each deployment, developers must test that the new version performs well and functions correctly. Teams that test with synthetic data, not just fake data, can be more confident that their customers will have a great experience, and that the application will drive more sales.
Synthetic data removes the analyst bottleneck
Enterprises store enormous amounts of data about how their customers are using their products and services, hoping it will provide insights that can help drive the bottom line. To obtain these insights, they may hire consulting firms or freelance data scientists, or even hold public data science competitions. But their desire to get as many eyes on the data as possible often conflicts with the proprietary nature of data, as well as customer privacy concerns. Fake data again won’t help in this scenario, because it lacks the realistic properties of production data: the internal correlations and other statistical properties that lead to valuable insights.
For a data set to stand in for real data, it must deliver the same analytical conclusions as real data would. To return to the above example, if the real data shows that snow boots are the most popular purchase for customers from MA, an analyst using synthetic data must reach the same conclusion. Can synthetic data really be that good?
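One quick way to check is to run the same simple analysis on both datasets and see whether the answers agree. Here is a minimal sketch in Python with pandas; the tiny toy tables and column names are invented purely for illustration.

import pandas as pd

def top_product_by_state(df: pd.DataFrame) -> pd.Series:
    # Most frequently purchased product in each state
    counts = df.groupby(["state", "product"]).size()
    return counts.groupby(level="state").idxmax().map(lambda idx: idx[1])

# Tiny toy stand-ins for the real purchase table and its synthetic counterpart
real = pd.DataFrame({
    "state":   ["MA", "MA", "MA", "NY", "NY"],
    "product": ["snow boots", "snow boots", "running shoes", "running shoes", "running shoes"],
})
synthetic = pd.DataFrame({
    "state":   ["MA", "MA", "NY", "NY", "NY"],
    "product": ["snow boots", "snow boots", "running shoes", "running shoes", "sandals"],
})

agreement = (top_product_by_state(real) == top_product_by_state(synthetic)).mean()
print(f"States where both datasets give the same answer: {agreement:.0%}")

If the synthetic table is good, checks like this agree with the real data far more often than not.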
To answer this question systematically, my team at MIT has done a series of experiments.
The first one dates back to 2017, when my group hired freelance data scientists to develop predictive models as part of a crowd-sourced experiment. We wanted to figure out: “Is there any difference between the work of data scientists given synthetic data and the work of those with access to the real data?”
To test this, one group of data scientists was given the original, real data, while three other groups were given synthetic versions. Each group used its data to solve a predictive modeling problem, eventually conducting 15 tests across 5 datasets. In the end, when their solutions were compared, those built by the group using real data and those built by the groups using synthetic data showed no significant performance difference in 11 of the 15 tests (roughly 70 percent of the time).
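The core of that comparison is easy to sketch: train one model on real data, another on synthetic data, and score both on the same held-out real test set. The snippet below does this in Python with scikit-learn; the generated dataset and the simple per-class Gaussian synthesizer are stand-ins for illustration, not the models, datasets, or synthesizers used in the actual study.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in "real" dataset, with a held-out test set reserved for evaluation
X, y = make_classification(n_samples=2_000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Stand-in synthesizer: fit one Gaussian per class to the real training data, then sample
def synthesize(X_real, y_real, n_per_class=750):
    Xs, ys = [], []
    for label in np.unique(y_real):
        cls = X_real[y_real == label]
        mean, cov = cls.mean(axis=0), np.cov(cls, rowvar=False)
        Xs.append(rng.multivariate_normal(mean, cov, size=n_per_class))
        ys.append(np.full(n_per_class, label))
    return np.vstack(Xs), np.concatenate(ys)

X_syn, y_syn = synthesize(X_train, y_train)

# Train one model on real data and one on synthetic data; score both on real test data
real_model = LogisticRegression(max_iter=1_000).fit(X_train, y_train)
syn_model = LogisticRegression(max_iter=1_000).fit(X_syn, y_syn)
print("trained on real data:     ", real_model.score(X_test, y_test))
print("trained on synthetic data:", syn_model.score(X_test, y_test))

If the synthesizer has captured the real data’s structure, the two scores come out close; that is the kind of agreement the experiment measured.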
Since then, synthetic data has become a staple in data science competitions, and it is beginning to transform data sharing and analysis for enterprises. Kaggle, a popular data science competition website, now releases synthetic datasets regularly, including some from enterprises. Wells Fargo released a synthetic dataset for a competition in which data scientists were asked to predict suspected fraud related to elder exploitation. Spar Nord Bank released an anti-money-laundering dataset for data scientists to find patterns indicative of money laundering.
Conclusion
Synthetic data is a useful application of AI technology that is already delivering real, tangible value to customers. More than mere fake data, synthetic data supports data-driven business systems throughout their lifecycle, particularly where ongoing access to production data is impractical or ill-advised.
If your projects are hampered by expensive and complex processes to access production data, or limited by the inherent restrictions of fake data, synthetic data is worth exploring. You can start using synthetic data today by downloading one of the freely available options.
Synthetic data is a valuable new technique that more and more organizations are adding to their data-driven workloads. Ask your data teams where you could use synthetic data and break free of the fakers and the hype.
About the Author
Kalyan Veeramachaneni is the co-founder and CEO of DataCebo, the synthetic data company revolutionizing developer productivity at enterprises by leveraging generative AI. He is also a principal research scientist at MIT where he founded and directs a research lab called Data-to-AI housed within MIT’s Schwarzman College of Computing. At the lab, they build technologies that enable development, validation and deployment of large-scale AI applications derived from data.