Synthetic Data: Even Better than the Real Thing?

Artist's conception of the Synthetic Data Vault.

Our lives are inextricably intertwined with data. It is fundamental to software development, artificial intelligence (AI) training, and product testing; it is deployed across industry and social media, and in decision-making. According to a 2020 report by market research firm International Data Corporation, "More than 59 zettabytes (ZB) of data will be created, captured, copied, and consumed in the world this year."

This is a mind-boggling amount of data, but it is not always available to those who want to make use of it. Innovators working on emerging technologies, such as autonomous vehicles, may find relevant data rare and prohibitively expensive. Developers' access to data is often limited because of confidentiality.

Synthetic data, generated from simulations based on real data, has emerged as an answer. It is not a new concept, but recent developments have boosted its accuracy and usability. Add societal issues such as privacy, the General Data Protection Regulation (GDPR), and even the impact of the Covid-19 pandemic on data gathering and access, and the arguments on behalf of synthetic data appear even stronger.

Synthetic data can be useful in any real data-based context: researchers have demonstrated the use of synthetic data in object detection, in crowd counting, in machine learning for healthcare, and even in marine science for the detection of Western rock lobsters.

A group at the Massachusetts Institute of Technology (MIT), led by principal research scientist and Data-to-AI group leader Kalyan Veeramachaneni, has launched an updated set of open-source tools for producing synthetic data. The work is part of the Synthetic Data Vault (SDV), an online ecosystem that allows users to create synthetic data from their own data sources.

Veeramachaneni first experimented with synthetic data back in 2012, to tackle data access bottlenecks in an online learning platform. He realized it could also provide a solution to a problem he had encountered in industry during conversations about data access for machine learning (ML).

"All those conversations come to a grinding halt when we say, 'How can we get access to the data? For that we have to go through this process, and then what do we do next?' It takes three to six months to actually get access to the data," explained Veeramachaneni.

His group set out to build general-purpose tools that would allow anyone to create synthetic data from real data. By 2016, they had succeeded in creating statistical models using datasets from Kaggle, and sampling from those to create synthetic data.

The next step was to take a "much, much more comprehensive" approach by simultaneously creating algorithms, software, and tools that could address any enterprise data type. The result was the Synthetic Data Vault.

The researchers use three types of modelling techniques to generate synthetic data: a classic technique based on Bayesian networks, a mathematical tool from economics called copulas, and deep learning (DL).
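To give a feel for the copula idea, here is a minimal sketch of a two-column Gaussian copula written with only the Python standard library. It is an illustration under stated assumptions, not the SDV implementation: the toy `ages` and `incomes` columns and all function names are hypothetical. The sketch learns how strongly the two columns move together, samples correlated random numbers, and maps them back onto the observed values of each column.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def to_normal_scores(column):
    """Replace each value by the standard-normal score of its rank."""
    n = len(column)
    order = sorted(range(n), key=lambda i: column[i])
    scores = [0.0] * n
    for rank, i in enumerate(order):
        # (rank + 0.5) / n lies strictly between 0 and 1, so inv_cdf is defined.
        scores[i] = STD_NORMAL.inv_cdf((rank + 0.5) / n)
    return scores


def empirical_quantile(sorted_column, u):
    """Observed value at quantile u (no interpolation between values)."""
    index = min(int(u * len(sorted_column)), len(sorted_column) - 1)
    return sorted_column[index]


def sample_gaussian_copula(col_a, col_b, n_samples, seed=0):
    """Draw synthetic (a, b) pairs that mimic the dependence in the real columns."""
    rng = random.Random(seed)
    rho = pearson(to_normal_scores(col_a), to_normal_scores(col_b))
    # Guard against rounding pushing rho * rho slightly above 1.
    residual = max(0.0, 1.0 - rho * rho) ** 0.5
    sorted_a, sorted_b = sorted(col_a), sorted(col_b)
    rows = []
    for _ in range(n_samples):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + residual * rng.gauss(0.0, 1.0)  # correlated with z1
        u1, u2 = STD_NORMAL.cdf(z1), STD_NORMAL.cdf(z2)
        rows.append((empirical_quantile(sorted_a, u1),
                     empirical_quantile(sorted_b, u2)))
    return rows


# Hypothetical "real" table: age in years, income in thousands.
ages = [23, 35, 41, 52, 29, 60, 47, 33]
incomes = [31, 45, 40, 70, 38, 66, 72, 35]
synthetic = sample_gaussian_copula(ages, incomes, n_samples=200)
```

Because this sketch reuses the observed values of each column, it only recombines them into new rows; a production tool would also model each column's distribution so that individual values are not copied from the source.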

"Deep learning-based synthetic data generation started for images, that's where you see all those deep fakes, and there was a very popular technique called generative adversarial networks (GANs)," said Veeramachaneni.

The MIT group adapted GAN methods used on pixel-based images to work on tabular data. The trick is to generate realistic-looking data, said Veeramachaneni, but it is a fine balancing act: "You don't want it to be so real that it can actually enable you to detect some personal information about someone if it belongs to humans."
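One simple way to picture that balance is to check whether any synthetic row sits suspiciously close to a real one. The sketch below is an assumption-laden illustration, not a method described by the MIT group and not a formal privacy guarantee: the function names, the toy rows, and the distance threshold are all hypothetical.

```python
import math


def distance_to_closest_record(synthetic_row, real_rows):
    """Euclidean distance from one synthetic row to its nearest real row."""
    return min(math.dist(synthetic_row, real_row) for real_row in real_rows)


def too_close_fraction(synthetic_rows, real_rows, threshold):
    """Share of synthetic rows lying within `threshold` of some real row."""
    flagged = sum(
        1 for row in synthetic_rows
        if distance_to_closest_record(row, real_rows) <= threshold
    )
    return flagged / len(synthetic_rows)


# Hypothetical numeric rows: (age, income in thousands).
real_rows = [(30.0, 50.0), (45.0, 80.0)]
synthetic_rows = [(30.0, 50.0), (40.0, 60.0)]  # the first row copies a real one
share = too_close_fraction(synthetic_rows, real_rows, threshold=1.0)
```

A high share suggests the generator is reproducing real records rather than generalizing from them. In practice the columns would need to be scaled to comparable ranges before distances mean much.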

The latest tools in the SDV ecosystem support scalability, testing, and interaction with data science teams. To prove the functionality of algorithms and software, users need to come up with edge cases. As Veeramachaneni explained, "Slowly and steadily, we have seen a lot of people coming to it, using it, telling us where it's working, where it's not working, and that's essentially driving us to make it much better."

When the Covid-19 pandemic shut down MIT's Data-to-AI labs, the group spotted another use case. Sensitive data is often housed on one or two computers. Veeramachaneni said the team had to work out how to keep their own machines up and running: "Then we were like, 'wouldn't it help to just have synthetic data, so that everyone can have their data on their local machine at home?'"

Privacy and access make a solid case for synthetic data use, but there are others.

Sebastian Drave is chief data scientist at Harbr, a U.K.-based company that provides collaborative enterprise data exchange technology. He also worked on Syntheticr, a financial ecosystem that uses agent-based modelling to generate synthetic data in a simulated banking system.

For Drave, machine learning and AI will be key drivers in synthetic data uptake. Companies will require an increasingly diverse range of input data, and confidence in new techniques can be built on synthetic data; it can accelerate both process and adoption.

Organizations often bring in third parties to provide AI expertise, Drave said, "either as a stepping-stone to developing it themselves or because they don't want to take on the ownership of having to develop that capability." From a risk perspective, synthetic data could be a powerful tool.

Drave said he sees further application in training models for image analysis and pattern recognition. However, he pointed out, challenges remain. Proving the efficacy of synthetic data often means running it against real datasets, to see how the two things line up. "As soon as you get into that world, you sort of flip back into the same problems that having synthetic data in the first place is trying to solve."
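As a rough illustration of what "running it against real datasets" can look like for a single numeric column, the sketch below computes the two-sample Kolmogorov-Smirnov statistic, the largest gap between the empirical distributions of the real and synthetic values. This is one common comparison, chosen here as an assumption; it is not a procedure attributed to Drave, and the function name is hypothetical. As he notes, it needs access to the real data.

```python
from bisect import bisect_right


def ks_statistic(real, synthetic):
    """Largest gap between the two empirical CDFs (0 = identical, 1 = disjoint)."""
    real_sorted, synth_sorted = sorted(real), sorted(synthetic)
    largest_gap = 0.0
    # The maximum gap always occurs at one of the observed values.
    for x in set(real_sorted) | set(synth_sorted):
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, x) / len(synth_sorted)
        largest_gap = max(largest_gap, abs(cdf_real - cdf_synth))
    return largest_gap
```

A value near 0 means the synthetic column follows the real one closely; a value near 1 means the two barely overlap. It says nothing about relationships between columns, which need separate checks.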

The broader issue of bias across machine learning also needs to be addressed. Algorithms need to learn from something, and real data can contain bias. "If you then use an algorithm to produce synthetic data to then train other algorithms to make decisions, you can start to get amplifications of anything that was in that original," said Drave.

Synthetic data still has issues to resolve and it cannot replace the real thing, nor does it set out to. It does, however, appear to have a clear and expanding role in the overall data environment.

Karen Emslie is a location-independent freelance journalist and essayist.