Synthetic data is attracting interest from a growing number of sectors, particularly as a remedy for scarce training data, but it also raises many questions. It is one of the subjects we covered in ActuIA magazine n°9, currently on newsstands. At DataCebo, work on synthetic data has led to a dedicated module called Synthetic Data Metrics.
DataCebo, a start-up from CSAIL, MIT’s Computer Science and Artificial Intelligence Laboratory, has announced Synthetic Data (SD) Metrics, available on GitHub, as part of its Synthetic Data Vault (SDV) project. This open source Python module was developed to help companies evaluate tabular synthetic data in a model-independent way by comparing synthetic data sets to real data sets.
At the heart of data science
Researchers at MIT’s Laboratory for Information and Decision Systems (LIDS) were working on data science projects in 2013. When they wanted to test them on real datasets, they ran into obstacles to accessing the data, including numerous security regulations and red tape. They decided to use synthetic data instead.
In 2016, in a paper describing the very first iteration of SDV, they introduced a new technique for synthesizing multi-table data and described trials in which data scientists had successfully used synthetic data instead of real data for machine learning tasks.
After some pilot tests on enterprise applications, they released SDV as open source on PyPI for general use. The startup DataCebo was then founded in 2020 by Kalyan Veeramachaneni, Neha Patki and Saman Amarasinghe with the main objective of developing the project.
The Synthetic Data Vault
The Synthetic Data Vault (SDV) is an ecosystem of synthetic data generation libraries that lets users fit models to single-table, multi-table, and time series datasets and then generate new synthetic data with the same format and statistical properties as the original dataset.
This synthetic data can be used to supplement, augment and, in some cases, replace real data when training machine learning models. In addition, it allows testing of machine learning or other data-dependent software systems without the exposure risk associated with data disclosure.
It uses several graphical modeling and deep learning techniques, such as Copulas, CTGAN and DeepEcho.
Major banks, insurance organizations and clinical trial companies use models created with Copulas, which has been downloaded over a million times. CTGAN, a neural network-based model, has been downloaded over 500,000 times.
According to DataCebo, datasets with multiple tables or time series data are also supported.
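To make the learn-then-generate workflow described in this section concrete, here is a minimal sketch in plain Python. It is not SDV's API or any of its models: the functions `fit` and `sample` and the sample table are hypothetical, and the "model" is simply an independent normal distribution per numeric column. It only illustrates the idea of learning a table and then producing new rows with the same columns and similar per-column statistics.

```python
import random
import statistics


def fit(rows):
    """Learn a (mean, standard deviation) pair for each numeric column.

    Hypothetical helper for illustration only, not part of SDV.
    """
    columns = list(rows[0].keys())
    model = {}
    for col in columns:
        values = [row[col] for row in rows]
        model[col] = (statistics.mean(values), statistics.stdev(values))
    return model


def sample(model, n, seed=0):
    """Draw n synthetic rows, each column sampled independently."""
    rng = random.Random(seed)
    return [
        {col: rng.gauss(mu, sigma) for col, (mu, sigma) in model.items()}
        for _ in range(n)
    ]


# Small made-up "real" table used only for this sketch.
real_rows = [
    {"age": 23, "income": 31000},
    {"age": 35, "income": 48000},
    {"age": 41, "income": 52000},
    {"age": 52, "income": 67000},
    {"age": 29, "income": 39000},
    {"age": 60, "income": 71000},
]

model = fit(real_rows)
synthetic_rows = sample(model, 2000)
```

Because each column is sampled independently, this sketch loses the relationships between columns (here, age and income). Capturing that dependence is the kind of problem that copula-based and neural models are designed to address.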
Synthetic Data (SD) Metrics
The SD Metrics module defines metrics for statistics, efficiency and data privacy, and generates visual reports that team members can share.
Because the SDMetrics library is model independent, it can be used with any synthetic data, regardless of the model that produced it.
Neha Patki explains:
“For tabular synthetic data, there is a need to create metrics that quantify how the synthetic data compares to the actual data. Each metric measures a particular aspect of the data, such as coverage or correlation, which allows you to identify which specific elements were preserved or missed during the synthetic data process.”
The CategoryCoverage and RangeCoverage metrics quantify whether a company’s synthetic data covers the same range of possible values as the actual data. The CorrelationSimilarity metric, as the name suggests, compares correlations between the two.
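The ideas behind these three metrics can be sketched in a few lines of plain Python. The functions below are simplified illustrations written for this article, not the SDMetrics implementation or its API; the function names, the scoring formulas and the sample values are assumptions chosen so that every score falls between 0 (worst) and 1 (best). Note that each function takes only real and synthetic values as input, which is what being model independent means in practice.

```python
import math


def category_coverage(real, synthetic):
    """Fraction of the real categories that also appear in the synthetic data."""
    real_categories = set(real)
    return len(real_categories & set(synthetic)) / len(real_categories)


def range_coverage(real, synthetic):
    """Share of the real [min, max] interval that the synthetic values span."""
    r_min, r_max = min(real), max(real)
    width = r_max - r_min
    if width == 0:
        raise ValueError("real data needs at least two distinct values")
    missed_low = max(min(synthetic) - r_min, 0) / width
    missed_high = max(r_max - max(synthetic), 0) / width
    return max(1 - (missed_low + missed_high), 0)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_similarity(real_x, real_y, synth_x, synth_y):
    """1 when both datasets show the same correlation, 0 when they are opposite."""
    return 1 - abs(pearson(real_x, real_y) - pearson(synth_x, synth_y)) / 2


# Made-up values for illustration.
cat_score = category_coverage(
    ["red", "green", "blue", "red"],   # real
    ["red", "green", "green"],         # synthetic: "blue" is never generated
)
range_score = range_coverage([20, 30, 40, 60], [25, 35, 50])  # 0.625
corr_score = correlation_similarity(
    [1, 2, 3, 4], [2, 4, 6, 8],        # real columns move together
    [1, 2, 3, 4], [8, 6, 4, 2],        # synthetic columns move in opposition
)
```

In this example the synthetic data covers two of the three real categories, spans 62.5% of the real numeric range, and gets the relationship between the two columns backwards, so the correlation score lands at the bottom of the scale.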
More than 30 metrics are currently available, with more in development.