Synthetic Data: Diffusion Models - NeurIPS 2023 Series

Jose Gabriel Islas Montero

December 21, 2023

This post is part of our NeurIPS 2023 Series.

In today’s data-driven landscape, the challenges of acquiring and utilizing quality data for machine learning applications are evident.

This article looks at the “data problem,” explores the roadblocks in synthetic data generation, and introduces a potential solution based on Generative AI (GenAI): a diffusion-model approach presented at NeurIPS 2023.

1. The data problem
2. Roadblocks in synthetic data generation
3. Synthetic data with Diffusion Models
4. Conclusions

1. The data problem

Machine Learning relies profoundly on data to train models, make predictions, and uncover patterns. However, this dependency on data presents a myriad of hurdles and challenges (see Figure 1) that impact the effectiveness of machine learning systems.

Figure 1. The Data Problem: day-to-day hurdles and challenges in AI

One commonly cited approach to tackling the “data problem” is to generate your own data, often referred to as synthetic data.

What do we mean by Synthetic Data?

“Generally, synthetic data is defined as the artificially annotated information generated by computer algorithms or simulations” [1].

In many instances, the need for synthetic data arises when actual data is either inaccessible or requires confidentiality due to privacy or compliance concerns [1]. Think of credit card 💳 or health-related information: how can you train a model when you lack vast amounts of balanced data that includes edge cases? 🤔

Synthetic data is often viewed merely as a means to obtain additional data for downstream tasks (e.g., object detection, classification). However, other use cases, such as enhancing fairness (i.e., making a model more robust to fairness issues), receive less attention [3].

With this background in mind, let’s explore some of the main challenges of generating synthetic data.

Synthetic Data + Data-Centric

At Tenyks, we have seen first-hand the challenges that companies face when acquiring high-quality data. Imagine that you followed Tenyks’ Data-Centric approach to successfully identify where your model is failing. Now, what if you need more training data? 🤔

Real-world data, in many cases, is scarce; not all tasks enjoy the vast amount of data that applications such as autonomous driving provide. Hence, what alternatives does your ML team have?

i) Self-collect more data — but it’s expensive and time-consuming.
ii) Buy data from third-parties — but it’s impossible to inspect the quality of all the samples.

One less explored avenue is synthetic data. For instance, NeuroLabs provides high-quality synthetic data in the retail domain. A few ways you can use this generated data in a Data-Centric approach are as follows:

i) Enhancing training data. Synthetic data can complement real-world data, increasing the diversity of your dataset.
ii) Reducing data scarcity. As we mentioned, for some tasks (e.g., health) collecting or buying data is not an option. In these cases, synthetic data can mitigate data scarcity by creating additional training examples.
iii) Validation and testing. Assume you have enough training data but not enough validation and testing samples. Or, imagine you have data that does not include the edge cases you are interested in. For these scenarios, generating synthetic data might be a viable alternative.
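
One way to picture points i) and ii) above is a small helper that tops up a real training set with synthetic samples while capping their share. This is a minimal Python sketch: the `Sample` type, the `mix_datasets` helper, and the 50% default cap are hypothetical choices for illustration, not a Tenyks or NeuroLabs API.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    path: str        # hypothetical location of the image
    label: str       # class label
    synthetic: bool  # True if generated, False if collected


def mix_datasets(real, synthetic, max_synthetic_fraction=0.5, seed=0):
    """Return real + synthetic samples, capping the synthetic share.

    max_synthetic_fraction is the largest allowed fraction of synthetic
    samples in the result (an illustrative knob, not a recommendation).
    """
    if not 0.0 <= max_synthetic_fraction < 1.0:
        raise ValueError("max_synthetic_fraction must be in [0, 1)")
    # Solve s / (r + s) <= f for s, which gives s <= f * r / (1 - f).
    budget = int(max_synthetic_fraction * len(real) / (1.0 - max_synthetic_fraction))
    rng = random.Random(seed)
    chosen = rng.sample(synthetic, min(budget, len(synthetic)))
    mixed = list(real) + chosen
    rng.shuffle(mixed)
    return mixed
```

Keeping the `synthetic` flag on every sample makes it possible to later check whether model failures concentrate on real or generated data.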

2. Roadblocks in synthetic data generation

If acquiring and curating high-quality data the traditional way is already time-consuming, using algorithms to generate high-quality data is no easier [2].

Based on Yingzhou Lu et al. [1], Table 1 summarizes three of the main challenges of synthetic data generation:

Table 1. Challenges in synthetic data generation

Another crucial roadblock highlighted in [3] is the inability of synthetic data to accurately represent the intricate nuances of real-world data, especially in situations where such complexities may impact deployment.

3. Synthetic data with Diffusion Models

3.1 A diffusion-based generative model pretrained on a generic dataset

Despite the hurdles, there is optimism about the future impact of synthetic data.

One example, introduced in [2] at the NeurIPS 2023 workshop SyntheticData4ML, presents a diffusion-based method for synthetically generating images depicting emergency vehicles in road scenes.

Figure 3. Enhancing real images using a super-resolution model guided by random text prompts. No masking needed; entire image can be modified. Conditions like weather and time of day can be randomized for varied image versions. Source [2]
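
The randomization mentioned in the caption can be pictured as sampling scene conditions and composing them into a text prompt. The sketch below is only illustrative: the condition lists, the prompt template, and the `random_prompt` helper are assumptions, since the exact prompts used in [2] are not given here.

```python
import random

# Hypothetical condition vocabularies; [2] may use different ones.
WEATHER = ["clear", "rainy", "foggy", "snowy"]
TIME_OF_DAY = ["dawn", "noon", "dusk", "night"]


def random_prompt(rng, subject="a road scene"):
    """Compose a prompt with randomly chosen weather and time of day."""
    weather = rng.choice(WEATHER)
    time_of_day = rng.choice(TIME_OF_DAY)
    return f"{subject}, {weather} weather, {time_of_day}"


if __name__ == "__main__":
    rng = random.Random(42)  # seeded so the variations are reproducible
    for _ in range(3):
        print(random_prompt(rng))
```

Each prompt would then condition one pass of the text-guided super-resolution model, giving a different version of the same real image.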

The main idea of this approach (Figure 2) is a multi-step process for synthetic data generation:

Pretraining and Fine-tuning
— A pretrained diffusion model is fine-tuned on a generic dataset, even if that dataset lacks the infrequent target objects.
— The diffusion process is conditioned on text using a CLIP model: the denoising process is perturbed based on the gradient of the dot product of the image and text encodings.
Image Manipulation
— Three different image manipulation approaches are explored, leveraging the fine-tuned model. These approaches facilitate the generation of synthetic images containing a diverse range of infrequent objects of interest.
Training Downstream Models
— The synthetic images, enriched with infrequent objects, are utilized for training downstream object detection models.
Super-resolution Enhancement
— A text-conditioned super-resolution diffusion model is incorporated into the pipeline to enhance the resolution of the generated images.
Data Assumption
— The approach is built on the assumption that a small but domain-relevant real dataset is available.
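
To make the CLIP-conditioning step above more concrete, here is a toy Python sketch of gradient-based guidance. It assumes the common formulation in which the denoising mean is shifted by a scaled gradient of the image-text dot product; whether [2] uses exactly this scaling is an assumption. A linear map `W` stands in for CLIP's image encoder so the gradient can be written by hand, and all function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def encode_image(W, x):
    """Toy linear 'image encoder' f(x) = W x (a stand-in for CLIP)."""
    return [dot(row, x) for row in W]


def similarity_grad(W, text_emb):
    """Gradient of f(x) . g with respect to x; for linear f this is W^T g.

    With a real CLIP model this gradient depends on the noisy image and
    would be computed with automatic differentiation.
    """
    n_pixels = len(W[0])
    return [sum(W[k][i] * text_emb[k] for k in range(len(W)))
            for i in range(n_pixels)]


def guided_mean(mean, variance, W, text_emb, scale):
    """Shift the denoising mean toward higher image-text similarity.

    guided = mean + scale * variance * grad_x (f(x) . g)
    """
    grad = similarity_grad(W, text_emb)
    return [m + scale * v * g for m, v, g in zip(mean, variance, grad)]
```

In a full sampler this shift would be applied at every denoising step, with `scale` trading off prompt fidelity against sample diversity.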

To sum up, the work combines pretrained models, text conditioning, image manipulation, and super-resolution techniques to generate diverse synthetic images. These synthetic images are then employed for training object detection models, with the process anchored in a small real dataset to maintain relevance to the target domain.

4. Conclusions

Synthetic data generation is a promising solution, with the potential to ease the labelling challenges inherent in machine learning development.

Despite ongoing challenges in synthetic data, the advancements showcased at NeurIPS 2023, particularly through GenAI, show a positive trajectory for this area.
