Data-Centric AI: Building AI Systems in the Real World - NeurIPS 2023 Series

Jose Gabriel Islas Montero

December 20, 2023


‍This post is part of our NeurIPS 2023 Series:

‍

In this article, we explore the key role of Data-Centric approaches in achieving real-world AI success.

‍

This post covers highlights from NeurIPS 2023: why prioritizing data over model-centric approaches matters for real-world success, and how a data-centric pipeline can produce better benchmarks.

‍

Why is Data-Centric still the key to real-world AI success?
Data-Centric over Model-Centric evaluations
A Data-Centric pipeline to create better benchmarks
Conclusion

‍

1. Why is Data-Centric still the key to real-world AI success?

Data-Centric continues to solidify its presence in the field. At Tenyks, we have been advocating for a Data-Centric strategy as the right approach to building production-grade ML systems.

‍

Over the years, we have observed that many of the pressing problems ML teams encounter in production are related to data (e.g., low-quality labelling, inconsistent labels across classes, or edge cases).

‍

We have summarized a few of the main issues ML engineers have to tackle as they start a brand new project in our article titled Where is my Model Failing? — A Data-Centric Approach. In addition to describing the challenges, we have shown how the Tenyks platform can be used to detect data issues through multi-modal embedding search.
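To make the idea of embedding search concrete, here is a minimal sketch in plain Python. It is not the Tenyks API: the toy vectors, the image names, and the `cosine_similarity` and `search` helpers are all hypothetical. We assume the embeddings come from a multi-modal model, so that a text query and an image live in the same vector space.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query, index, top_k=2):
    """Return the top_k (name, score) pairs most similar to the query."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hypothetical 3-dimensional embeddings; real ones have hundreds of dimensions.
index = {
    "img_001_night_rain.jpg": [0.9, 0.1, 0.0],
    "img_002_sunny.jpg": [0.0, 0.2, 0.9],
    "img_003_night_fog.jpg": [0.8, 0.3, 0.1],
}

# Stand-in for the embedding of a text query such as "night scene".
query = [1.0, 0.0, 0.0]
results = search(query, index)

for name, score in results:
    print(f"{name}: {score:.3f}")
```

With these toy vectors the two night images rank first, which is the behaviour you want when hunting for a specific kind of edge case in a large dataset.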

‍

In another article, Sports Analytics — A Data-Centric Approach to Computer Vision, we show how a Data-Centric approach can help identify the failing dataset class that, once fixed, yields the highest reward.

‍

What’s crucial to note now is that, as new ideas such as Foundation Models become mainstream, the Data-Centric approach remains fundamental in enabling the development of machine learning systems [1].

‍

1.1 Data-Centric for reliable and responsible AI

It has become clear that high-quality data is required to build unbiased and robust models [1]; the struggle, however, starts when an ML engineer tries to apply this approach in daily tasks.

‍

What are the main roadblocks during the process of systematically engineering a high-quality dataset? 🤔

Figure 1. NeurIPS 2023 Tutorial: “Data-Centric AI for reliable and responsible AI”. Source [3]

‍

In one of the earliest sessions (see Figure 1) at NeurIPS 2023, Mihaela van der Schaar, Professor of Machine Learning at the University of Cambridge, gave a fantastic talk addressing some of the common challenges encountered in creating high-quality datasets:

‍

1. Real-world data with errors leads to AI that is ingrained with errors:

ML models should be trained/tested considering the messy real-world data they will have to deal with at deployment.

‍

2. Collecting high-quality data is very expensive:

Common roadblocks: i) privacy requirements, ii) demographic data tends to be imbalanced [5], iii) internet-only data is biased.
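The imbalance roadblock, at least, is cheap to check before any training happens. Below is a minimal sketch, assuming each sample carries a group label; the `group_shares` and `underrepresented` helpers, the 10% threshold, and the toy age groups are hypothetical choices for illustration.

```python
from collections import Counter

def group_shares(labels):
    """Fraction of the dataset that belongs to each group."""
    counts = Counter(labels)
    total = len(labels)
    return {group: n / total for group, n in counts.items()}

def underrepresented(labels, min_share=0.10):
    """Groups whose share of the dataset falls below min_share."""
    shares = group_shares(labels)
    return sorted(group for group, share in shares.items() if share < min_share)

# Hypothetical demographic attribute attached to 100 samples.
labels = ["18-30"] * 70 + ["31-50"] * 25 + ["51+"] * 5

shares = group_shares(labels)
flagged = underrepresented(labels)

print(shares)
print("Under-represented groups:", flagged)
```

The right threshold depends on the application; the point is only that the check can run on metadata alone, before a single model is trained.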

‍

With this in mind, we introduce one of the main works at NeurIPS 2023 with a focus on Data-Centric: DataPerf: Benchmarks for Data-Centric AI Development [2].

‍

2. Data-Centric over Model-Centric evaluations

2.1 The issue with Model benchmarks

The main problem highlighted in this work is that large public datasets, commonly used for benchmarking model performance [4], are often adopted by researchers without sufficient consideration of dimensions such as breadth, difficulty, and fidelity to the underlying problem.

Datasets such as ImageNet serve as benchmarks for evaluating model performance.
The common practice of adopting large datasets without assessing their quality (e.g., poorly annotated datasets) can lead to performance discrepancies in models, resulting in (hidden) reduced accuracy.
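A back-of-the-envelope sketch shows how annotation errors can hide a model's real accuracy. It rests on simplifying assumptions that are ours, not the paper's: a binary task, test labels flipped independently at a fixed rate, and model mistakes that are independent of label mistakes. The `measured_accuracy` helper is hypothetical.

```python
def measured_accuracy(true_accuracy, label_error_rate):
    """Accuracy reported against noisy binary test labels.

    The prediction matches the noisy label in two cases:
    - the model is right and the label is right, or
    - the model is wrong and the label is flipped (binary task only).
    """
    right_and_label_right = true_accuracy * (1 - label_error_rate)
    wrong_and_label_flipped = (1 - true_accuracy) * label_error_rate
    return right_and_label_right + wrong_and_label_flipped

true_accuracy = 0.95
for rate in (0.0, 0.05, 0.10):
    reported = measured_accuracy(true_accuracy, rate)
    print(f"label error rate {rate:.2f} -> reported accuracy {reported:.3f}")
```

Under these assumptions, a model that is right 95% of the time is reported at roughly 86% when one test label in ten is wrong, and nothing in the benchmark number reveals that the gap comes from the data rather than the model.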

‍

2.2 Where might this problem come from?

In academic research, and in school more generally, you are often handed a curated and clean dataset (see Figure 2). The task then becomes: “engineer the best model for this dataset”.

‍

Figure 2. Traditional Model-Centric approach for AI. Source [2]

‍

Conventional wisdom for solving this task revolves around:

selecting the best model: deep learning or classical approaches (e.g., decision trees, KNN).
applying training strategies: optimization algorithms, loss functions, or regularization techniques.

‍

These techniques represent some of the Model-Centric ways to build AI.

‍

Model-Centric has its place: it works great for model design and development. However, is this the right framework for developing AI in industry today?

‍

2.3 Working on real-world AI challenges

When you put together an ML system, your end users care little about the fancy strategies you followed to achieve near-perfect accuracy. The ugly truth is that in the real world the dataset is alive: the data is no longer a fixed dataset!

‍

Andrej Karpathy, former Director of AI at Tesla, has suggested that the key for Tesla to win over the competition was their Data Engine. He argues Tesla spends as much as 70% (or more) on the datasets.

‍

A common misconception after finishing school is that you cannot touch the data! 😱 In fact, for your AI system to perform better in the real world, you often start with a fixed model (e.g., YOLOv8) and you are encouraged (and free) to modify, fix, or improve the dataset! 😎 Only then do you go back to the model or the training algorithms.
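The "fixed model, improve the data" loop can be sketched in a few lines. This is a toy illustration under our own assumptions: the "model" is a nearest-centroid classifier on one-dimensional features (a stand-in for a real detector), and the samples, the `corrections` mapping, and all function names are hypothetical.

```python
def train_centroids(samples):
    """Fixed 'model': one mean per class over 1-D features."""
    sums, counts = {}, {}
    for x, y in samples:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(centroids, x):
    """Assign x to the class with the closest centroid."""
    return min(centroids, key=lambda y: abs(x - centroids[y]))

def accuracy(centroids, samples):
    hits = sum(predict(centroids, x) == y for x, y in samples)
    return hits / len(samples)

# Training set where the last two samples carry labelling errors.
noisy_train = [
    (1.0, "cat"), (2.0, "cat"), (3.0, "cat"),
    (7.0, "dog"), (8.0, "dog"), (9.0, "dog"),
    (8.5, "cat"), (9.5, "cat"),
]
# Small, trusted validation set.
val = [(0.5, "cat"), (4.5, "cat"), (6.0, "dog"), (10.0, "dog")]

# Step 1: train the fixed model on the data as-is.
accuracy_before = accuracy(train_centroids(noisy_train), val)

# Step 2: fix the data (corrections found by review), leave the model untouched.
corrections = {6: "dog", 7: "dog"}  # sample index -> corrected label
clean_train = [(x, corrections.get(i, y)) for i, (x, y) in enumerate(noisy_train)]

# Step 3: retrain the same model on the repaired data.
accuracy_after = accuracy(train_centroids(clean_train), val)

print(f"before: {accuracy_before:.2f}, after: {accuracy_after:.2f}")
```

The model code never changes between the two runs; only the labels do, and the validation score improves as a result.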

‍

🤔 But what if I have a dataset of 1M images? Well, do not panic! Start by checking our article on how to handle very large datasets!

‍

Now, can we really create Data-Centric benchmarks? Let’s explore what the authors of this work propose.

‍

3. A Data-Centric pipeline to create better benchmarks

To answer the previous question, the authors first developed a tool capable of engineering a higher-quality dataset, so that better benchmarks can emerge.

‍

Figure 3 shows a Data-Centric pipeline: while the model architecture is locked in, the goal is to iterate over the dataset across six different but related data stages.

‍

3.1 Six components to iterate on your data

Figure 3. Data-Centric pipeline: set of data oriented operations (in green) [2]

‍

On the basis of this pipeline, they proposed DataPerf, an open-source platform for hosting and proposing new data-centric benchmarks. Some of the benchmarks developed by this work include tasks such as speech and vision training, data cleaning, and diffusion model prompting.

‍

3.2 DataPerf for Vision: selection

DataPerf introduces a data selection challenge focused on vision tasks.

Task description: Develop a data selection algorithm that chooses effective training samples from a large pool of candidate images.
Benchmark evaluations: The vision-selection-algorithm benchmark evaluates binary classification of visual concepts (e.g., “monster truck” or “jean jacket”) in unlabeled images.
Application examples: Familiar production examples of similar models include automatic labeling services by Amazon Rekognition, Google Cloud Vision API, and Azure Cognitive Services.
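To give a feel for what a data selection algorithm might look like, here is a deliberately simple baseline. It is not the DataPerf reference implementation or its submission format. We assume that a handful of known positive examples and precomputed embeddings for every candidate are available; the `select_training_set` function, the candidate ids, and the toy 2-D vectors are hypothetical.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def select_training_set(positives, pool, budget):
    """Pick budget // 2 candidates closest to the concept as pseudo-positives
    and budget // 2 candidates farthest from it as pseudo-negatives."""
    half = budget // 2
    if half == 0:
        raise ValueError("budget must be at least 2")
    target = centroid(positives)
    ranked = sorted(pool, key=lambda cid: cosine(pool[cid], target), reverse=True)
    return {"positive": ranked[:half], "negative": ranked[-half:]}

# Hypothetical embeddings of a few known examples of the concept.
positives = [[0.9, 0.1], [0.8, 0.2]]

# Hypothetical pool of unlabeled candidates: id -> embedding.
pool = {
    "cand_a": [1.0, 0.0],
    "cand_b": [0.7, 0.3],
    "cand_c": [0.1, 0.9],
    "cand_d": [0.0, 1.0],
    "cand_e": [0.5, 0.5],
}

selection = select_training_set(positives, pool, budget=4)
print(selection)
```

A stronger algorithm would also look for informative samples near the decision boundary (such as `cand_e` here) rather than only the easiest ones; choosing well under a fixed budget is exactly what the benchmark is meant to reward.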

‍

Other challenges, including Data Acquisition, can be found along with examples in the original paper [2].

‍

4. Conclusion

While Model-Centric is an effective approach for designing and developing models, it might not be the right way to build robust ML systems: in the real world, data is messy, dynamic, and massive.

‍

In this post, we make the case that, to truly leverage a Data-Centric approach, we need to switch from Model-Centric benchmarks to Data-Centric ones.

‍

Getting Data-Centric right [3] can provide clarity and precision in specifying and gathering data:

We can define with more certainty the kind of data we need to collect.
If we need synthetic data, we know we can define it.
In healthcare, we can collect just the data we need to answer the question at hand.

‍