Phong Nguyen, Chief AI Officer of FPT Software, spoke at the Worldwide AI Webinar about data-centric AI for practical applications. He presented the challenges of and approaches to data-centric AI and explained why the focus should be on data-centric systems.
3 major challenges of turning ML research into production
According to Phong, turning ML research into an actual application requires a wide range of skills, and people with a practical mindset who aim to solve the problem rather than study it.
Data and quality of data
Phong said this is the challenge he has always wanted to fix. In research, there is usually a benchmark dataset provided by someone else. In industry, however, the available data is often not ready to be used for machine learning.
As Phong noted, many companies are still building their data platforms and verifying their data quality before they can apply machine learning and other analytics frameworks.
Automation in AI and ML process
To mass-produce algorithms and applications and scale them up, Phong believes we need many automated tools for working with AI and ML.
Why should we focus on data-centric AI?
Phong gave three reasons why a shift to data-centric AI is crucial for every enterprise looking to develop an AI/ML system.
First, ML in research and ML in production are different, especially with respect to data. In research the data is static, while in production its distribution commonly keeps shifting.
Second, data is often the biggest challenge in developing and deploying AI. Data pipelines make up the largest part of an ML system, and data preprocessing is reported to consume 80% of data scientists' time.
Finally, in industrial applications, trustworthy AI requires trustworthy data. Machine learning models generalize from historical data, and that data usually has quality issues such as label mistakes, missing or messy values, constant shifts, or gaps in coverage. Quality training data at scale is therefore vital for production ML systems.
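Two of the quality issues named above, missing values and label mistakes, can be counted with a few lines of code. The following is only a minimal sketch and not something shown in the talk: the function name audit_records, the record layout, and the "label" field are assumptions made for illustration. It flags records with empty fields and groups of identical inputs that carry contradictory labels.

```python
from collections import defaultdict


def audit_records(records, label_key="label"):
    """Count simple data-quality problems in a list of dict records."""
    with_missing = 0
    labels_by_features = defaultdict(set)
    for rec in records:
        # A record is incomplete if any field is None or an empty string.
        if any(v is None or v == "" for v in rec.values()):
            with_missing += 1
        label = rec.get(label_key)
        if label is None or label == "":
            continue  # unlabeled records cannot conflict with anything
        # Identical feature values are grouped under the same key.
        features = tuple(sorted((k, v) for k, v in rec.items() if k != label_key))
        labels_by_features[features].add(label)
    # A group conflicts when the same input appears with different labels.
    conflicts = sum(1 for labels in labels_by_features.values() if len(labels) > 1)
    return {
        "total": len(records),
        "with_missing": with_missing,
        "conflicting_groups": conflicts,
    }


records = [
    {"text": "good product", "lang": "en", "label": "pos"},
    {"text": "good product", "lang": "en", "label": "neg"},  # contradicts the row above
    {"text": "bad service", "lang": None, "label": "neg"},   # missing field
    {"text": "okay", "lang": "en", "label": ""},              # missing label
]
report = audit_records(records)
print(report)
```

A report like this does not fix anything by itself, but it gives a number that can be tracked as the dataset is cleaned.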
Approaches to data-centric AI
Building a strong AI foundation
The first approach to data-centric AI, according to Phong, is building a strong AI foundation that covers data pipelines, data techniques, data ingestion, data governance, data quality, data storage, data processing, and data consumption.
Companies also have to work with data warehouses, data lakes, cloud data lakes, and hybrid multi-cloud data and AI platforms, and create processes that ensure the quality of the data.
A data strategy during training is another important part of the data-centric approach, as it helps you work with a limited dataset while still getting a model with higher performance and accuracy.
At the deployment stage, it is essential to monitor errors, biases, and data drift. In some industrial applications, explainable AI is also crucial, using a range of techniques to open up the AI black box.
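One simple way to monitor data drift is to compare the distribution of a feature in production against the distribution seen at training time. The sketch below uses the Population Stability Index (PSI) for that comparison. It is an illustrative choice, not a method attributed to the talk; the function name psi, the bin count, and the 0.25 alert threshold (a commonly quoted rule of thumb) are all assumptions.

```python
import math


def psi(reference, production, bins=10, eps=1e-6):
    """Population Stability Index between two samples of one numeric feature."""
    lo, hi = min(reference), max(reference)
    # Bin edges come from the reference sample; fall back to width 1.0
    # if the reference sample is constant.
    width = (hi - lo) / bins or 1.0

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp out-of-range values into edge bins
            counts[idx] += 1
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    ref_p = proportions(reference)
    prod_p = proportions(production)
    return sum((p - r) * math.log(p / r) for r, p in zip(ref_p, prod_p))


reference = [i / 100 for i in range(100)]   # feature values seen during training
shifted = [x + 0.5 for x in reference]      # the same feature after a shift

psi_same = psi(reference, reference)
psi_shifted = psi(reference, shifted)

ALERT_THRESHOLD = 0.25  # assumed rule-of-thumb threshold, tune per use case
print(psi_same, psi_shifted, psi_shifted > ALERT_THRESHOLD)
```

In practice a check like this would run per feature on a schedule, with an alert raised when the index crosses the chosen threshold.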
Data-centric AI competitions
As Phong shared, FPT Software promotes data-centric AI by holding regular AI competitions. These focus on techniques for improving the quality of data and ask participants to search for and optimize their models to get the best performance.
During these contests, participants used various effective augmentation techniques, such as blur, contrast, zoom, hue, crop, translation, mosaic, rotation, edge case, mixup, and cutout, which helped improve the performance of the model.
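To make two of the listed techniques concrete, here is a minimal sketch of cutout and mixup on tiny grayscale images stored as nested lists. It is not the code used in the competitions. The function names, the square mask, and the fixed blending weight are assumptions chosen for readability; in the usual formulation of mixup the weight is drawn at random, for example with random.betavariate.

```python
import random


def cutout(image, size, rng):
    """Return a copy of a 2D image with one size x size square set to 0."""
    height, width = len(image), len(image[0])
    top = rng.randrange(height - size + 1)
    left = rng.randrange(width - size + 1)
    out = [row[:] for row in image]  # copy so the input image is left untouched
    for r in range(top, top + size):
        for c in range(left, left + size):
            out[r][c] = 0.0
    return out


def mixup(image_a, label_a, image_b, label_b, lam):
    """Blend two images and their one-hot labels with weight lam in [0, 1]."""
    image = [
        [lam * a + (1 - lam) * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(image_a, image_b)
    ]
    label = [lam * a + (1 - lam) * b for a, b in zip(label_a, label_b)]
    return image, label


rng = random.Random(0)  # seeded for reproducibility
bright = [[1.0] * 4 for _ in range(4)]
dark = [[0.0] * 4 for _ in range(4)]

masked = cutout(bright, size=2, rng=rng)
mixed_image, mixed_label = mixup(bright, [1.0, 0.0], dark, [0.0, 1.0], lam=0.75)
print(masked)
print(mixed_label)
```

Cutout hides part of the input so the model cannot rely on a single region, while mixup produces in-between examples with soft labels. Real projects would normally use an image library rather than nested lists.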
Phong Nguyen gave a few detailed examples of how FPT Software has applied a data-centric approach in its data science projects. You can learn more by watching his speech on our website and YouTube channel.
There are three points the Chief AI Officer of FPT Software would like the audience to take away from his keynote:
Data-centric AI is a key part of practical AI/ML.
In data-centric AI, model performance is achieved by tuning and improving data quality while keeping the AI/ML algorithms constant.
In the data-centric AI approach, data quality must be maintained and improved at all stages of the AI/ML project lifecycle.
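The second takeaway can be illustrated with a toy experiment: keep the algorithm fixed and change only the data. The sketch below is an assumption-laden illustration, not an example from the keynote. It uses a hypothetical one-dimensional nearest-centroid classifier and a made-up dataset in which a single training label is wrong, then compares accuracy before and after that label is corrected.

```python
def fit_centroids(samples):
    """samples: list of (value, label). Returns {label: mean value}."""
    sums, counts = {}, {}
    for value, label in samples:
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict(centroids, value):
    """Return the label whose centroid is closest to value."""
    return min(centroids, key=lambda label: abs(value - centroids[label]))


def accuracy(centroids, samples):
    hits = sum(1 for value, label in samples if predict(centroids, value) == label)
    return hits / len(samples)


# Made-up data: the only difference between the two training sets is one label.
noisy_train = [(1, 0), (2, 0), (3, 0), (7, 1), (8, 1), (9, 0)]  # last label is wrong
clean_train = [(1, 0), (2, 0), (3, 0), (7, 1), (8, 1), (9, 1)]  # label corrected
test_set = [(1.5, 0), (2.5, 0), (4.5, 0), (5.5, 1), (6.5, 1), (8.5, 1)]

# Same algorithm in both runs; only the training data differs.
acc_noisy = accuracy(fit_centroids(noisy_train), test_set)
acc_clean = accuracy(fit_centroids(clean_train), test_set)
print(acc_noisy, acc_clean)
```

On this toy data the mislabeled point pulls one centroid toward the other class, so one test example is misclassified. Correcting the label recovers it without touching the model code.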