Good data: the foundation of useful AI

29 March 2022

Gary Allemann, managing director, Master Data Management

Investment into Artificial Intelligence (AI) skyrocketed as the world scrambled to understand, predict and model Covid-19. However, the velocity and volume of investment failed to translate into helpful outcomes, and in some cases made things worse. This highlights the role that data plays in training AI – not just the amount of data, but its quality and the ability to map its lineage. Without the right data, AI produces poor insight, learns the wrong things and can actually do more harm than good.

Unexpected consequences of Covid-19

The pandemic plunged the world into a spiralling health crisis that was unprecedented and, as a result, poorly understood. In an attempt to assist, AI teams around the world stepped in, building hundreds of predictive tools intended to help hospitals diagnose or triage Covid-19 patients faster. Yet, in spite of this massive investment, none of them helped. In fact, some were potentially harmful.

In this specific example, numerous studies show that researchers across the world repeated the same basic errors in the way that they trained and tested their models. At the time that these models were being built, only public data sets were available. In many cases these were poorly labelled, or from unknown sources. As a result, duplicate data sets were in some instances consolidated, skewing outcomes.

In others, the same data was used to both develop and test models, making them appear more accurate than they really were. Some patient data was included that had nothing to do with Covid-19 – again training the models to draw invalid conclusions. A subtler and more insidious problem was the bias built into some data sets – for example, some data sets came already labelled with Covid-19 diagnoses, causing those diagnoses to be weighted more heavily in the models.
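Two of the failure modes described above – duplicate records skewing results, and the same data leaking into both training and test sets – can be guarded against with very simple tooling. The sketch below is illustrative only: the record fields and the `dedupe_and_split` helper are assumptions, not anything from the studies cited, and real pipelines would fingerprint near-duplicates far more carefully.

```python
import hashlib
import random

def record_hash(record: dict) -> str:
    """Fingerprint a record by its sorted field values, so exact
    duplicates hash identically regardless of field order."""
    canonical = "|".join(f"{k}={record[k]}" for k in sorted(record))
    return hashlib.sha256(canonical.encode()).hexdigest()

def dedupe_and_split(records, test_fraction=0.2, seed=42):
    """Drop exact duplicates, then split into disjoint train/test
    sets so no record can appear on both sides of the evaluation."""
    seen, unique = set(), []
    for r in records:
        h = record_hash(r)
        if h not in seen:
            seen.add(h)
            unique.append(r)
    rng = random.Random(seed)
    rng.shuffle(unique)
    cut = int(len(unique) * (1 - test_fraction))
    return unique[:cut], unique[cut:]

# Hypothetical patient records; the exact duplicate is silently dropped
# before the split, so it cannot inflate apparent accuracy.
records = [
    {"id": 1, "age": 54, "label": "positive"},
    {"id": 2, "age": 31, "label": "negative"},
    {"id": 1, "age": 54, "label": "positive"},  # duplicate
    {"id": 3, "age": 47, "label": "positive"},
]
train, test = dedupe_and_split(records)
```

The key property is that deduplication happens before the split: removing duplicates afterwards would still leave copies straddling the train/test boundary.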

All about the data

Covid-19 has clearly highlighted the importance of data in AI. While most industries will not face potentially life-threatening consequences from poorly trained or poorly executed AI models, the business repercussions can still be dire. Many of the issues uncovered in the models developed to aid Covid-19 diagnosis come down to a poor understanding of the data and its quality.

AI has a significant role to play for many businesses in the future, but it will continue to struggle to deliver quality results unless data management foundations are in place. It is essential to apply basic data management principles, including metadata management and data quality. It is also imperative that data sets are standardised and that the data used to train AI models is clean and its lineage can be trusted.
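In practice, "basic data management principles" often start with a simple quality gate that rejects records before they reach model training. The field names and rules below are illustrative assumptions, not a prescribed standard – a minimal sketch of what such a gate might look like:

```python
# Hypothetical schema for incoming records; real rules would come
# from the organisation's data governance and metadata management.
REQUIRED_FIELDS = {"patient_id", "age", "test_result"}
VALID_RESULTS = {"positive", "negative"}

def validate(record: dict) -> list[str]:
    """Return a list of rule violations; an empty list means the
    record passes and may be used for training."""
    errors = []
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        errors.append(f"missing fields: {sorted(missing)}")
    age = record.get("age")
    if age is not None and not (0 <= age <= 120):
        errors.append(f"age out of range: {age}")
    result = record.get("test_result")
    if result is not None and result not in VALID_RESULTS:
        errors.append(f"unknown test_result: {result!r}")
    return errors

# Sample batch: only the first record survives the gate.
batch = [
    {"patient_id": "a1", "age": 62, "test_result": "positive"},
    {"patient_id": "a2", "age": 200, "test_result": "negative"},
    {"patient_id": "a3", "test_result": "maybe"},
]
clean = [r for r in batch if not validate(r)]
```

Logging the rejected records and their violations, rather than silently dropping them, is what makes the quality problems visible and traceable back to source.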

Addressing the problem at source

As AI and machine learning become more prevalent, data takes on a more important role than ever, as the source code for machine-driven insight. This ultimately means that without addressing data quality and data management challenges, including identifying biases in the data, creating useful AI models is all but impossible.
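Identifying bias in training data does not always require sophisticated tooling; a cheap first check is simply to inspect the label distribution. The helper below is a minimal sketch, assuming records carry a hypothetical `label` field:

```python
from collections import Counter

def label_balance(records, label_field="label"):
    """Report the share of each label in the data set; a heavily
    skewed distribution is a cheap early warning of sampling bias."""
    counts = Counter(r[label_field] for r in records)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Synthetic example: a 90/10 skew that would be worth investigating
# before training on this data.
sample = [{"label": "positive"}] * 90 + [{"label": "negative"}] * 10
shares = label_balance(sample)
```

A skew like this is not proof of bias, but it is the kind of question a data-quality process should force the team to answer before the model is trusted.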

If models are trained on poor data, they will inevitably make poor decisions. If businesses use the wrong data, they will obtain the wrong insights. Data quality processes are therefore essential for debugging the data that underlies AI and machine learning predictions.