AI is not about data sets

As I’m spending an increasing amount of time in the AI field with a variety of companies, I’ve noticed an interesting misconception in the ML/DL space. Many have a tendency to focus on data sets, experimenting with different models using a specific set of data and, finally, deploying a model in a specific context. This approach is very linear and ‘waterfall-ish’.

It’s easy to understand why many end up here as most data scientists spend the vast majority of their time evaluating, pre-processing and cleaning data sets and the thought of having to do this continuously is less than appealing. Instead, many data scientists want to create a model of satisfactory accuracy and move on to the next project.

This viewpoint is naive. In the successful real-world systems I’ve been exposed to, teams work iteratively: new data is collected constantly, the model is retrained periodically and the retrained model is deployed to production on the same DevOps heartbeat as the rest of the system’s components. Deployment often happens in some form of A/B testing environment to validate that the new version performs better not only in training but in practice as well.

The point is that AI, and specifically ML/DL, isn’t about creating a model by laboriously processing and cleaning a specific data set, but about building a system where the data collected during the current deployment is used to update and train the model for the next deployment. Once building this system becomes the goal, the team’s priorities shift as well. There are several changes to consider.

First, data generated during operation should, preferably, be collected fully automatically and then prepared for and used in training without manual steps. Many data scientists spend almost all their time cleaning and pre-processing data and then train using standard algorithms from a variety of publicly available libraries. Consequently, many find it difficult to imagine a world where data can be used for training without any human interaction. However, when building this system, that’s exactly what the focus should be.
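To make the idea of training data without human interaction concrete, here is a minimal Python sketch of an automated preparation step. The record schema (`sensor_value`, `label`), the valid range and the function name are assumptions made up for illustration; a real system would encode its own validation rules.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class RawRecord:
    """One observation logged by the deployed system (hypothetical schema)."""
    sensor_value: Optional[float]
    label: Optional[int]


def prepare_for_training(
    records: Iterable[RawRecord], lo: float = 0.0, hi: float = 100.0
) -> List[Tuple[float, int]]:
    """Drop incomplete or out-of-range records and scale the feature to [0, 1]."""
    prepared = []
    for r in records:
        # Rule 1: skip records with a missing feature or missing label.
        if r.sensor_value is None or r.label is None:
            continue
        # Rule 2: skip values outside the assumed valid range.
        if not (lo <= r.sensor_value <= hi):
            continue
        prepared.append(((r.sensor_value - lo) / (hi - lo), r.label))
    return prepared


if __name__ == "__main__":
    logged = [
        RawRecord(50.0, 1),
        RawRecord(None, 0),    # incomplete: dropped
        RawRecord(150.0, 1),   # out of range: dropped
    ]
    print(prepare_for_training(logged))
```

The point of the sketch is that every cleaning decision is written down as a rule the pipeline applies on its own, so the step can run on each new batch of operational data.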

Second, once deployed, the model should provide mechanisms to continuously track its performance. As models are updated and retrained repeatedly, we need to verify during training as well as during operations that they’re performing well.
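One simple way to track performance in operation is a rolling accuracy over the most recent predictions for which the true outcome has become known. The sketch below assumes such ground truth eventually arrives; the class name, window size and threshold are hypothetical choices, not a prescribed design.

```python
from collections import deque
from typing import Optional


class RollingAccuracyMonitor:
    """Tracks accuracy over the last `window` predictions with known outcomes."""

    def __init__(self, window: int = 100, threshold: float = 0.9):
        # deque with maxlen discards the oldest outcome automatically.
        self.outcomes = deque(maxlen=window)
        self.threshold = threshold

    def record(self, prediction, actual) -> None:
        """Store whether one prediction matched the observed outcome."""
        self.outcomes.append(prediction == actual)

    def accuracy(self) -> Optional[float]:
        """Return the rolling accuracy, or None if nothing was recorded yet."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def is_degraded(self) -> bool:
        """True when the rolling accuracy has fallen below the threshold."""
        acc = self.accuracy()
        return acc is not None and acc < self.threshold


if __name__ == "__main__":
    monitor = RollingAccuracyMonitor(window=4, threshold=0.75)
    for prediction, actual in [(1, 1), (1, 1), (0, 1), (0, 1)]:
        monitor.record(prediction, actual)
    print(monitor.accuracy(), monitor.is_degraded())
```

A check like `is_degraded()` could feed an alert or trigger retraining; which metric and threshold make sense depends on the system.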

Third, subsequent versions of models should be evaluated to ensure that continuous improvement is achieved, not only in training but especially in operation. This is where A/B testing or multi-armed bandit (MAB) systems come in: they let you feed the newly deployed model a small slice of the traffic and increase that slice over time as it proves to perform better than the old model.
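The "small slice, then crank it up" idea can be sketched as a simple traffic ramp. Note that this is a deliberately simplified rollout rule rather than a full multi-armed bandit algorithm, and the function names, step size and rollback share are assumptions for illustration.

```python
import random


def next_traffic_share(
    current_share: float,
    old_metric: float,
    new_metric: float,
    step: float = 0.1,
    min_share: float = 0.05,
) -> float:
    """Return the share of traffic the new model gets in the next period.

    If the new model outperformed the old one in operation, widen its slice
    by `step` (capped at 100%); otherwise fall back to the minimal slice.
    """
    if new_metric > old_metric:
        return min(1.0, current_share + step)
    return min_share


def route_request(rng: random.Random, new_share: float) -> str:
    """Randomly assign one request to the 'new' or 'old' model."""
    return "new" if rng.random() < new_share else "old"


if __name__ == "__main__":
    rng = random.Random(42)
    share = 0.05
    # Hypothetical per-period metrics measured in operation: (old, new).
    for old_metric, new_metric in [(0.80, 0.83), (0.80, 0.84), (0.80, 0.79)]:
        share = next_traffic_share(share, old_metric, new_metric)
        print(f"new model share: {share:.2f}, sample route: {route_request(rng, share)}")
```

A real bandit system would adjust the shares based on statistical evidence rather than a single comparison per period, but the control loop has the same shape.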

Finally, we need a hierarchical value model that links the performance of ML/DL models at the component or subsystem level to the value delivered at the system and customer level. ML/DL models operate in the context of a larger system: they don’t just need to perform within their own scope, but must also contribute to delivering customer value at the system level. Connecting their performance with the overall system performance and ensuring that we avoid the local optimization trap is critical for long-term success.
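As an illustration only, a hierarchical value model could be represented as a tree in which leaf nodes hold component-level metrics and every parent aggregates its children. The weighted average used here, the node names and the weights are all assumptions; how component metrics really translate into customer value has to be established per system.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ValueNode:
    """A node in a hypothetical value hierarchy.

    Leaves carry a measured metric in [0, 1]; inner nodes aggregate
    their children as a weighted average.
    """
    name: str
    weight: float = 1.0
    metric: Optional[float] = None  # only meaningful on leaves
    children: List["ValueNode"] = field(default_factory=list)

    def value(self) -> float:
        if not self.children:
            return self.metric if self.metric is not None else 0.0
        total_weight = sum(c.weight for c in self.children)
        return sum(c.weight * c.value() for c in self.children) / total_weight


if __name__ == "__main__":
    system = ValueNode(
        "customer value",
        children=[
            ValueNode("ML model accuracy", weight=3.0, metric=0.9),
            ValueNode("response time score", weight=1.0, metric=0.5),
        ],
    )
    print(f"system-level value: {system.value():.2f}")
```

With such a structure, a model change that raises its own metric but lowers another component’s score shows up as a drop at the top of the tree, which is one way to spot the local optimization trap.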

In conclusion, although many working in the AI space focus heavily on data sets, if only because that is where people spend most of their time, AI and especially ML/DL isn’t about that. Instead, it’s about building a DevOps system where the ML/DL components are constantly improved and retrained based on the most recent data collected using the latest deployment. Getting that system right, both for the ML/DL models and for the system as a whole, is the real challenge.

To get more insights earlier, sign up for my newsletter or follow me on LinkedIn or Twitter (@JanBosch).

One thought on “AI is not about data sets”

  1. Hi Jan,
    Have these ideas been tested in real cases?
    I would like to see how this can be applied in Intelligent Transport Systems
    and Automated Driving Systems, with a use case and a proof-of-concept platform to work with.
