The AI-driven company: data challenge

Photo by Franki Chamaki on Unsplash

When interviewing people for our study on industrial AI adoption, it became clear that some, or even many, view LLMs and multi-modal models specifically as a mechanism to leapfrog inherent weaknesses in their company. One of the key areas is data. I keep being surprised by how poorly many companies deal with their data. Even if you can use AI agents for generic tasks, once you seek to integrate them into your workflows or simply intend to use them as assistants, they need access to the information and data stored within the company.

For any AI solution to work properly, the data given to it for training or inference needs to be of sufficient quality. The adage of garbage in, garbage out is as true for AI as it is for any other situation. The challenge is that many companies treat their software with a high level of professionalism, including proper testing, versioning and certification, but manage their data with nowhere near the same rigor. Individual teams can start and stop data collection at will, services can be built on raw data without proper cleaning and semantic definition and, especially in software-intensive systems companies, the data is often used for quality assurance and diagnostics and lacks the contextual information necessary for training machine learning models.

When taking a step back and looking to bring some structure into these challenges, we can identify at least five major areas of concern: quality, access, volume, dynamism and security. First, as mentioned, data quality is a major concern in many of the companies I meet. Typically, the data collected is ‘raw’ in that it isn’t cleaned and checked for consistency. Especially when it’s gathered by devices in the field, all kinds of situations can cause data records to be inconsistent. Most companies store data in its raw form, as cleaning and processing tend to be use case specific. However, failing to process that data for a specific use case and simply using it in its raw form can lead to significant issues. For instance, customer-facing data-driven services may show highly inconsistent data.
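
To make this concrete, here's a minimal sketch of what use-case-specific cleaning of raw field data could look like, assuming a hypothetical engine-telemetry feed with vehicle_id, timestamp and engine_temp_c columns; the names and plausibility thresholds are illustrative, not taken from any company we studied.

```python
import pandas as pd

def clean_engine_telemetry(raw: pd.DataFrame) -> pd.DataFrame:
    """Use-case-specific cleaning of a hypothetical raw telemetry feed."""
    df = raw.copy()
    # Raw field data frequently has gaps; drop records missing key fields.
    df = df.dropna(subset=["vehicle_id", "timestamp", "engine_temp_c"])
    # Enforce consistent types and chronological ordering.
    df["timestamp"] = pd.to_datetime(df["timestamp"], errors="coerce")
    df = df.dropna(subset=["timestamp"]).sort_values("timestamp")
    # Reject physically implausible readings instead of serving them to users.
    df = df[df["engine_temp_c"].between(-40, 150)]
    # Remove duplicates produced by devices retransmitting the same record.
    df = df.drop_duplicates(subset=["vehicle_id", "timestamp"])
    return df
```

Even a modest step like this keeps customer-facing services from displaying the inconsistencies that are unavoidable in raw field data.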

A second quality aspect is the frequency and granularity of the data. If data is aggregated per second, minute or hour when it’s collected, any use case that requires a higher frequency is impossible to pursue. Similarly, data that’s mapped to, for instance, low, medium or high, such as for engine temperature in an ICE vehicle, can’t be used for use cases that require a quantitative measurement.
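
The sketch below illustrates the granularity point with a made-up per-second temperature signal: once the data is stored only as per-minute averages, a short overheating spike is averaged away and no later use case can recover it.

```python
import pandas as pd

# Hypothetical per-second engine temperature with a five-second overheating spike.
idx = pd.date_range("2024-01-01", periods=300, freq="s")
temp = pd.Series(90.0, index=idx)
temp.iloc[150:155] = 130.0

# Store only per-minute averages, as many collection pipelines do.
per_minute = temp.resample("1min").mean()

print(temp.max())        # 130.0 -> the spike is visible in the raw signal
print(per_minute.max())  # ~93.3 -> the spike has been averaged away
```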

Especially for training ML models, companies often fail to collect sufficient contextual data. Using engine temperature as an example, it may be good to know that an engine overheated at some point, but if you don’t collect contextual data to determine under what circumstances overheating happened, you won’t be able to train an ML model.
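
As a hypothetical illustration, the sketch below joins overheating events with separately collected contextual signals; only the joined result contains both the label and the circumstances a model could learn from. All names and values are made up.

```python
import pandas as pd

# Labels only: knowing when an engine overheated isn't enough to learn from.
events = pd.DataFrame({
    "vehicle_id": ["A", "B"],
    "timestamp": pd.to_datetime(["2024-07-01 12:03", "2024-07-01 13:17"]),
    "overheated": [1, 1],
})

# Contextual signals collected alongside: ambient temperature and engine load.
context = pd.DataFrame({
    "vehicle_id": ["A", "A", "B", "B"],
    "timestamp": pd.to_datetime(["2024-07-01 12:00", "2024-07-01 12:02",
                                 "2024-07-01 13:10", "2024-07-01 13:15"]),
    "ambient_c": [34.0, 35.0, 29.0, 30.0],
    "load_pct": [80, 95, 60, 90],
})

# Attach the most recent context to each event so the training set holds both
# the label and the circumstances under which overheating happened.
training = pd.merge_asof(events.sort_values("timestamp"),
                         context.sort_values("timestamp"),
                         on="timestamp", by="vehicle_id")
print(training)
```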

The second major challenge is data access. As people and teams become aware of data being the new oil, we see a tendency to shield it from others in the company or beyond. Legislation such as GDPR and the Data Act reinforces this behavior, and teams and functions can easily hide behind it to avoid having to provide access. Many AI use cases, however, require access to often highly disparate data sources to yield valuable inference results. Since agentic AI in particular can relate factors in ways that we, as humans, cannot, these agents need access to data, which can be a challenge for both technical and organizational reasons.

One aspect that’s often underestimated is that a team asked to provide access to its data takes on a significant amount of additional work and responsibility. As soon as others start to rely on the data, the team is expected to provide data of sufficient quality and to ensure that the data stream keeps flowing and the data semantics don’t change unexpectedly whenever the team shifts its priorities. When data needs to be productized, as in this case, few recognize that this comes with a bill that somebody has to pick up.

The volume of all the data collected and stored easily becomes a significant challenge as well. Some of the companies we interviewed store well over 100 petabytes of data. Cloud solutions offer more than enough space for this, but the associated cost, especially for downloading the data or processing it in the cloud, rapidly becomes a major burden. At the same time, many of the companies we interviewed have a strong “just in case” mindset, meaning that they store vast amounts of data just in case it may come in handy in the future.

In addition to the cost and practicalities of storing the data, there’s also the issue of simply accessing and using it. Finding the right data, ensuring that the semantics are correct and avoiding using unsuitable data is proving to be a major challenge for companies.

The fourth challenge many companies struggle with is dynamic data and the associated model drift. In many contexts, the data collected from systems changes over time. In e-commerce, different seasons exhibit different buying behavior from customers. In automotive, vehicles behave differently during different times of the year due to, for instance, changes in outside temperature. When machine learning models are trained statically and deployed, their performance will deteriorate over time. The answer typically is to maintain a window with the most recent data, preferably labelled, that can be used to retrain models periodically. The challenge often is to determine whether poor results are a one-off anomaly or indicative of drift that requires retraining.
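
One way to frame that decision is sketched below: keep a sliding window of recent model error rates and only trigger retraining when poor results persist rather than show up as a one-off dip. The window size and thresholds are illustrative assumptions, not recommendations.

```python
from collections import deque

class DriftMonitor:
    """Sliding-window check for persistent performance degradation."""
    def __init__(self, window_days=30, baseline_error=0.05,
                 tolerance=1.5, min_bad_days=7):
        self.errors = deque(maxlen=window_days)  # most recent daily error rates
        self.threshold = baseline_error * tolerance
        self.min_bad_days = min_bad_days

    def add_daily_error(self, error: float) -> bool:
        """Record today's error rate; return True if retraining looks warranted."""
        self.errors.append(error)
        bad_days = sum(e > self.threshold for e in self.errors)
        # A single bad day is treated as an anomaly; a sustained run suggests drift.
        return bad_days >= self.min_bad_days

monitor = DriftMonitor()
for daily_error in [0.04, 0.05, 0.09, 0.05, 0.08, 0.09, 0.10, 0.11, 0.09, 0.10]:
    if monitor.add_daily_error(daily_error):
        print("drift suspected -> retrain on the most recent labelled window")
```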

Finally, there’s the notion of data security. This is a multi-faceted challenge. On the one hand, it includes maintaining regulatory compliance by limiting access to data and ensuring that there are no data leaks via AI models or otherwise. On the other hand, it’s also concerned with data poisoning, where malicious actors seek to inject bad data into the system to affect model performance. Whereas a human data analyst may be able to spot this, ML models and AI agents are quite sensitive to data poisoning.
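
As a rough illustration of one defence against poisoning, the sketch below compares each incoming training batch against trusted historical data and holds back batches whose distribution has shifted implausibly far; the data and threshold are made-up assumptions.

```python
import numpy as np

def batch_looks_poisoned(trusted: np.ndarray, incoming: np.ndarray,
                         max_shift_in_std: float = 3.0) -> bool:
    """Flag a batch whose mean sits implausibly far from the trusted data."""
    shift = abs(incoming.mean() - trusted.mean()) / (trusted.std() + 1e-9)
    return shift > max_shift_in_std

rng = np.random.default_rng(0)
trusted = rng.normal(90.0, 5.0, size=10_000)        # historical engine temperatures
clean_batch = rng.normal(90.0, 5.0, size=500)
poisoned_batch = rng.normal(140.0, 5.0, size=500)   # maliciously injected readings

print(batch_looks_poisoned(trusted, clean_batch))     # False
print(batch_looks_poisoned(trusted, poisoned_batch))  # True
```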

There are many solutions available to address these challenges. Data validation pipelines and anomaly detection, data sharing agreements, schema mapping tools to describe data semantics, encryption, federated learning, monitoring, periodic retraining and cost-benefit analysis are only some examples. However, not all companies are using these techniques and many of the interviewees reported running into some or most of the aforementioned challenges.

Adopting AI models and AI agents requires companies to embark on a journey to mature their data practices, which is often challenging as it requires many decisions on what to collect, at what frequency, where to store it, how to process it and how to make sure it can be used both for training and inference. The main challenges companies experience include quality, access, volume, dynamism and security. Solutions exist to address these, but they come with costs and require effort and key competencies. Frequently, at least one of these factors is a bottleneck. In the end, to quote Tim Berners-Lee, “data is a precious thing that will last longer than the systems themselves.”

Want to read more like this? Sign up for my newsletter at jan@janbosch.com or follow me on janbosch.com/blog, LinkedIn (linkedin.com/in/janbosch) or X (@JanBosch).