One of the things that keeps surprising me, over and over again, is how much effort companies spend on processing, cleaning, converting and preparing data. At the companies I work with, data science teams easily spend 90 to 95 percent of their time just preparing data for use in machine learning/deep learning deployments. This is caused by several challenges, of which data silos, lack of labeled data, unbalanced training sets, training/serving skew and unstable data pipelines are the most common.
Data silos arise from the typical situation in which every team and department conducts its own data collection for its own purposes. As many companies use DevOps, continuous deployment or some other form of frequent, periodic delivery, the semantics of the data tend to change with every new version of the software. That makes it difficult to use data collected over longer periods for training, requiring data scientists and engineers to manually review the data from every period and convert it into a larger, homogeneous set that can be used for training.
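To make the version problem concrete, here is a minimal sketch of the kind of conversion work involved. The schema versions and field names (`temp`, `temperature_c`) are hypothetical, invented purely for illustration; the point is that each software release may require its own normalization rule before the data can be merged into one training set.

```python
# Sketch: normalizing records collected under different software
# versions into one schema. Field names and versions are hypothetical.

def normalize(record):
    version = record.get("schema_version", 1)
    if version == 1:
        # Suppose v1 logged temperature in Fahrenheit under "temp"
        return {"temperature_c": (record["temp"] - 32) * 5 / 9}
    if version == 2:
        # Suppose v2 renamed the field and switched to Celsius
        return {"temperature_c": record["temperature_c"]}
    raise ValueError(f"unknown schema version {version}")

records = [
    {"schema_version": 1, "temp": 212.0},
    {"schema_version": 2, "temperature_c": 37.0},
]
homogeneous = [normalize(r) for r in records]
```

Every new release potentially adds another branch to a function like this, which is exactly the manual review-and-convert burden described above.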
The second challenge is that although most companies have vast amounts of data available, typically very little of it is labeled. That means data scientists and engineers have to label it manually or find automated ways to label data as it is generated. Depending on the task, this can be very time consuming and challenging. For instance, in prediction tasks, the delay between making a prediction and the actual outcome becoming known can be significant, further complicating labeling efforts.
Third, in many cases, the ratio between different classes of events is enormous. For instance, in preventive maintenance applications, machines may run without issues for months or years on end before reaching a state where they fail. This means that the data for the training set is highly unbalanced. Naively training on this data can easily produce a system that never indicates an upcoming maintenance issue, yet still reaches accuracy in the high 99 percent range, simply because most of the time no preventive maintenance is required.
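This "accuracy paradox" is easy to demonstrate. The sketch below assumes a hypothetical maintenance log with one failure per 1,000 observations; a trivial model that always predicts "no failure" scores 99.9 percent accuracy while catching exactly zero failures.

```python
# Sketch: the accuracy paradox on a highly unbalanced data set.
# The 1-in-1000 failure rate is an illustrative assumption.

def evaluate(labels, predictions):
    """Return overall accuracy and recall on the positive (failure) class."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    positives = [i for i, y in enumerate(labels) if y == 1]
    true_pos = sum(1 for i in positives if predictions[i] == 1)
    accuracy = correct / len(labels)
    recall = true_pos / len(positives) if positives else 0.0
    return accuracy, recall

labels = [0] * 999 + [1]          # 999 healthy observations, 1 failure
predictions = [0] * 1000          # naive model: always "no failure"

accuracy, recall = evaluate(labels, predictions)
# accuracy is 0.999 even though recall on failures is 0.0
```

This is why accuracy alone is a misleading metric here; recall (or precision/recall on the rare class) exposes the problem immediately.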
The unbalanced training sets, of course, need to be balanced by humans for training purposes, but this leads to the fourth challenge: training/serving skew. Due to the large amount of manual work that data scientists and engineers put into the data sets, among other things, a significant gap can easily emerge between model performance in training and in operation. Some companies use A/B testing as a mechanism to prevent underperforming models from receiving too much traffic, but that is not always feasible.
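One common form of that manual rebalancing work is resampling. The sketch below shows naive random oversampling of the minority class, one of several standard techniques (class weighting and undersampling are alternatives); the data and function name are illustrative, not from the article.

```python
import random

# Sketch: naive random oversampling of the minority class.
# Data and names are illustrative.

def oversample(samples, labels, minority=1, seed=42):
    """Duplicate minority-class samples until the classes are balanced."""
    rng = random.Random(seed)
    majority_n = sum(1 for y in labels if y != minority)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    balanced_x, balanced_y = list(samples), list(labels)
    while sum(1 for y in balanced_y if y == minority) < majority_n:
        i = rng.choice(minority_idx)
        balanced_x.append(samples[i])
        balanced_y.append(minority)
    return balanced_x, balanced_y

X = [[0.1], [0.2], [0.3], [0.9]]   # three majority samples, one minority
y = [0, 0, 0, 1]
Xb, yb = oversample(X, y)
```

Note that such manipulation is precisely what distances the training distribution from the serving distribution, feeding the training/serving skew described above.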
Finally, data pipelines have traditionally received much less mature engineering than code pipelines. This leads to data sources frequently not being updated, or providing data with changed semantics. The most dangerous case is when engineers have inserted defensive mechanisms into the data pipeline that serve the model stale data, under the assumption that any data stream-related issue is temporary. If part of the pipeline then breaks permanently, the model may underperform or stop performing altogether, and it may still take a long time for this to surface.
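A simple countermeasure is to make staleness explicit rather than silent. The sketch below (class and parameter names are hypothetical) tracks the last successful update and flags the feed as stale once a configured threshold is exceeded, so a permanently broken pipeline triggers an alert instead of quietly serving old data.

```python
import time

# Sketch: an explicit freshness guard for a data pipeline.
# Names and thresholds are illustrative assumptions.

class FreshnessGuard:
    def __init__(self, max_staleness_seconds, clock=time.time):
        self.max_staleness = max_staleness_seconds
        self.clock = clock                  # injectable for testing
        self.last_update = clock()

    def record_update(self):
        """Call on every successful data refresh."""
        self.last_update = self.clock()

    def is_stale(self):
        """True once the feed has gone quiet longer than the threshold."""
        return self.clock() - self.last_update > self.max_staleness

# Usage with a fake clock to show the behavior deterministically:
now = [1000.0]
guard = FreshnessGuard(max_staleness_seconds=60, clock=lambda: now[0])
now[0] += 30
fresh_check = guard.is_stale()   # within threshold: not stale
now[0] += 31
stale_check = guard.is_stale()   # 61s since last update: stale
```

Wiring such a check into monitoring turns the "model silently degrades for weeks" failure mode into an immediate, visible pipeline alert.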
To address these issues, the concept of DataOps is being adopted in the research and industry communities. The idea is to automate, instrument and maintain the data pipeline for analytics and ML/DL. There is even a DataOps manifesto with 18 principles!
In conclusion, with many companies increasingly depending on data, analytics and AI, the quality of the data and the stability of data pipelines are of growing importance. DataOps may well be a key to achieving what we refer to as operational AI. Make sure you give your data the attention it deserves!