AI engineering part 2: data versioning and dependency management


In my last column, I presented our research agenda for AI engineering. This time, we’re going to focus on one of the topics on that agenda, namely data versioning and dependency management. Even though the big data era has been with us for over a decade now, many of the companies we work with are still struggling with their data pipelines, data lakes and data warehouses.

As we mostly work with the embedded-systems industry in the B2B space, one of the first challenges many companies struggle with concerns data access and ownership. As I discussed in an earlier post, rather than allowing your data to exist in some kind of grey zone where it’s unclear who owns what, it’s critical to explicitly settle questions around access, usage and ownership of data between your customers and your company. And of course, we need to be clear and transparent about how the data is used, as well as how it’s anonymized and aggregated before being shared with others.
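
To make that last step concrete, here’s a minimal sketch, in Python, of the kind of pseudonymization and aggregation that could happen before any data leaves the company. The record format, salt handling and customer names are all hypothetical, not taken from any company we work with.

```python
import hashlib
import statistics

# Hypothetical record format: (customer_id, metric_value). In practice the
# salt would be a managed secret; it's inlined here only to keep the sketch
# self-contained.
SALT = "rotate-me-regularly"

def pseudonymize(customer_id: str) -> str:
    # Replace the raw customer identifier with a salted hash.
    return hashlib.sha256((SALT + customer_id).encode()).hexdigest()[:12]

def aggregate_for_sharing(records):
    # Share only per-customer aggregates, never the raw rows.
    by_customer = {}
    for customer_id, value in records:
        by_customer.setdefault(pseudonymize(customer_id), []).append(value)
    return {cid: {"n": len(vals), "mean": statistics.fmean(vals)}
            for cid, vals in by_customer.items()}

print(aggregate_for_sharing([("acme", 3.2), ("acme", 4.0), ("globex", 1.1)]))
```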

The second challenge in this space is associated with the increasing use of DevOps. As data generation is a much less mature technology than, for instance, API management in software, teams tend to make rather ad-hoc changes to the way log data is generated, believing they’re the only consumers of the data and that it’s used solely to evaluate the behavior of the functionality they’re working on. Consequently, other consumers of the data tend to experience frequent disruptions, both in the delivery of the data stream and in its content.
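
One way to make such changes visible rather than silent is to have every producer stamp its records with an explicit schema version that consumers validate on ingestion. The sketch below assumes JSON-style log records; all field names and version numbers are made up for illustration.

```python
# Minimal sketch of explicit log-schema versioning: consumers detect a
# format change instead of silently absorbing it.
REQUIRED_FIELDS = {
    1: {"timestamp", "component", "message"},
    2: {"timestamp", "component", "message", "deployment_id"},
}

def validate(record: dict) -> dict:
    # Reject records that don't declare a known schema version or that
    # are missing the fields that version promises.
    version = record.get("schema_version")
    if version not in REQUIRED_FIELDS:
        raise ValueError(f"unknown schema version: {version!r}")
    missing = REQUIRED_FIELDS[version] - record.keys()
    if missing:
        raise ValueError(f"v{version} record missing fields: {missing}")
    return record

validate({"schema_version": 2, "timestamp": "2023-01-01T00:00:00Z",
          "component": "engine-ctrl", "message": "start",
          "deployment_id": "d-17"})
```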

Frequent changes to data formats and generation mechanisms are especially challenging for machine learning (ML) applications, as the performance of ML models is highly dependent on the quality of the data; changes to the data can therefore cause unexpected performance degradations. Also, as ML models tend to be very data-hungry, we typically want to use large data sets for training and, consequently, combine the data from multiple sprints and DevOps deployments into a single training and validation data set. However, if the data generated by each deployment is subtly (or not so subtly) different, that becomes challenging.
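
As a rough illustration, the sketch below checks whether two deployment batches of a single numeric feature look similar enough to pool into one training set. The tolerance is arbitrary and a real setup would use a proper two-sample test rather than this crude mean comparison.

```python
import statistics

def comparable(batch_a, batch_b, tol=0.25):
    # Compare the batch means in units of the pooled standard deviation;
    # the threshold is arbitrary, chosen only for illustration.
    mean_a, mean_b = statistics.fmean(batch_a), statistics.fmean(batch_b)
    pooled_sd = statistics.pstdev(batch_a + batch_b) or 1e-9
    return abs(mean_a - mean_b) / pooled_sd <= tol

sprint_12 = [0.9, 1.1, 1.0, 1.2, 0.8]
sprint_13 = [1.0, 1.3, 0.9, 1.1, 1.0]

if comparable(sprint_12, sprint_13):
    training_set = sprint_12 + sprint_13  # reasonably safe to combine
else:
    print("distribution shift between deployments; investigate before merging")
```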

The third challenge is that data pipelines tend to have implicit dependencies that surface unexpectedly when changes or improvements are implemented. Consumers of data streams can suddenly be cut off and, as the functionality implemented by the consumer is typically business critical, this easily leads to firefighting to get the consumer of the data back online. However, even if this may be a nice endorphin kick for the cowboys in the organization, the fact of the matter is that we shouldn’t experience these kinds of problems to begin with. Instead, the parties generating, processing and consuming data need to be properly governed and the evolution of the pipeline and its contents should be coordinated among the affected players.
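
A small step toward making such dependencies explicit, sketched below with made-up stream and consumer names, is a registry in which every consumer declares the streams it depends on, so a producer can see the blast radius of a change before making it.

```python
from collections import defaultdict

# Hypothetical dependency registry: stream name -> consumers that rely on it.
registry: dict[str, set[str]] = defaultdict(set)

def register(consumer: str, stream: str) -> None:
    registry[stream].add(consumer)

def affected_by(stream: str) -> set[str]:
    # Who breaks if this stream changes or disappears?
    return registry[stream]

register("fleet-analytics", "engine-telemetry-v1")
register("ml-training", "engine-telemetry-v1")
print(affected_by("engine-telemetry-v1"))  # {'fleet-analytics', 'ml-training'}
```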

These are just some of the challenges associated with data management; in earlier research, we’ve provided a comprehensive overview of them. In our current research, we’re working on a domain-specific language to model data pipelines, including the processing and storage nodes as well as the connectors between them. The long-term goal is to generate operational pipelines that include monitoring solutions able to detect the absence of data streams, even in the case of batch data delivery, as well as a host of other deviations.
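
The DSL itself is still work in progress, so the sketch below only illustrates the underlying idea in plain Python: connectors between nodes declare their expected delivery cadence, and a monitor flags a stream as absent when that cadence is violated. Node names and time-outs are hypothetical.

```python
import time
from dataclasses import dataclass, field

@dataclass
class Connector:
    source: str
    sink: str
    max_silence_s: float  # declared maximum gap between deliveries
    last_delivery: float = field(default_factory=time.monotonic)

    def record_delivery(self) -> None:
        self.last_delivery = time.monotonic()

    def is_stale(self) -> bool:
        # Detects an absent stream, even for batch delivery, by comparing
        # the time since the last delivery against the declared cadence.
        return time.monotonic() - self.last_delivery > self.max_silence_s

pipeline = [Connector("ingest", "cleanse", max_silence_s=3600),
            Connector("cleanse", "feature-store", max_silence_s=7200)]

for c in pipeline:
    if c.is_stale():
        print(f"ALERT: no data on {c.source} -> {c.sink}")
```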

In addition, we’ve worked on a “data linter” solution that warns when the content of the data changes, ranging from simple issues such as missing or out-of-range values to more complicated ones such as statistical distributions shifting over time. The solution can warn about problems, reject data and trigger mitigation strategies that address the issues without interrupting operations. Please contact me if you’d like to learn more.
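
To give a flavour of what such a linter checks, and without claiming this is our actual implementation, here is a hypothetical sketch that flags missing values, out-of-range values and a drifting mean against a declared reference profile.

```python
import math
import statistics

# Hypothetical spec for one numeric field: valid range plus a reference
# distribution profile that incoming data is compared against.
SPEC = {"min": 0.0, "max": 120.0, "ref_mean": 60.0, "ref_std": 15.0}

def lint(values, spec=SPEC):
    findings = []
    missing = sum(1 for v in values
                  if v is None or (isinstance(v, float) and math.isnan(v)))
    if missing:
        findings.append(f"{missing} missing values")
    present = [v for v in values if v is not None and not math.isnan(v)]
    out_of_range = [v for v in present if not spec["min"] <= v <= spec["max"]]
    if out_of_range:
        findings.append(f"{len(out_of_range)} out-of-range values")
    # Crude two-sigma test on the sample mean against the reference profile.
    if present and abs(statistics.fmean(present) - spec["ref_mean"]) > \
            2 * spec["ref_std"] / math.sqrt(len(present)):
        findings.append("mean has drifted from reference profile")
    return findings

print(lint([55.0, 61.2, None, 130.0, 58.9]))
```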

Concluding, data management, including versioning and dependencies, is a surprisingly complicated topic that many companies haven’t yet wrestled to the ground. The difference in maturity between the way we deal with software and with data is simply staggering, especially in embedded-systems companies where data was traditionally only used for defect management and quality assurance. In our research, we work with companies to make a step-function change to the way data is collected, processed, stored, managed and exploited. As data is the new oil, according to some, it’s critical to take it as seriously as any other asset that you have available in your business.

To get more insights earlier, sign up for my newsletter at jan@janbosch.com or follow me on janbosch.com/blog, LinkedIn (linkedin.com/in/janbosch) or Twitter (@JanBosch).