{"id":1076,"date":"2020-04-28T13:31:12","date_gmt":"2020-04-28T13:31:12","guid":{"rendered":"http:\/\/janbosch.com\/blog\/?p=1076"},"modified":"2020-04-28T13:31:19","modified_gmt":"2020-04-28T13:31:19","slug":"ai-engineering-part-2-data-versioning-and-dependency-management","status":"publish","type":"post","link":"https:\/\/janbosch.com\/blog\/index.php\/2020\/04\/28\/ai-engineering-part-2-data-versioning-and-dependency-management\/","title":{"rendered":"AI engineering part 2: data versioning and dependency management"},"content":{"rendered":"\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"683\" src=\"https:\/\/janbosch.com\/blog\/wp-content\/uploads\/2020\/04\/markus-spiske-466ENaLuhLY-unsplash-1024x683.jpg\" alt=\"\" class=\"wp-image-1077\" srcset=\"https:\/\/janbosch.com\/blog\/wp-content\/uploads\/2020\/04\/markus-spiske-466ENaLuhLY-unsplash-1024x683.jpg 1024w, https:\/\/janbosch.com\/blog\/wp-content\/uploads\/2020\/04\/markus-spiske-466ENaLuhLY-unsplash-300x200.jpg 300w, https:\/\/janbosch.com\/blog\/wp-content\/uploads\/2020\/04\/markus-spiske-466ENaLuhLY-unsplash-768x512.jpg 768w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><figcaption>Photo by Markus Spiske on Unsplash<\/figcaption><\/figure>\n\n\n\n<p>In my <a href=\"https:\/\/janbosch.com\/blog\/index.php\/2020\/04\/22\/ai-engineering-making-ai-real\/\">last column<\/a>,  I presented our research agenda for AI engineering. This time, we\u2019re  going to focus on one of the topics on that agenda, ie data versioning  and dependency management. Even though the big data era has been with us  for over a decade now, many of the companies that we work with are  still struggling with their data pipelines, data lakes and data  warehouses.<\/p>\n\n\n\n<p>As we mostly work with the embedded systems industry in the B2B  space, one of the first challenges many companies struggle with is  access to data and ownership issues. 
As I discussed in an <a href=\"https:\/\/janbosch.com\/blog\/index.php\/2019\/11\/15\/get-your-data-out-of-the-gray-zone\/\">earlier post<\/a>, rather than allowing your data to exist in some kind of grey zone where it\u2019s unclear who owns what, it\u2019s critical to address questions around access, usage and ownership of data between your customers and your company. And of course, we need to be clear and transparent on the use of the data, as well as how the data is anonymized and aggregated before being shared with others.<\/p>\n\n\n\n<p>The second challenge in this space is associated with the increasing \nuse of DevOps. As data generation is much less mature as a technology \nthan, for instance, API management in software, teams tend to make \nrather ad-hoc changes to the way log data is generated as they believe \nthey\u2019re the only consumers of the data, using it only \nto evaluate the behavior of the functionality that the team is working \non. Consequently, other consumers of the data tend to experience \nfrequent disruptions to the data stream, as well as to its content.<\/p>\n\n\n\n<p>The frequent changes to data formats and ways of generation are \nespecially challenging for machine learning (ML) applications as the \nperformance of the ML models is highly dependent on the quality of the \ndata. So, changes to the data can cause unexpected degradations of \nperformance. Also, as ML models tend to be very data hungry, we \ntypically want to use large data sets for training and, consequently, \ncombine the data from multiple sprints and DevOps deployments into a \nsingle training and validation data set. However, if the data generated \nby each deployment is subtly (or not so subtly) different, combining it \nbecomes challenging.<\/p>\n\n\n\n<p>The third challenge is that data pipelines tend to have implicit \ndependencies that can unexpectedly surface when implementing changes or \nimprovements. 
Consumers of data streams can suddenly be switched off and,\n as the functionality implemented by the consumer is typically \nbusiness critical, this easily leads to\n firefighting actions to get the consumer of the data back online. \nHowever, even if this may be a nice endorphin kick for the cowboys in \nthe organization, we shouldn\u2019t have \nexperienced these kinds of problems in the first place. Instead, the parties\n generating, processing and consuming data need to be properly governed \nand the evolution of the pipeline and its contents should be coordinated\n among the affected players.<\/p>\n\n\n\n<p>These are just some of the challenges associated with data management. In <a href=\"https:\/\/scholar.google.se\/scholar?hl=en&amp;as_sdt=0,5&amp;cluster=8579068069408331806\">earlier research<\/a>,\n we\u2019ve provided a comprehensive overview of the data management \nchallenges. In our current research, we\u2019re working on a domain-specific \nlanguage to model data pipelines, including the processing and storage \nnodes, as well as their mutual connectors. The long-term goal is to be \nable to generate operational pipelines that include monitoring solutions\n that can detect the absence of data streams, even in the case of batch \ndelivery of data, as well as a host of other deviations.<\/p>\n\n\n\n<p>In addition, we\u2019ve worked on a \u201cdata linter\u201d solution that can warn \nwhen the content of the data changes, ranging from simple changes such \nas missing or out-of-range data to more complicated ones such as \nshifting statistical distributions over time. The solution can warn, \nreject data and trigger mitigation strategies that address the problems \nwith the data without interrupting the operations. 
Please contact me if \nyou\u2019d like to learn more.<\/p>\n\n\n\n<p>To conclude, data management, including versioning and dependencies, is a surprisingly complicated topic that many companies haven\u2019t yet wrestled to the ground. The difference in maturity between the way we deal with software and with data is simply staggering, especially in embedded systems companies where data was traditionally used only for defect management and quality assurance. In our research, we work with companies to make a step-function change to the way data is collected, processed, stored, managed and exploited. As data is, according to some, the new oil, it\u2019s critical to take it as seriously as any other asset that you have available in your business.<\/p>\n\n\n\n<p><em>To get more insights earlier, sign up for my newsletter at&nbsp;<\/em><a rel=\"noreferrer noopener\" href=\"mailto:jan@janbosch.com\" target=\"_blank\"><em>jan@janbosch.com<\/em><\/a><em> or follow me on<\/em><a rel=\"noreferrer noopener\" href=\"https:\/\/janbosch.com\/blog\" target=\"_blank\"> <em>janbosch.com\/blog<\/em><\/a><em>, LinkedIn (<\/em><a rel=\"noreferrer noopener\" href=\"https:\/\/www.linkedin.com\/in\/janbosch\/\" target=\"_blank\"><em>linkedin.com\/in\/janbosch<\/em><\/a><em>) or Twitter (<\/em><a rel=\"noreferrer noopener\" href=\"https:\/\/twitter.com\/JanBosch\" target=\"_blank\"><em>@JanBosch<\/em><\/a><em>).<\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>In my last column, I presented our research agenda for AI engineering. This time, we\u2019re going to focus on one of the topics on that agenda, ie data versioning and dependency management. 
Even though the big data era has been with us for over a decade now, many of the companies that we work with &#8230; <a title=\"AI engineering part 2: data versioning and dependency management\" class=\"read-more\" href=\"https:\/\/janbosch.com\/blog\/index.php\/2020\/04\/28\/ai-engineering-part-2-data-versioning-and-dependency-management\/\" aria-label=\"Read more about AI engineering part 2: data versioning and dependency management\">Read more<\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"generate_page_header":"","footnotes":""},"categories":[4,8,10],"tags":[],"_links":{"self":[{"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/posts\/1076"}],"collection":[{"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/comments?post=1076"}],"version-history":[{"count":1,"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/posts\/1076\/revisions"}],"predecessor-version":[{"id":1078,"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/posts\/1076\/revisions\/1078"}],"wp:attachment":[{"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/media?parent=1076"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/categories?post=1076"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/tags?post=1076"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}