{"id":967,"date":"2019-09-25T17:12:52","date_gmt":"2019-09-25T17:12:52","guid":{"rendered":"http:\/\/janbosch.com\/blog\/?p=967"},"modified":"2019-09-25T17:13:07","modified_gmt":"2019-09-25T17:13:07","slug":"dataops-the-key-to-operational-ai","status":"publish","type":"post","link":"https:\/\/janbosch.com\/blog\/index.php\/2019\/09\/25\/dataops-the-key-to-operational-ai\/","title":{"rendered":"DataOps: the key to operational AI"},"content":{"rendered":"\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"681\" src=\"https:\/\/janbosch.com\/blog\/wp-content\/uploads\/2019\/09\/nasa-Q1p7bh3SHj8-unsplash-1024x681.jpg\" alt=\"\" class=\"wp-image-969\" srcset=\"https:\/\/janbosch.com\/blog\/wp-content\/uploads\/2019\/09\/nasa-Q1p7bh3SHj8-unsplash-1024x681.jpg 1024w, https:\/\/janbosch.com\/blog\/wp-content\/uploads\/2019\/09\/nasa-Q1p7bh3SHj8-unsplash-300x200.jpg 300w, https:\/\/janbosch.com\/blog\/wp-content\/uploads\/2019\/09\/nasa-Q1p7bh3SHj8-unsplash-768x511.jpg 768w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><figcaption>Photo by NASA on Unsplash\n<\/figcaption><\/figure>\n\n\n\n<p>One of the things that keep surprising me over and over again is how much effort companies spend on processing, cleaning, converting and preparing data. For the companies that I work with, the data science teams easily spend 90-95 percent of their time just preparing data for use in machine learning\/deep learning deployments. This is caused by several challenges, of which data silos, lack of labeled data, unbalanced training sets, training\/serving skew and unstable data pipelines are the most common.<\/p>\n\n\n\n<p>Data silos arise from \nthe typical situation in which every team and department conducts its own \ndata collection for its own purposes. 
As many companies use DevOps, \ncontinuous deployment or some type of frequent, periodic delivery, the \nsemantics of the data tend to change with every new version of the \nsoftware. That makes it difficult to use data collected over longer \nperiods for training, requiring data scientists and engineers to \nmanually review the data from every period and convert it to create a \nlarger, homogeneous set that can be used for training.<\/p>\n\n\n\n<p>The second \nchallenge is that although most companies have vast amounts of data \navailable, typically very little of that data is labeled. That means \nthat data scientists and engineers have to label it manually or find \nother, automated ways to label data as it is generated. Depending on the\n task, this can be very time-consuming and challenging. For instance, in\n the case of prediction, the time delay between the prediction and the \nactual outcome becoming known can be quite significant, complicating the\n labeling efforts.<\/p>\n\n\n\n<p>Third, in many cases, the ratio between \ndifferent classes of activities is extremely skewed. For instance, in the\n case of preventive maintenance applications, machines may perform \nwithout issues for months or years on end before reaching a state where \nthey fail. This means that the data for the training set is highly \nunbalanced. Naively training on this data can easily yield a model\n that never indicates an upcoming maintenance issue yet still reports \nan accuracy above 99 percent, simply because most of the time the machine doesn\u2019t \nrequire preventive maintenance.<\/p>\n\n\n\n<p>The\n unbalanced training sets, of course, need to be balanced by humans for \ntraining purposes, but this leads to the fourth challenge, \ntraining\/serving skew. Due to the large amount of manual work data \nscientists and engineers put into the data sets, among other things, a \nsignificant gap can easily emerge between model performance in training and\n in operation. 
Some companies use A\/B testing as a mechanism to keep \nunderperforming models from receiving too much traffic, but that\u2019s not always \nfeasible.<\/p>\n\n\n\n<p>Finally, data pipelines have traditionally been engineered \nwith far less rigor than code pipelines. As a result, \ndata sources frequently stop being updated or start providing data \nwith different semantics. The most dangerous case of this is when the \nengineers have inserted defensive mechanisms in the data pipeline, \nserving the model with stale data under the assumption that any \ndata-stream issues are temporary. If a part of the pipeline then \npermanently breaks, the model may underperform or not perform at all, but\n it may still take a long time for this to surface.<\/p>\n\n\n\n<p>To address these issues, the concept of <a href=\"https:\/\/en.wikipedia.org\/wiki\/DataOps\">DataOps<\/a>\n has been adopted in the research and industry communities. The notion is to \nautomate, instrument and maintain the data pipeline for analytics and \nML\/DL. There is even a <a href=\"https:\/\/www.dataopsmanifesto.org\/\">manifesto<\/a> with 18 principles!<\/p>\n\n\n\n<p>Concluding, with many companies increasingly depending on data, analytics and AI, the quality of the data and the stability of data pipelines are of increasing importance. DataOps may well be a key to achieving what we refer to as operational AI. 
Make sure you give your data the attention it deserves!<\/p>\n\n\n\n<p><em>To get more insights earlier, sign up for my newsletter at <\/em><a href=\"mailto:jan@janbosch.com\" target=\"_blank\" rel=\"noreferrer noopener\"><em>jan@janbosch.com<\/em><\/a><em> or follow me on <\/em><a href=\"https:\/\/janbosch.com\/blog\" target=\"_blank\" rel=\"noreferrer noopener\"><em>janbosch.com\/blog<\/em><\/a><em>, LinkedIn (<\/em><a href=\"https:\/\/www.linkedin.com\/in\/janbosch\/\" target=\"_blank\" rel=\"noreferrer noopener\"><em>linkedin.com\/in\/janbosch<\/em><\/a><em>) or Twitter (<\/em><a href=\"https:\/\/twitter.com\/JanBosch\" target=\"_blank\" rel=\"noreferrer noopener\"><em>@JanBosch<\/em><\/a><em>).<\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>One of the things that keep surprising me over and over again is how much effort companies spend on processing, cleaning, converting and preparing data. For the companies that I work with, the data science teams easily spend 90-95 percent of their time just preparing data for use in machine learning\/deep learning deployments. 
This is &#8230; <a title=\"DataOps: the key to operational AI\" class=\"read-more\" href=\"https:\/\/janbosch.com\/blog\/index.php\/2019\/09\/25\/dataops-the-key-to-operational-ai\/\" aria-label=\"Read more about DataOps: the key to operational AI\">Read more<\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"generate_page_header":"","footnotes":""},"categories":[15,4,10],"tags":[],"_links":{"self":[{"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/posts\/967"}],"collection":[{"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/comments?post=967"}],"version-history":[{"count":2,"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/posts\/967\/revisions"}],"predecessor-version":[{"id":970,"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/posts\/967\/revisions\/970"}],"wp:attachment":[{"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/media?parent=967"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/categories?post=967"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/tags?post=967"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}