{"id":949,"date":"2019-08-29T08:00:47","date_gmt":"2019-08-29T08:00:47","guid":{"rendered":"http:\/\/janbosch.com\/blog\/?p=949"},"modified":"2019-08-28T13:50:54","modified_gmt":"2019-08-28T13:50:54","slug":"ai-is-not-about-data-sets","status":"publish","type":"post","link":"https:\/\/janbosch.com\/blog\/index.php\/2019\/08\/29\/ai-is-not-about-data-sets\/","title":{"rendered":"AI is not about data sets"},"content":{"rendered":"\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"535\" src=\"https:\/\/janbosch.com\/blog\/wp-content\/uploads\/2019\/08\/wu-yi-ySm840-Uehc-unsplash-1024x535.jpg\" alt=\"\" class=\"wp-image-951\" srcset=\"https:\/\/janbosch.com\/blog\/wp-content\/uploads\/2019\/08\/wu-yi-ySm840-Uehc-unsplash-1024x535.jpg 1024w, https:\/\/janbosch.com\/blog\/wp-content\/uploads\/2019\/08\/wu-yi-ySm840-Uehc-unsplash-300x157.jpg 300w, https:\/\/janbosch.com\/blog\/wp-content\/uploads\/2019\/08\/wu-yi-ySm840-Uehc-unsplash-768x401.jpg 768w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><figcaption>Photo by wu yi on Unsplash \n<\/figcaption><\/figure>\n\n\n\n<p>As I\u2019m spending an increasing amount of time in the AI field with a variety of companies, I\u2019ve noticed an interesting misconception in the ML\/DL space. Many have a tendency to focus on data sets, experimenting with different models using a specific set of data and, finally, deploying a model in a specific context. This approach is very linear and \u2018waterfall-ish\u2019.<\/p>\n\n\n\n<p>It\u2019s easy to understand why many end up here \nas most data scientists spend the vast majority of their time \nevaluating, pre-processing and cleaning data sets and the thought of \nhaving to do this continuously is less than appealing. Instead, many \ndata scientists want to create a model of satisfactory accuracy and move\n on to the next project.<\/p>\n\n\n\n<p>This is a rather naive and simplistic \nviewpoint. 
In the successful, real-world systems that I\u2019ve been exposed \nto, teams use an iterative approach where new data is constantly \ncollected, the model is regularly updated and retrained \nand, finally, the retrained model is deployed in production \nwith the same DevOps heartbeat that all the system\u2019s components use. \nModel deployment often occurs in some form of A\/B testing environment \nto validate that the new version performs better not only in \ntraining but in practice as well.<\/p>\n\n\n\n<p>The\n point is that AI, and specifically ML\/DL, isn\u2019t about creating an ML\/DL \nmodel by laboriously processing and cleaning a specific data set, but \nrather about building a system where the data collected during the current \ndeployment can be used to update and train the model for the next \ndeployment. Once creating this system becomes the focus, the priorities of \nthe team will shift as well. There are several changes to consider.<\/p>\n\n\n\n<p>First,\n data generated during operation should be collected, prepared and used in \ntraining, preferably in a fully automated way. Many data scientists spend almost all their time cleaning and \npre-processing data and then train using standard algorithms from a \nvariety of publicly available libraries. Consequently, many find it \ndifficult to imagine a world where data can be used for training without\n any human interaction. However, when building this system, that\u2019s \nexactly what the focus should be.<\/p>\n\n\n\n<p>Second, once deployed, the model\n should provide mechanisms to continuously track its performance. 
As \nmodels are updated and retrained repeatedly, we need to verify during \ntraining as well as during operations that they\u2019re performing well.<\/p>\n\n\n\n<p>Third,\n subsequent versions of models should be evaluated to ensure that \ncontinuous improvement is achieved \u2013 not only in training but especially\n in operation. This is where A\/B testing or multi-armed bandit \n(MAB) systems come in, allowing you to feed the newly deployed model a \nsmall slice of the data traffic and to crank this up over time as its \nperformance proves to be better than that of the old model.<\/p>\n\n\n\n<p>Finally, we \nneed a hierarchical value model to link the performance of ML\/DL models \nat the component or subsystem level to value delivery at the system and \ncustomer level. ML\/DL models operate in the context of a larger system \nand don\u2019t just need to perform in their own scope, but must also \ncontribute to delivering customer value at the system level. Connecting \ntheir performance with the overall system performance and ensuring that \nwe avoid the local optimization trap is critical for long-term success.<\/p>\n\n\n\n<p>In conclusion, although many working in the AI space focus heavily on data sets, if only because that is where they spend most of their time, AI, and ML\/DL in particular, isn\u2019t about that. Instead, it\u2019s about building a DevOps system where the ML\/DL components are constantly improved and retrained based on the most recent data collected using the latest deployment. 
Getting that system right, both for the ML\/DL models and for the system as a whole, is the real challenge.<\/p>\n\n\n\n<p><em>To get more insights earlier, sign up for my newsletter at <\/em><a href=\"https:\/\/mailto:jan@janbosch.com\/\"><em>jan@janbosch.com<\/em><\/a><em> or follow me on <\/em><a href=\"https:\/\/janbosch.com\/blog\"><em>janbosch.com\/blog<\/em><\/a><em>, LinkedIn (<\/em><a href=\"https:\/\/www.linkedin.com\/in\/janbosch\/\"><em>linkedin.com\/in\/janbosch<\/em><\/a><em>) or Twitter (<\/em><a href=\"https:\/\/twitter.com\/JanBosch\"><em>@JanBosch<\/em><\/a><em>).<\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>As I\u2019m spending an increasing amount of time in the AI field with a variety of companies, I\u2019ve noticed an interesting misconception in the ML\/DL space. Many have a tendency to focus on data sets, experimenting with different models using a specific set of data and, finally, deploying a model in a specific context. This &#8230; <a title=\"AI is not about data sets\" class=\"read-more\" href=\"https:\/\/janbosch.com\/blog\/index.php\/2019\/08\/29\/ai-is-not-about-data-sets\/\" aria-label=\"Read more about AI is not about data sets\">Read 
more<\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"generate_page_header":"","footnotes":""},"categories":[15,4,10],"tags":[],"_links":{"self":[{"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/posts\/949"}],"collection":[{"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/comments?post=949"}],"version-history":[{"count":2,"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/posts\/949\/revisions"}],"predecessor-version":[{"id":952,"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/posts\/949\/revisions\/952"}],"wp:attachment":[{"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/media?parent=949"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/categories?post=949"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/tags?post=949"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}