{"id":981,"date":"2019-10-24T12:14:35","date_gmt":"2019-10-24T12:14:35","guid":{"rendered":"http:\/\/janbosch.com\/blog\/?p=981"},"modified":"2019-10-24T12:14:45","modified_gmt":"2019-10-24T12:14:45","slug":"how-clean-is-your-data","status":"publish","type":"post","link":"https:\/\/janbosch.com\/blog\/index.php\/2019\/10\/24\/how-clean-is-your-data\/","title":{"rendered":"How clean is your data?"},"content":{"rendered":"\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"471\" src=\"https:\/\/janbosch.com\/blog\/wp-content\/uploads\/2019\/10\/network-4556932_1920-1024x471.jpg\" alt=\"\" class=\"wp-image-982\" srcset=\"https:\/\/janbosch.com\/blog\/wp-content\/uploads\/2019\/10\/network-4556932_1920-1024x471.jpg 1024w, https:\/\/janbosch.com\/blog\/wp-content\/uploads\/2019\/10\/network-4556932_1920-300x138.jpg 300w, https:\/\/janbosch.com\/blog\/wp-content\/uploads\/2019\/10\/network-4556932_1920-768x353.jpg 768w, https:\/\/janbosch.com\/blog\/wp-content\/uploads\/2019\/10\/network-4556932_1920.jpg 1920w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><figcaption>Image by Pete Linforth from Pixabay\n<\/figcaption><\/figure>\n\n\n\n<p>One of the sayings that almost everyone in business has taken to  heart is \u201cdata is the new oil\u201d by Clive Humby. There are constant  discussions about data ownership between end customers and product  providers, as well as between OEMs and their suppliers. The first  start-ups are now trying to advise companies on how to monetize their  data, as well as creating commercial marketplaces for data sets and data  streams.<\/p>\n\n\n\n<p>Data indeed is very important as it allows for a number \nof changes in the way we operate. First, rather than basing ourselves on\n opinions, it allows us to make data-driven decisions. 
A clear example \nof this is the use of A\/B testing in online companies where every \nproduct decision, big or small, is first tested with thousands or \nmillions of customers before committing to it. Second, machine- and \ndeep-learning models typically need large amounts of training and \nvalidation data. For all the interest in \u201cone-shot learning\u201d and related\n approaches in the data science community, the fact is that almost all \nindustrial work in the community is supervised learning using lots and \nlots of data. Third, in the cases where the data from your primary \ncustomer base is relevant for another set of customers, it allows you to\n build a multi-sided ecosystem.<\/p>\n\n\n\n<p>Although\n this all sounds fabulous, my experience across a large number of \ncompanies is that the reality is much less rosy. In practice, the data \ncollected and stored by many companies leaves a lot to be desired. There\n are at least three patterns that I see happen frequently.<\/p>\n\n\n\n<p>First, \nthe metadata associated with a data set is often missing. This means \nthat the data contains lots of numbers and potentially some text fields \nbut the exact meaning of these has been lost in the mists of time. The \nkey people involved can often present some general idea of the data \ncontents but that\u2019s not good enough for decision making, training ML\/DL \nmodels or reselling the data. We need to be able to describe the data \nsemantics precisely and in detail.<\/p>\n\n\n\n<p>Second,\n most R&amp;D teams take the generation of data less seriously than the \nfunctionality of the system itself. Many forms of logging generate data \nin unstructured ways as the original intent was for human consumption \nand specifically for use by the developers themselves. This is \nexacerbated by the use of DevOps, as teams may actually change the \nmeaning of logs every sprint, making it difficult to use the data from \nmultiple software deployments. 
Even if the data can be used, it often \nrequires significant manual effort to align data collected after \ndifferent deployments.<\/p>\n\n\n\n<p>Third, the data in these data sets often\n comes from a mix of different processes that generate data and store it in the \nsame file or database table. As the data format then needs to be a \nsuperset of the various data records generated by different processes, \nwe either see a lot of \u201cNULL\u201d values or overloading of fields in the \ndata that different processes use with different semantics. An \nadditional complication is that the data is frequently part of a \nsequence and separating different data items into the intended data \nsequences is complicated.<\/p>\n\n\n\n<p>These patterns are just some of the challenges that we\u2019ve identified. There are several more that you can find in our research publications and some of the publicly available talks that I\u2019ve given. In conclusion, however, it\u2019s important to realize that although data is the new oil, it does require careful collection and processing to deliver on the promise of data-driven decision making, accurate and reliable training of ML\/DL models and monetizing your data. 
It\u2019s a good idea to frequently ask yourself and your teams the question: how clean is your data?<\/p>\n\n\n\n<p><em>To get more insights earlier, sign up for my newsletter at <\/em><a href=\"https:\/\/mailto:jan@janbosch.com\/\" target=\"_blank\" rel=\"noreferrer noopener\"><em>jan@janbosch.com<\/em><\/a><em> or follow me on<\/em><a href=\"https:\/\/janbosch.com\/blog\" target=\"_blank\" rel=\"noreferrer noopener\"> <em>janbosch.com\/blog<\/em><\/a><em>, LinkedIn (<\/em><a href=\"https:\/\/www.linkedin.com\/in\/janbosch\/\" target=\"_blank\" rel=\"noreferrer noopener\"><em>linkedin.com\/in\/janbosch<\/em><\/a><em>) or Twitter (<\/em><a href=\"https:\/\/twitter.com\/JanBosch\" target=\"_blank\" rel=\"noreferrer noopener\"><em>@JanBosch<\/em><\/a><em>).<\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>One of the sayings that almost everyone in business has taken to heart is \u201cdata is the new oil\u201d by Clive Humby. There are constant discussions about data ownership between end customers and product providers, as well as between OEMs and their suppliers. 
The first start-ups are now trying to advise companies on how to &#8230; <a title=\"How clean is your data?\" class=\"read-more\" href=\"https:\/\/janbosch.com\/blog\/index.php\/2019\/10\/24\/how-clean-is-your-data\/\" aria-label=\"Read more about How clean is your data?\">Read more<\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"generate_page_header":"","footnotes":""},"categories":[15,4],"tags":[],"_links":{"self":[{"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/posts\/981"}],"collection":[{"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/comments?post=981"}],"version-history":[{"count":1,"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/posts\/981\/revisions"}],"predecessor-version":[{"id":983,"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/posts\/981\/revisions\/983"}],"wp:attachment":[{"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/media?parent=981"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/categories?post=981"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/tags?post=981"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}