{"id":1015,"date":"2020-01-15T18:14:07","date_gmt":"2020-01-15T18:14:07","guid":{"rendered":"http:\/\/janbosch.com\/blog\/?p=1015"},"modified":"2020-01-15T18:14:08","modified_gmt":"2020-01-15T18:14:08","slug":"why-your-data-is-useless","status":"publish","type":"post","link":"https:\/\/janbosch.com\/blog\/index.php\/2020\/01\/15\/why-your-data-is-useless\/","title":{"rendered":"Why your data is useless"},"content":{"rendered":"\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"731\" src=\"https:\/\/janbosch.com\/blog\/wp-content\/uploads\/2020\/01\/town-sign-822236_1920-1024x731.jpg\" alt=\"\" class=\"wp-image-1016\" srcset=\"https:\/\/janbosch.com\/blog\/wp-content\/uploads\/2020\/01\/town-sign-822236_1920-1024x731.jpg 1024w, https:\/\/janbosch.com\/blog\/wp-content\/uploads\/2020\/01\/town-sign-822236_1920-300x214.jpg 300w, https:\/\/janbosch.com\/blog\/wp-content\/uploads\/2020\/01\/town-sign-822236_1920-768x548.jpg 768w, https:\/\/janbosch.com\/blog\/wp-content\/uploads\/2020\/01\/town-sign-822236_1920.jpg 1920w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><figcaption>Image by Gerd Altmann from Pixabay\n<\/figcaption><\/figure>\n\n\n\n<p>Virtually all organizations I work with have terabytes or even petabytes of data stored in different databases and file systems. However, there\u2019s a very interesting pattern I\u2019ve started to recognize in recent months. On the one hand, the data that gets generated is almost always intended for human interpretation. Consequently, there are lots of alphanumeric data, comments and other unstructured content in these files and databases. 
On the other hand, the size of the stored data is so phenomenally large that it\u2019s impossible for any human to make heads or tails of it.<\/p>\n\n\n\n<p>The consequence is that enormous amounts of time are required to preprocess the data to make it usable for training machine learning models or for inference with already trained models. Data scientists at a number of companies have told me that they and their colleagues spend well over 90 percent of their time and energy on this.<\/p>\n\n\n\n<p>For most organizations, therefore, the only way to generate any value from the vast amounts of data stored on their servers is to throw lots and lots of human resources at it. Since the business case for doing so is often unclear or insufficient, the only logical conclusion is that the vast majority of data stored at companies is simply useless. It\u2019s dead weight and will never generate any relevant business value. Although the saying goes that \u201cdata is the new oil\u201d, the reality is that most of it is mud pretending to be oil.<\/p>\n\n\n\n<p>Even if the data is relevant, there are several challenges associated with using it in analytics or machine learning. The first is timeliness: if you have a data set of, say, customer behavior that\u2019s 24, 12 or even only 6 months old, it\u2019s highly likely that your customer base has evolved and that preferences and behaviors have changed, invalidating your data set.<\/p>\n\n\n\n<p>Second, particularly in companies that release new software frequently, such as when using DevOps, the problem is that with every software version, the way data is generated may have changed. 
Especially when the data is generated for human consumption, e.g. engineers debugging systems in operation, it\u2019s time-consuming to merge data sets that were produced by different versions of the software.<\/p>\n\n\n\n<p>Third, in many organizations, multiple data sets are generated continuously, even by the same system. Deriving the information that\u2019s actually relevant for the company frequently requires combining data from different sets. The challenge is that different data sets may not use the same way of timestamping entries, may store data at very different levels of abstraction and frequency and may evolve in very unpredictable ways. This makes combining the data labor-intensive and any automation developed for the purpose very brittle and likely to fail unpredictably.<\/p>\n\n\n\n<p>My main message is that, rather than focusing on preprocessing data, we need to spend much more time and attention on how the data is produced in the first place. The goal should be to generate data such that it doesn\u2019t require any preprocessing at all. This opens up a host of use cases and opportunities that I\u2019ll discuss in future articles.<\/p>\n\n\n\n<p>In conclusion, for all the focus on data, the fact of the matter is that in most companies, most data is useless or requires prohibitive amounts of human effort to unlock the value it contains. Instead, we should focus on how we generate data in the first place. The goal should be to do that in such a way that the data can be used for analytics and machine learning without any preprocessing. 
So, clean up the mess, get rid of the useless data and generate data in ways that actually make sense.<\/p>\n\n\n\n<p><em>To get more insights earlier, sign up for my newsletter at&nbsp;<\/em><a rel=\"noreferrer noopener\" href=\"mailto:jan@janbosch.com\" target=\"_blank\"><em>jan@janbosch.com<\/em><\/a><em> or follow me on<\/em><a rel=\"noreferrer noopener\" href=\"https:\/\/janbosch.com\/blog\" target=\"_blank\"> <em>janbosch.com\/blog<\/em><\/a><em>, LinkedIn (<\/em><a rel=\"noreferrer noopener\" href=\"https:\/\/www.linkedin.com\/in\/janbosch\/\" target=\"_blank\"><em>linkedin.com\/in\/janbosch<\/em><\/a><em>) or Twitter (<\/em><a rel=\"noreferrer noopener\" href=\"https:\/\/twitter.com\/JanBosch\" target=\"_blank\"><em>@JanBosch<\/em><\/a><em>).<\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Virtually all organizations I work with have terabytes or even petabytes of data stored in different databases and file systems. However, there\u2019s a very interesting pattern I\u2019ve started to recognize in recent months. On the one hand, the data that gets generated is almost always intended for human interpretation. 
Consequently, there are lots of alphanumeric &#8230; <a title=\"Why your data is useless\" class=\"read-more\" href=\"https:\/\/janbosch.com\/blog\/index.php\/2020\/01\/15\/why-your-data-is-useless\/\" aria-label=\"Read more about Why your data is useless\">Read more<\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"generate_page_header":"","footnotes":""},"categories":[15,4],"tags":[],"_links":{"self":[{"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/posts\/1015"}],"collection":[{"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/comments?post=1015"}],"version-history":[{"count":1,"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/posts\/1015\/revisions"}],"predecessor-version":[{"id":1017,"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/posts\/1015\/revisions\/1017"}],"wp:attachment":[{"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/media?parent=1015"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/categories?post=1015"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/tags?post=1015"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}