{"id":1722,"date":"2023-09-11T07:19:11","date_gmt":"2023-09-11T07:19:11","guid":{"rendered":"https:\/\/janbosch.com\/blog\/?p=1722"},"modified":"2023-09-11T07:19:12","modified_gmt":"2023-09-11T07:19:12","slug":"theres-no-such-thing-as-the-data","status":"publish","type":"post","link":"https:\/\/janbosch.com\/blog\/index.php\/2023\/09\/11\/theres-no-such-thing-as-the-data\/","title":{"rendered":"There\u2019s no such thing as \u201cthe data\u201d"},"content":{"rendered":"\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"768\" src=\"https:\/\/janbosch.com\/blog\/wp-content\/uploads\/2020\/11\/franki-chamaki-1K6IQsQbizI-unsplash-1024x768.jpg\" alt=\"\" class=\"wp-image-1158\" srcset=\"https:\/\/janbosch.com\/blog\/wp-content\/uploads\/2020\/11\/franki-chamaki-1K6IQsQbizI-unsplash-1024x768.jpg 1024w, https:\/\/janbosch.com\/blog\/wp-content\/uploads\/2020\/11\/franki-chamaki-1K6IQsQbizI-unsplash-300x225.jpg 300w, https:\/\/janbosch.com\/blog\/wp-content\/uploads\/2020\/11\/franki-chamaki-1K6IQsQbizI-unsplash-768x576.jpg 768w, https:\/\/janbosch.com\/blog\/wp-content\/uploads\/2020\/11\/franki-chamaki-1K6IQsQbizI-unsplash.jpg 1920w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Photo by Franki Chamaki on Unsplash<\/figcaption><\/figure>\n\n\n\n<p>One of the typical patterns I experience a lot when working with companies on digitalization is the claim that they have all the data in the world. They can just pick what they want from the candy store to get the data-driven insights they\u2019re looking for. 
When inspecting the data in detail, it rapidly becomes obvious that the amount of data available is indeed humongous in volume but highly limited in terms of usefulness, if not entirely useless.<\/p>\n\n\n\n<p>There are many reasons why data can\u2019t be used for the purpose we had imagined, but three typical ones are lack of context, variants and confounding variables. Many companies, especially in the embedded-systems space, have collected data for decades, but the data collection was typically focused on quality assurance. Outlier data was collected as it was often indicative of specific defects that could then be more easily remedied by service staff. The problem is that for quality assurance, it\u2019s usually sufficient to record that an event has occurred and it\u2019s less relevant to know under what circumstances it happened. So, the context of the event isn\u2019t recorded, which makes the use of this data for any other purpose basically impossible.<\/p>\n\n\n\n<p>For example, in one of the automotive companies I work with, the vehicle recorded events where the engine temperature became too high. The initial purpose was that service staff could check these events when the vehicle was in the workshop as it would often indicate particular problems with the engine that the technician could then more easily fix by replacing specific parts. The engine design team became aware of this data being available and wanted to use it to optimize the design of future engines to avoid the overheating problem. At this point, it became clear that the context, ie what happened to the vehicle that caused the engine to overheat, was missing completely and hence the team couldn\u2019t use the data.<\/p>\n\n\n\n<p>Second, many SaaS companies have one version of their product and use DevOps to ensure that all servers are running the same version of the software. 
In the embedded-systems space, there tend to be several, if not many, variants of the product out in the field. Customers also configure the product according to their specific context and purpose and that of their company. Finally, as the product is owned by the customers, access to the data generated by the product isn\u2019t automatic and needs to be agreed upon with them. Consequently, many companies have limited data for each specific variant and configuration of the system.<\/p>\n\n\n\n<p>For example, in the automotive industry, a typical company will ship between 500,000 and a million vehicles per year. That seems like a very rich source of generated data. However, each region in the world puts its own requirements on these vehicles and consequently, a vehicle sold in the US can\u2019t be used in the same data pool as one sold in Europe. At least not unless the data is verified to indeed be comparable. In addition, vehicles come in different models, each of which will certainly not be comparable to the others. On top of this, each model for each region has several variants and each variant has numerous configuration options available. Consequently, even if shipping a million vehicles per year, many companies only have a smaller number, eg a few tens of thousands, of vehicles for each region-model-variant combination.<\/p>\n\n\n\n<p>Third, the typical case when working with data is that we\u2019re looking for a specific factor that can\u2019t be measured directly. Instead, we measure proxies with the intent of using them to determine or estimate the value of the factor we\u2019re really interested in. The challenge is that in many cases, numerous confounding variables cause the relation between the proxy and that factor to be tenuous at best.<\/p>\n\n\n\n<p>For example, in telecommunications, one of the key factors operators care about is customer satisfaction. 
The idea is that high customer satisfaction leads to lower churn, which is a major contributor to profitability as the cost of customer acquisition is quite high for operators. Although it\u2019s possible to measure customer satisfaction by asking people (assuming you can trust what they say), if we want to have a real-time and continuous assessment, we need to measure how the network is being used. Proxies for this are the data volume consumed and the bandwidth experienced. Although these are perfectly measurable, the question is what the actual relation to customer satisfaction is. It turns out that the use of a mobile network is highly influenced by numerous other factors, including the weather (people mostly use Wi-Fi at home), large events (like football world cups), vacation periods, and so on. None of these variables have a bearing on customer satisfaction, but they do influence the proxy actually measured.<\/p>\n\n\n\n<p>In my experience with quite a few companies, the usefulness of historical data is often highly limited. Instead, with the increasing adoption of DevOps, the better way tends to be to extend the software in the systems out in the field with the data collection functionality for the specific variable or factor you\u2019re looking to measure. This allows you to evaluate the usefulness of the collected data and change the code if it turns out that things aren\u2019t optimal.<\/p>\n\n\n\n<p>Second, rather than trying to determine correlations and causations between factors and variables using historical data, use A\/B testing instead. When done properly, A\/B testing allows for quantitative, statistically relevant conclusions concerning causation and relations between things that can be measured and things we care to know about. 
Of course, there\u2019s the never-ending debate between frequentist and Bayesian statistics aficionados, but in my view, it\u2019s more important to simply have the data and be able to analyze it.<\/p>\n\n\n\n<p>Third, although this is relevant for other contexts as well, reduce the number of variants of systems in the field and ensure that the data generated by each system is legally available to the company. Both are critical for easier analysis, as a higher number of comparable system instances leads to higher statistical relevance and shorter testing periods.<\/p>\n\n\n\n<p>There\u2019s no such thing as \u201cthe data.\u201d Instead, each question we seek to answer and each variable we look to track over time needs specific data to be collected from systems in the field. In practice, historical data often lacks context and is highly limited in volume. In addition, confounding variables complicate analysis to a significant extent. To address this, use DevOps to update the software in systems in the field to generate data when you need it. Use A\/B testing to establish statistically relevant causations between measurable factors, and limit the variants and configurations of deployed systems. As a quote often attributed to Mark Twain puts it: \u201cData is like garbage. You better know what you\u2019re going to do with it before you collect it.\u201d<\/p>\n\n\n\n<p><em>Want to read more like this? 
Sign up for my newsletter at\u00a0<a href=\"mailto:jan@janbosch.com\">jan@janbosch.com<\/a> or follow me on <a href=\"https:\/\/janbosch.com\/blog\">janbosch.com\/blog<\/a>, LinkedIn (<a href=\"https:\/\/www.linkedin.com\/in\/janbosch\/\">linkedin.com\/in\/janbosch<\/a>), <a href=\"https:\/\/janbosch.medium.com\/\">Medium<\/a> or Twitter (<a href=\"https:\/\/twitter.com\/JanBosch\">@JanBosch<\/a>).<\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>One of the typical patterns I experience a lot when working with companies on digitalization is the claim that they have all the data in the world. They can just pick what they want from the candy store to get the data-driven insights they\u2019re looking for. When inspecting the data in detail, it rapidly becomes &#8230; <a title=\"There\u2019s no such thing as \u201cthe data\u201d\" class=\"read-more\" href=\"https:\/\/janbosch.com\/blog\/index.php\/2023\/09\/11\/theres-no-such-thing-as-the-data\/\" aria-label=\"Read more about There\u2019s no such thing as \u201cthe data\u201d\">Read 
more<\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"generate_page_header":"","footnotes":""},"categories":[4,10],"tags":[],"_links":{"self":[{"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/posts\/1722"}],"collection":[{"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/comments?post=1722"}],"version-history":[{"count":1,"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/posts\/1722\/revisions"}],"predecessor-version":[{"id":1723,"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/posts\/1722\/revisions\/1723"}],"wp:attachment":[{"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/media?parent=1722"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/categories?post=1722"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/tags?post=1722"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}