There’s no such thing as “the data”


One of the patterns I encounter a lot when working with companies on digitalization is the claim that they have all the data in the world. They can just pick what they want from the candy store to get the data-driven insights they’re looking for. When inspecting the data in detail, it rapidly becomes obvious that the data available is indeed humongous in volume but highly limited in terms of usefulness, if not entirely useless.

There are many reasons why data can’t be used for the purpose we had imagined, but three exemplary ones are lack of context, variants and confounding variables. Many companies, especially in the embedded-systems space, have collected data for decades, but the data collection was typically focused on quality assurance. Outlier data was collected as it was often indicative of specific defects that could then be more easily remedied by service staff. The problem is that for quality assurance, it’s usually sufficient to record that an event has occurred; it’s less relevant to know under what circumstances it happened. So, the context of the event isn’t recorded, which makes using this data for any other purpose basically impossible.

For example, in one of the automotive companies I work with, the vehicle recorded events where the engine temperature became too high. The initial purpose was that service staff could check these events when the vehicle was in the workshop as they would often indicate particular problems with the engine that the technician could then more easily fix by replacing specific parts. The engine design team became aware of this data being available and wanted to use it to optimize the design of future engines to avoid the overheating problem. At this point, it became clear that the context, ie what happened to the vehicle that caused the engine to overheat, was missing completely and hence the team couldn’t use the data.

Second, many SaaS companies have one version of their product and use DevOps to ensure that all servers are running the same version of the software. In the embedded-systems space, there tend to be several, if not many, variants of the product out in the field. Customers also configure the product according to their specific context and purpose and that of their company. Finally, as the product is owned by the customers, access to the data generated by the product isn’t automatic and needs to be agreed upon with them. Consequently, many companies have limited data for each specific variant and configuration of the system.

For example, in the automotive industry, a typical company will ship between 500,000 and a million vehicles per year. That seems like a very rich source of generated data. However, each region in the world puts its own requirements on these vehicles and consequently, a vehicle sold in the US can’t be put in the same data pool as one sold in Europe, at least not unless the data is verified to indeed be comparable. In addition, vehicles come in different models, each of which is certainly not comparable to the others. On top of this, each model for each region has several variants and each variant has numerous configuration options available. Consequently, even when shipping a million vehicles per year, many companies only have a small number, eg a few tens of thousands, of vehicles for each region-model-variant combination.
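
To make this concrete with purely hypothetical numbers: a million vehicles split across, say, 4 regions, 5 models per region and 2 variants per model gives 1,000,000 / (4 × 5 × 2) = 25,000 vehicles per region-model-variant combination, and the configuration options fragment that pool even further before any analysis has started.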

Third, the typical case when working with data is that we’re looking for a specific factor that can’t be measured directly. Instead, we measure proxies with the intent of using them to determine or estimate the value of the factor we’re really interested in. The challenge is that in many cases, numerous confounding variables cause the relation between the proxy and that factor to be tenuous at best.

For example, in telecommunications, one of the key factors operators care about is customer satisfaction. The idea is that high customer satisfaction leads to lower churn, which is a major contributor to profitability as the cost of customer acquisition is quite high for operators. Although it’s possible to measure customer satisfaction by asking people (assuming you can trust what they say), if we want to have a real-time and continuous assessment, we need to measure how the network is being used. Proxies for this are the data volume consumed and the bandwidth experienced. Although these are perfectly measurable, the question is what the actual relation to customer satisfaction is. It turns out that the use of a mobile network is highly influenced by numerous other factors, including the weather (people mostly use Wi-Fi at home), large events (like football world cups), vacation periods, and so on. None of these variables have a bearing on customer satisfaction, but they do influence the proxy actually measured.
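
To make the effect tangible, here’s a minimal sketch in Python, with entirely made-up numbers, that simulates a proxy being pushed around by confounders while the factor we actually care about stays flat:

```python
import numpy as np

rng = np.random.default_rng(42)
n_days = 365

# Hypothetical simulation: customer satisfaction is held constant,
# but the measured proxy (data volume) is driven by confounders.
satisfaction = np.full(n_days, 7.5)          # flat "true" satisfaction score
bad_weather = rng.random(n_days) < 0.3       # people stay home and use Wi-Fi
big_event = rng.random(n_days) < 0.05        # e.g. a world-cup match day

data_volume = (
    100                                      # baseline volume per subscriber (MB)
    - 30 * bad_weather                       # traffic moves to home Wi-Fi
    + 80 * big_event                         # event-driven traffic spikes
    + rng.normal(0, 10, n_days)              # everyday noise
)

# The proxy varies strongly even though satisfaction never changes, so any
# "relation" we read off it reflects the confounders, not the customer.
print("std of satisfaction:", satisfaction.std())
print("std of data volume: ", round(data_volume.std(), 1))
print("correlation of volume with bad weather:",
      round(np.corrcoef(data_volume, bad_weather)[0, 1], 2))
```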

In my experience with quite a few companies, the usefulness of historical data is often highly limited. Instead, with the increasing adoption of DevOps, the better approach tends to be to extend the software in the systems out in the field with data collection functionality for the specific variable or factor you’re looking to measure. This allows you to evaluate the usefulness of the collected data and change the code if it turns out that things aren’t optimal.
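
As a rough illustration of what such purpose-built collection could look like, the sketch below assumes a device whose software can be updated over the air and a backend that receives the samples; all names (SIGNALS, read_signal, upload) are hypothetical placeholders:

```python
import json
import random
import time

# The signals to record alongside the event of interest -- the context that
# historical, QA-oriented data collection typically lacks.
SIGNALS = ["engine_temperature", "ambient_temperature", "vehicle_speed"]

def read_signal(name: str) -> float:
    """Stand-in for reading a sensor value on the device."""
    return round(random.uniform(0, 100), 1)

def collect_sample() -> dict:
    """Record the event together with the context needed to answer our question."""
    return {
        "timestamp": time.time(),
        "values": {name: read_signal(name) for name in SIGNALS},
    }

def upload(sample: dict) -> None:
    """Stand-in for sending the sample to the backend."""
    print(json.dumps(sample))

# In a DevOps setup, changing SIGNALS or the sampling logic is just another
# software update: ship it, check whether the data answers the question,
# and adjust the code if it doesn't.
upload(collect_sample())
```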

Second, rather than trying to determine correlations and causations between factors and variables using historical data, use A/B testing instead. When done properly, A/B testing allows for quantitative, statistically relevant conclusions concerning causation and relations between things that can be measured and things we care to know about. Of course, there’s the never-ending debate between frequentist and Bayesian statistics aficionados, but in my view, it’s more important to simply have the data and be able to analyze it.
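
As an illustration of what such an evaluation can look like, a simple A/B comparison of a binary outcome can be assessed with a standard two-proportion z-test; the sketch below uses made-up numbers:

```python
from math import sqrt, erfc

# Hypothetical A/B result: positive outcomes per group.
conv_a, n_a = 460, 10_000   # control
conv_b, n_b = 520, 10_000   # treatment

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)
se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))

z = (p_b - p_a) / se
p_value = erfc(abs(z) / sqrt(2))   # two-sided p-value, normal approximation

print(f"lift: {p_b - p_a:+.4f}, z = {z:.2f}, p = {p_value:.4f}")
```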

Third, although this is relevant for other contexts as well, reducing the number of variants of systems in the field and ensuring that the data generated by each system is legally available to the company is critical for easier analysis. A higher number of comparable system instances automatically leads to higher statistical relevance and shorter testing periods.
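
To illustrate why more comparable instances shorten testing periods, the sketch below applies the standard normal-approximation sample-size formula with illustrative numbers:

```python
from math import ceil

def samples_per_group(p_base, lift, z_alpha=1.96, z_beta=0.84):
    """Approximate per-group sample size to detect a lift in a proportion
    (standard normal-approximation formula; purely illustrative)."""
    p1, p2 = p_base, p_base + lift
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * var / lift ** 2)

# Hypothetical scenario: detect a 1-percentage-point change on a 5% baseline.
n = samples_per_group(p_base=0.05, lift=0.01)
print(f"~{n} comparable systems needed per group")

# The more comparable instances report each week, the shorter the test period.
for weekly_fleet in (5_000, 25_000, 100_000):
    print(f"{weekly_fleet:>7} instances/week per group -> ~{n / weekly_fleet:.1f} weeks")
```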

There’s no such thing as “the data.” Instead, each question we seek to answer and each variable we look to track over time needs specific data to be collected from systems in the field. In practice, historical data often lacks context and is highly limited in volume. In addition, confounding variables complicate analysis to a significant extent. To address this, use DevOps to update the software of systems in the field so they generate the data when you need it. Use A/B testing to establish statistically relevant causations between measurable factors, and limit the variants and configurations of deployed systems. As the saying often attributed to Mark Twain goes: “Data is like garbage. You better know what you’re going to do with it before you collect it.”

Want to read more like this? Sign up for my newsletter at jan@janbosch.com or follow me on janbosch.com/blog, LinkedIn (linkedin.com/in/janbosch), Medium or Twitter (@JanBosch).