From Agile to Radical: data infrastructure

Image by StockSnap from Pixabay

There’s a scenario that keeps repeating itself in our collaborations with a variety of companies. We come in and agree to work on a specific data-driven use case. The company claims to have vast amounts of data and nobody worries about the data not being available. That is, until they start looking into the details and, surprise, find out that they have lots of data, just not the data we need to address the use case. This scenario is so typical that I suspect many who work with anything data-related have experienced it.

In my experience, there are at least three causes: cost optimization, lack of context and wrong timescales. Cost optimization is something every engineer has to deal with, and collecting data for which there’s no clear use case is generally wasteful. It consumes compute power, storage and communication bandwidth, and especially for companies that put thousands or even millions of devices on the market, the cost associated with data easily becomes significant.
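
To make that concrete, here’s a back-of-the-envelope sketch in Python. All figures are illustrative assumptions rather than actual prices, but they show how quickly transfer and storage costs add up for a large fleet.

```python
# Back-of-the-envelope fleet data cost. All figures are illustrative
# assumptions, not actual prices: adjust for your own connectivity contract,
# storage provider and retention policy.
DEVICES = 1_000_000                # devices in the field (assumption)
MB_PER_DEVICE_PER_DAY = 5          # telemetry volume per device (assumption)
TRANSFER_COST_PER_GB = 0.50        # $/GB over a cellular/IoT data plan (assumption)
STORAGE_COST_PER_GB_MONTH = 0.02   # $/GB-month of object storage (assumption)

daily_gb = DEVICES * MB_PER_DEVICE_PER_DAY / 1024
yearly_gb = daily_gb * 365
transfer_cost_per_year = yearly_gb * TRANSFER_COST_PER_GB
# Data accumulates linearly over the year, so on average roughly half of the
# final volume sits in storage in any given month.
storage_cost_per_year = (yearly_gb / 2) * STORAGE_COST_PER_GB_MONTH * 12

print(f"Collected per year: {yearly_gb / 1024:,.0f} TB")       # ~1,740 TB
print(f"Transfer cost/year: ${transfer_cost_per_year:,.0f}")   # ~$891,000
print(f"Storage cost/year:  ${storage_cost_per_year:,.0f}")    # ~$214,000
```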

Lack of context is concerned with the situation where the primary data item, e.g. the engine temperature in a vehicle, is recorded without the contextual information. So, we may have data points that show that the engine overheated at some point since the last service, but we don’t know under what circumstances. One of the reasons for this is that many companies instrument their products for quality assurance and record data for fault tracing. However, when looking to use the same data for product performance analysis, things don’t work as well, because the requirements on the recorded data are different for product performance than for quality assurance.
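
A minimal sketch of the difference, assuming a hypothetical vehicle signal set; the field names are made up for illustration, but the contrast is the point.

```python
# Sketch of a bare measurement versus one that carries operating context.
# The field names are illustrative assumptions, not a real signal catalogue.
from dataclasses import dataclass

@dataclass
class BareReading:
    timestamp: float
    engine_temp_c: float        # tells us *that* the engine overheated

@dataclass
class ContextualReading:
    timestamp: float
    engine_temp_c: float
    vehicle_speed_kmh: float    # under what load did it happen?
    ambient_temp_c: float       # hot day or cooling-system fault?
    engine_load_pct: float
    altitude_m: float           # climbing a mountain pass?
    software_version: str       # which control logic was active?

# Fault tracing for quality assurance can live with BareReading; analyzing
# product performance usually needs something closer to ContextualReading.
```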

The timescale covered is another reason why data can be unsuitable for a specific use case. For instance, to save storage, many products collect aggregated data, such as the average power output over the last hour, the number of packages produced during a time period or the average speed of a vehicle. Many use cases, however, require fine-grained data with a data point per relevant event. In predictive maintenance use cases, it’s less about averages and more about extremes and anomalies. If you don’t have that data, the use case becomes impossible to realize.
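
A small sketch of the effect, using made-up temperature readings: the stored hourly average looks perfectly healthy, while the raw per-event readings contain exactly the spike a predictive maintenance model would need.

```python
# Made-up engine temperature readings over one hour, sampled per event.
readings = [78, 79, 80, 81, 79, 132, 80, 78, 79, 80, 81, 79]   # one spike to 132 °C

# What an aggregating product would store:
hourly_average = sum(readings) / len(readings)
print(f"Stored aggregate: {hourly_average:.1f} °C")    # 83.8 °C, looks unremarkable

# What a predictive maintenance use case actually needs:
threshold = 110
anomalies = [(i, r) for i, r in enumerate(readings) if r > threshold]
print(f"Anomalous readings: {anomalies}")              # [(5, 132)]
```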

Based on our experiences with various industries, it seems that few companies are intentional in planning the collection and storage of data. Many system and software architects are more concerned with realizing the system functionality than with the data that could or should be collected while the system is in operation. The point I’m trying to make here is that in a rapidly digitalizing world, we need to spend as much time on carefully designing the data infrastructure for collecting, processing and storing data as we do on designing the system functionality.

One of the key discussions around the data infrastructure is concerned with “just in case” versus “just in time” data. Proponents of the former want to collect everything about the system as we never know if and when we get to use the data. Proponents of the latter claim that we should start collecting data only when we have an actual use case requiring that data. The problem is that both have relevant arguments. For instance, predictive maintenance or other use cases that focus on anomalous behavior often need extensive data collection as the incidence rate of the anomalies is so low that it’s hard to collect enough data points for analytics or for training machine learning models.

Although I’m sympathetic to the “just in case” proponents in theory, practice shows that the “just in case” data almost always simply doesn’t match the needs of data-driven use cases. So, generally, I’m in favor of a data infrastructure that makes it easy to start collecting specific data when we need it and to put filters in place on products in the field so that only relevant data is sent back to the company.
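
As a sketch of what that could look like on the device side, here’s a hypothetical, remotely updatable collection filter. The configuration format and signal names are assumptions; the point is that what gets collected and sent back is a runtime decision, not a compile-time one.

```python
# Minimal sketch of a "just in time" collection filter running on a device in
# the field. Config format and signal names are illustrative assumptions.
import json

# Pushed to the device when a new use case needs data; empty by default.
collection_config = json.loads("""
{
  "signals": ["engine_temp_c", "engine_load_pct"],
  "send_if": {"engine_temp_c_above": 110},
  "sample_rate_hz": 10
}
""")

def should_upload(sample: dict, config: dict) -> bool:
    """Return True only for samples that match the currently active use case."""
    limit = config.get("send_if", {}).get("engine_temp_c_above")
    return limit is not None and sample.get("engine_temp_c", 0) > limit

def filter_payload(sample: dict, config: dict) -> dict:
    """Strip the sample down to the signals the back end actually asked for."""
    return {k: v for k, v in sample.items() if k in config["signals"]}

sample = {"engine_temp_c": 126.0, "engine_load_pct": 87.0, "cabin_temp_c": 21.5}
if should_upload(sample, collection_config):
    print(filter_payload(sample, collection_config))
    # {'engine_temp_c': 126.0, 'engine_load_pct': 87.0}
```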

A second discussion that’s very common in many companies is where to process and store data. In a typical embedded systems context, we can identify four levels. First, the device itself usually has some processing and storage capacity, even if it’s limited. Second, in the context of the device, but close to it, there often is some kind of edge computing and storage capacity available. In factories, for instance, there typically is a local compute infrastructure installed where data from the machinery in the factory can be processed and stored. Third, many companies have some on-premise IT infrastructure that’s typically used by the R&D department. The advantage is that this infrastructure is a capital expenditure, meaning that, once it’s in place, using it for processing and storage comes at no additional cost. Finally, virtually every company we work with has a public cloud setup where storage and processing are done on an infrastructure provided by third parties. Here, the business model typically is usage-based.
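
To make the decision space explicit, here’s a toy placement heuristic over these four levels. The thresholds and the factors it considers are purely illustrative assumptions; a real policy would weigh many more concerns.

```python
# Sketch of the four levels as an explicit placement decision rather than an
# implicit default. Thresholds and cost assumptions are illustrative only.
from enum import Enum

class Tier(Enum):
    DEVICE = "device"        # limited compute/storage, zero transfer cost
    EDGE = "edge"            # e.g. on-site factory servers
    ON_PREM = "on_premise"   # company IT/R&D infrastructure (capex)
    CLOUD = "public_cloud"   # third-party, usage-based pricing (opex)

def place(data_volume_gb_per_day: float, latency_critical: bool,
          reprocessed_often: bool) -> Tier:
    """Toy placement heuristic; a real policy would weigh many more factors."""
    if latency_critical:
        return Tier.DEVICE if data_volume_gb_per_day < 0.1 else Tier.EDGE
    if reprocessed_often and data_volume_gb_per_day > 100:
        return Tier.ON_PREM   # repeated large-scale processing: capex wins
    return Tier.CLOUD         # elastic, pay-as-you-go for everything else

print(place(0.05, latency_critical=True, reprocessed_often=False))   # Tier.DEVICE
print(place(500, latency_critical=False, reprocessed_often=True))    # Tier.ON_PREM
```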

As part of the architectural design of the data infrastructure, architects and others need to decide what data to store where, where to process the data for each use case, how to allow for the collection and processing of new data and how to optimize the cost and the value created by the various use cases. As there are multiple alternatives available and architecture responsibilities often don’t span everything from the device to the public cloud, arriving at a set of guidelines for making these tradeoffs proves to be surprisingly difficult for most companies.

One of the common misconceptions is that the public cloud is always the cheapest. Although uploading and storing data in the cloud is indeed very cheap, if you want to process this data or move it out to your own infrastructure, the pricing models are such that things become very expensive very rapidly. So, whether to store and process data locally or in a public cloud depends on the use case. Use cases where vast amounts of data are processed on a regular basis often benefit from an on-premise infrastructure due to the business models of cloud providers.
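
A toy comparison makes the tradeoff visible. All prices here are assumptions rather than quotes from any provider, but with repeated large-scale processing the usage-based cloud bill quickly overtakes an amortized on-premise cluster.

```python
# A use case that reprocesses a large dataset frequently. All prices are
# illustrative assumptions, not quotes from any provider.
DATASET_TB = 50
RUNS_PER_MONTH = 20                 # the data is reprocessed more or less daily
CLOUD_COMPUTE_PER_RUN = 900         # $/run on rented, usage-based compute (assumption)
EGRESS_PER_GB = 0.09                # $/GB to move data out of the cloud (assumption)
ONPREM_CLUSTER_MONTHLY = 8_000      # amortized capex + power + admin (assumption)

# Option A: keep the data in the public cloud and process it there.
cloud_monthly = RUNS_PER_MONTH * CLOUD_COMPUTE_PER_RUN

# Option B: pull the dataset out (worst case: once every month) and process it
# on your own infrastructure.
egress_monthly = DATASET_TB * 1024 * EGRESS_PER_GB
onprem_monthly = ONPREM_CLUSTER_MONTHLY + egress_monthly

print(f"Public cloud: ${cloud_monthly:,.0f}/month")    # $18,000
print(f"On-premise:   ${onprem_monthly:,.0f}/month")   # ~$12,608, even paying egress monthly
```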

Most companies have vast amounts of data available that prove to be entirely useless for most use cases. Instead, we need to carefully design a data infrastructure that provides suitable tradeoffs between several forces related to cost, extensibility and where to process and store data, as well as future-proofing the architecture. Typically, this requires careful orchestration across the company as well as with partners and suppliers to achieve a holistically optimized design. To end with a bit of a tongue-in-cheek quote from John Allen Paulos: “Data, data everywhere, but not a thought to think.”

Want to read more like this? Sign up for my newsletter at jan@janbosch.com or follow me on janbosch.com/blog, LinkedIn (linkedin.com/in/janbosch) or Twitter (@JanBosch).