How clean is your data?

Image by Pete Linforth from Pixabay

One of the sayings that almost everyone in business has taken to heart is Clive Humby’s “data is the new oil.” There are constant discussions about data ownership between end customers and product providers, as well as between OEMs and their suppliers. The first start-ups are now advising companies on how to monetize their data and are creating commercial marketplaces for data sets and data streams.

Data is indeed very important, as it changes the way we operate in several ways. First, it allows us to make data-driven decisions rather than relying on opinions. A clear example of this is the use of A/B testing in online companies, where every product decision, big or small, is first tested with thousands or millions of customers before committing to it. Second, machine- and deep-learning models typically need large amounts of training and validation data. For all the interest in “one-shot learning” and related approaches in the data science community, the fact is that almost all industrial work in the field is supervised learning that uses large amounts of data. Third, in cases where the data from your primary customer base is relevant for another set of customers, it allows you to build a multi-sided ecosystem.

Although this all sounds fabulous, my experience across a large number of companies is that the reality is much less rosy. In practice, the data collected and stored by many companies leaves a lot to be desired. There are at least three patterns that I see happen frequently.

First, the metadata associated with a data set is often missing. This means that the data contains lots of numbers and potentially some text fields but the exact meaning of these has been lost in the mists of time. The key people involved can often present some general idea of the data contents but that’s not good enough for decision making, training ML/DL models or reselling the data. We need to be able to describe the data semantics precisely and in detail.
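One way to keep the semantics from getting lost is to store a small data dictionary next to the data and to check records against it. The following is a minimal sketch in Python, not a prescribed method. The FieldSpec class, the telemetry field names and the units are all hypothetical and only serve as illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    """Describes what one field means (all names here are hypothetical)."""
    name: str
    dtype: type
    unit: str
    description: str


# Hypothetical data dictionary for a vehicle telemetry data set.
DATA_DICTIONARY = {
    spec.name: spec
    for spec in [
        FieldSpec("ts", float, "s since epoch (UTC)", "Time the sample was taken"),
        FieldSpec("speed", float, "km/h", "Vehicle speed from wheel sensors"),
        FieldSpec("gear", int, "none", "Selected gear, 0 = neutral"),
    ]
}


def undocumented_fields(record: dict) -> list[str]:
    """Return the fields in a record that the data dictionary does not describe."""
    return sorted(name for name in record if name not in DATA_DICTIONARY)


def type_mismatches(record: dict) -> list[str]:
    """Return documented fields whose value does not match the declared type."""
    return sorted(
        name
        for name, value in record.items()
        if name in DATA_DICTIONARY
        and not isinstance(value, DATA_DICTIONARY[name].dtype)
    )


record = {"ts": 1700000000.0, "speed": 87.5, "gear": "3", "x17": 42}
print(undocumented_fields(record))  # ['x17']
print(type_mismatches(record))      # ['gear']
```

Running such a check when data is ingested flags fields that nobody can explain while the people who produced them are still around to ask.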

Second, most R&D teams take the generation of data less seriously than the functionality of the system itself. Many forms of logging generate unstructured data because the original intent was human consumption, specifically by the developers themselves. This is exacerbated by DevOps, as teams may change the meaning of logs every sprint, which makes it difficult to combine data from multiple software deployments. Even when the data can be used, aligning data collected after different deployments often requires significant manual effort.
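To illustrate how the alignment effort can be reduced, the sketch below writes log events as structured JSON lines with an explicit schema version, so that records from older deployments can be mapped onto the current meaning in code. This is one possible approach under stated assumptions: the event name, the latency fields and the change between version 1 and version 2 are invented for the example.

```python
import json

SCHEMA_VERSION = 2  # hypothetical: bump whenever a field's meaning changes


def make_event(event: str, **fields) -> str:
    """Serialize one log event as a JSON line with an explicit schema version."""
    payload = {"schema_version": SCHEMA_VERSION, "event": event, **fields}
    return json.dumps(payload, sort_keys=True)


def normalize(line: str) -> dict:
    """Parse a log line and map older schema versions onto the current one."""
    record = json.loads(line)
    # Lines written before versioning was introduced are treated as version 1.
    version = record.get("schema_version", 1)
    if version == 1:
        # Hypothetical change: version 1 logged latency in seconds as "latency".
        record["latency_ms"] = record.pop("latency") * 1000
        record["schema_version"] = SCHEMA_VERSION
    return record


old_line = '{"event": "request_done", "latency": 0.25}'
new_line = make_event("request_done", latency_ms=180)

print(normalize(old_line)["latency_ms"])  # 250.0
print(normalize(new_line)["latency_ms"])  # 180
```

The point of the version field is that a change in meaning becomes an explicit, reviewable step in the code rather than tribal knowledge.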

Third, these data sets often contain a mix of data that different processes generate and store in the same file or database table. As the data format then needs to be a superset of the records generated by the different processes, we either see a lot of “NULL” values or overloaded fields that different processes use with different semantics. An additional complication is that the data is frequently part of a sequence, and separating the individual data items into their intended sequences is complicated.
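As a small illustration of untangling such a table, the sketch below groups rows by the process that wrote them and by a sequence identifier, and drops the NULL fields that belong to other processes. It assumes that every row carries a reliable source and sequence identifier, which in practice is often exactly what is missing. The column names and values are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows from one shared table written by two processes.
rows = [
    {"source": "charger", "seq_id": "c1", "step": 1, "value": 11.0, "code": None},
    {"source": "diagnostics", "seq_id": "d7", "step": 1, "value": None, "code": "P0420"},
    {"source": "charger", "seq_id": "c1", "step": 2, "value": 10.5, "code": None},
    {"source": "charger", "seq_id": "c2", "step": 1, "value": 7.2, "code": None},
]


def split_sequences(rows):
    """Group rows by (source, seq_id), drop NULL fields, and order by step."""
    sequences = defaultdict(list)
    for row in rows:
        key = (row["source"], row["seq_id"])
        # Keep only the fields this process actually filled in.
        clean = {k: v for k, v in row.items() if v is not None}
        sequences[key].append(clean)
    for items in sequences.values():
        items.sort(key=lambda r: r["step"])
    return dict(sequences)


sequences = split_sequences(rows)
print([r["value"] for r in sequences[("charger", "c1")]])  # [11.0, 10.5]
print(sorted(sequences))
```

Recording the source and sequence identifier at the moment the data is written is much cheaper than trying to reconstruct them afterwards.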

These patterns are just some of the challenges that we’ve identified. There are several more that you can find in our research publications and in some of the publicly available talks that I’ve given. In conclusion, it’s important to realize that although data is the new oil, it requires careful collection and processing to deliver on the promise of data-driven decision making, accurate and reliable training of ML/DL models, and monetization of your data. It’s a good idea to frequently ask yourself and your teams the question: how clean is your data?

To get more insights earlier, sign up for my newsletter or follow me on LinkedIn or Twitter (@JanBosch).