{"id":1026,"date":"2020-01-31T14:42:37","date_gmt":"2020-01-31T14:42:37","guid":{"rendered":"http:\/\/janbosch.com\/blog\/?p=1026"},"modified":"2020-01-31T14:42:42","modified_gmt":"2020-01-31T14:42:42","slug":"how-to-generate-data-for-machine-learning","status":"publish","type":"post","link":"https:\/\/janbosch.com\/blog\/index.php\/2020\/01\/31\/how-to-generate-data-for-machine-learning\/","title":{"rendered":"How to generate data for machine learning"},"content":{"rendered":"\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"410\" src=\"https:\/\/janbosch.com\/blog\/wp-content\/uploads\/2020\/01\/images-3853117_1920-1024x410.jpg\" alt=\"\" class=\"wp-image-1027\" srcset=\"https:\/\/janbosch.com\/blog\/wp-content\/uploads\/2020\/01\/images-3853117_1920-1024x410.jpg 1024w, https:\/\/janbosch.com\/blog\/wp-content\/uploads\/2020\/01\/images-3853117_1920-300x120.jpg 300w, https:\/\/janbosch.com\/blog\/wp-content\/uploads\/2020\/01\/images-3853117_1920-768x307.jpg 768w, https:\/\/janbosch.com\/blog\/wp-content\/uploads\/2020\/01\/images-3853117_1920.jpg 1920w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><figcaption>Image by Gerd Altmann from Pixabay\n<\/figcaption><\/figure>\n\n\n\n<p>In recent columns, I\u2019ve been sharing my view on the quality of the  data that many companies have in their data warehouses, lakes or swamps.  In my experience, most of the data that companies have stored so  carefully is useless and will never generate any value for the company.  The data that actually is potentially useful tends to require vast  amounts of preprocessing before it can be used for machine learning, for  example. 
As a consequence, in most data science teams, more than 90  percent of all time is spent on preprocessing the data before it can  even be used for analytics or machine learning.<\/p>\n\n\n\n<p>In a <a href=\"http:\/\/arxiv.org\/abs\/2001.10794\">paper<\/a> that we \nrecently submitted, we studied this problem for system logs. Virtually \nany software-intensive system generates data capturing the state and \nsignificant events in the system at important points in time. The \nchallenge is that, on the one hand, the data captured in logs is \nintended for human consumption and, consequently, contains a high \nvariability in the structure, content and type of the information for \neach log entry. On the other hand, the amount of data stored in logs \noften is phenomenally large. It\u2019s not atypical for systems to generate \ngigabytes of data for even a single day of operations.<\/p>\n\n\n\n<p>The obvious answer to this conundrum is to use machine learning to \nderive the relevant information from the system logs. This approach \nexperiences a number of significant challenges due to the way logs are \ngenerated. Based on our research in literature and company cases, we \nidentified several challenges.<\/p>\n\n\n\n<p>Due to the nature of data generation, the logs require extensive \npreprocessing, reducing the value. It\u2019s also quite common that multiple \nsystem processes write into the same log file, complicating time series \nanalysis and other machine learning techniques assuming sequential data.\n Conversely, many systems generate multiple types of log files and \nestablishing a reliable ground truth requires combining data from \nmultiple log files. These log files tend to contain data at \nfundamentally different levels of abstraction, complicating the training\n of machine learning models. Once we\u2019re able to apply machine learning \nmodels to the preprocessed data, interpretation of the results often \nrequires extensive domain knowledge. 
Developers are free to add new code\n to the system that generates log entries in ad-hoc formats. The \nchanging format of log files complicates the use of multiple logs for \ntraining machine learning models as the logs aren\u2019t necessarily \ncomparable. Finally, any tools built to process log files, such as \nautomated parsers, fail unpredictably and are very brittle, requiring \nconstant maintenance.<\/p>\n\n\n\n<p>We studied the problem specifically for system logs, but my \nexperience is that our findings are quite typical for virtually any type\n of automated data generation. Although this is a huge problem for \nalmost all companies that I work with and enormous amounts of resources \nare spent on preprocessing data to get value out of it, it\u2019s a losing \nbattle. The amount of data generated in any product, by customers, \nacross the company, and so on, will only continue to go up. If we don\u2019t \naddress this problem, every data scientist, engineer and mathematician \nwill soon be doing little else than preprocessing data.<\/p>\n\n\n\n<p>The solution, as we propose in the <a href=\"http:\/\/arxiv.org\/abs\/2001.10794\">paper<\/a>,\n is quite simple: rather than first generating the data and then \npreprocessing it, we need to build software to generate data in such a \nformat that preprocessing isn\u2019t required at all. Any data should be \ngenerated in such a way that it can immediately and automatically be \nused for machine learning. Preferably without any human intervention.<\/p>\n\n\n\n<p>Accomplishing this goal is a bit more involved than what I can \noutline in this post, but there are a number of key elements that I \nbelieve are common for any approach aiming to achieve this. First, all \ndata should be numerical. 
Second, all data of the nominal type \n(different elements have neither order nor relationship to each other) should\n be one-hot encoded, meaning that each element is mapped to a binary \nstring as long as the number of element types, with exactly one bit set. Third, data of the \nordinal type can use the same approach or, in the case of \nnon-dichotomous data, use a variety of encodings. Fourth, interval and \nratio data need to be normalized (mapped to a value between 0 and 1) \nfor optimal use by machine and deep-learning algorithms. Fifth, where \nnecessary, the statistical distribution of the data needs to be mapped \nto a standard Gaussian distribution for better training results.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><img decoding=\"async\" src=\"https:\/\/bits-chips.nl\/wp-content\/uploads\/2020\/01\/Jan-Bosch-34-Figure.jpg\" alt=\"\"\/><figcaption>System logging for machine learning<\/figcaption><\/figure><\/div>\n\n\n\n<p>Accomplishing this at the point of data generation may require \nengineers and developers to interact with data scientists. In addition, \nit calls for alignment across the organization, which hasn\u2019t been \nnecessary up to now. However, doing so allows companies to build systems\n that can fully autonomously collect data, train and retrain machine learning\n models and deploy these without any human involvement (see the figure).<\/p>\n\n\n\n<p>Concluding, most data in most companies is useless because it was generated in the wrong way and without proper structure, encoding and standardization. This is especially problematic when the data is used to train machine learning models, as it requires extensive preprocessing. Rather than improving our data preprocessing activities, we need to generate data in a way that removes the need for any preprocessing at all. Data scientists and engineers would benefit from focusing on how data should be generated. 
Rather than trying to  clean up the mess afterward, let\u2019s try to not create any mess to begin  with.<\/p>\n\n\n\n<p><em>For more information, you can read the paper that I mentioned here:<\/em><br><a href=\"https:\/\/arxiv.org\/abs\/2001.10794\">https:\/\/arxiv.org\/abs\/2001.10794<\/a><\/p>\n\n\n\n<p><em>To get more insights earlier, sign up for my newsletter at\u00a0<\/em><a rel=\"noreferrer noopener\" href=\"https:\/\/mailto:jan@janbosch.com\/\" target=\"_blank\"><em>jan@janbosch.com<\/em><\/a><em> or follow me on<\/em><a rel=\"noreferrer noopener\" href=\"https:\/\/janbosch.com\/blog\" target=\"_blank\"> <em>janbosch.com\/blog<\/em><\/a><em>, LinkedIn (<\/em><a rel=\"noreferrer noopener\" href=\"https:\/\/www.linkedin.com\/in\/janbosch\/\" target=\"_blank\"><em>linkedin.com\/in\/janbosch<\/em><\/a><em>) or Twitter (<\/em><a rel=\"noreferrer noopener\" href=\"https:\/\/twitter.com\/JanBosch\" target=\"_blank\"><em>@JanBosch<\/em><\/a><em>).<\/em>   <\/p>\n","protected":false},"excerpt":{"rendered":"<p>In recent columns, I\u2019ve been sharing my view on the quality of the data that many companies have in their data warehouses, lakes or swamps. In my experience, most of the data that companies have stored so carefully is useless and will never generate any value for the company. 
The data that actually is potentially &#8230; <a title=\"How to generate data for machine learning\" class=\"read-more\" href=\"https:\/\/janbosch.com\/blog\/index.php\/2020\/01\/31\/how-to-generate-data-for-machine-learning\/\" aria-label=\"Read more about How to generate data for machine learning\">Read more<\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"generate_page_header":"","footnotes":""},"categories":[15,4,10],"tags":[],"_links":{"self":[{"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/posts\/1026"}],"collection":[{"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/comments?post=1026"}],"version-history":[{"count":1,"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/posts\/1026\/revisions"}],"predecessor-version":[{"id":1028,"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/posts\/1026\/revisions\/1028"}],"wp:attachment":[{"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/media?parent=1026"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/categories?post=1026"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/janbosch.com\/blog\/index.php\/wp-json\/wp\/v2\/tags?post=1026"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}