{"id":2215,"date":"2026-04-13T11:09:46","date_gmt":"2026-04-13T11:09:46","guid":{"rendered":"https:\/\/janbosch.com\/blog\/?p=2215"},"modified":"2026-04-13T11:09:46","modified_gmt":"2026-04-13T11:09:46","slug":"who-needs-data-when-you-can-create-it","status":"publish","type":"post","link":"https:\/\/janbosch.com\/blog\/index.php\/2026\/04\/13\/who-needs-data-when-you-can-create-it\/","title":{"rendered":"Who needs data when you can create it?"},"content":{"rendered":"\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"768\" src=\"https:\/\/janbosch.com\/blog\/wp-content\/uploads\/2020\/11\/franki-chamaki-1K6IQsQbizI-unsplash-1024x768.jpg\" alt=\"\" class=\"wp-image-1158\" srcset=\"https:\/\/janbosch.com\/blog\/wp-content\/uploads\/2020\/11\/franki-chamaki-1K6IQsQbizI-unsplash-1024x768.jpg 1024w, https:\/\/janbosch.com\/blog\/wp-content\/uploads\/2020\/11\/franki-chamaki-1K6IQsQbizI-unsplash-300x225.jpg 300w, https:\/\/janbosch.com\/blog\/wp-content\/uploads\/2020\/11\/franki-chamaki-1K6IQsQbizI-unsplash-768x576.jpg 768w, https:\/\/janbosch.com\/blog\/wp-content\/uploads\/2020\/11\/franki-chamaki-1K6IQsQbizI-unsplash.jpg 1920w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Photo by Franki Chamaki on Unsplash<\/figcaption><\/figure>\n\n\n\n<p>Over the last decade, many of the companies I work with through Software Center have made significant investments in data. Sensors have been deployed, systems instrumented and pipelines built to collect and store vast amounts of information. In principle, this should provide a strong foundation for data- and AI-driven innovation.<\/p>\n\n\n\n<p>In practice, however, a recurring pattern emerges. When a team wants to develop a specific use case, such as predictive maintenance for a rare failure mode, perception for an edge case in autonomous driving or a new optimization algorithm, they often discover that the required data is either not available, too sparse, biased or simply inaccessible due to regulatory constraints. GDPR, regional data residency requirements, contractual limitations and internal governance policies frequently prevent data from being used in the way it was originally intended. As a result, despite \u201chaving a lot of data,\u201d companies often lack the right data.<\/p>\n\n\n\n<p>This is where synthetic data and simulation enter the picture. Rather than relying exclusively on real-world data, companies can generate artificial data that mimics reality. Synthetic data allows organizations to create training environments where edge cases can be produced on demand, sensitive information can be removed by design and scenarios can be explored that would be prohibitively expensive, or even impossible, to capture in the physical world.<\/p>\n\n\n\n<p>Startups such as Gretel.ai (bought by Nvidia last year) and Parallel Domain are building platforms that enable exactly this. They provide tools to generate realistic images, sensor data, tabular datasets and entire simulated environments tailored to specific use cases.<\/p>\n\n\n\n<p>The applications are already compelling. In autonomous vehicles, synthetic data is used to train perception systems on rare but critical scenarios, such as unusual weather conditions or near-accidents, that would take years to observe in the real world. In robotics, simulation environments allow systems to learn tasks through millions of iterations without the cost and wear of physical hardware. 
Startups such as Gretel.ai (bought by Nvidia last year) and Parallel Domain are building platforms that enable exactly this. They provide tools to generate realistic images, sensor data, tabular datasets and entire simulated environments tailored to specific use cases.

The applications are already compelling. In autonomous vehicles, synthetic data is used to train perception systems on rare but critical scenarios, such as unusual weather conditions or near-accidents, that would take years to observe in the real world. In robotics, simulation environments allow systems to learn tasks through millions of iterations without the cost and wear of physical hardware. In industrial settings, digital twins replicate factories or products, enabling experimentation and optimization without disrupting operations. And for many data-driven applications, synthetic datasets provide a way to develop and test models without exposing sensitive personal or proprietary information.

Beyond the technical capabilities, synthetic data introduces two important strategic shifts. First, for incumbents, it provides a way to circumvent the constraints of real-world data. Instead of being limited by what has been collected and what's legally permissible to use, organizations can generate the data they need, aligned with their use case and compliant by design. This is particularly important in regulated industries, where the friction associated with data access is often one of the main bottlenecks to innovation.

Second, for startups, synthetic data changes the competitive landscape. Traditionally, access to large proprietary datasets has been a key barrier to entry. Companies with scale had a significant advantage simply because they owned more data. Synthetic data weakens this advantage. Startups can now create high-quality training data without owning massive real-world datasets, allowing them to compete more effectively with incumbents.

Despite its promise, synthetic data comes with important limitations that are easy to underestimate. The central challenge is one of fidelity: How do we know that the generated data truly reflects the real world? Even small deviations in statistical distributions, correlations or edge case frequencies can lead to models that perform well in simulation but fail in practice. This is often referred to as the "sim-to-real gap," particularly in domains such as robotics and autonomous systems. If the synthetic environment doesn't capture the complexity, noise and unpredictability of reality, systems trained on it may learn the wrong abstractions. In addition, synthetic data generation itself encodes assumptions about what matters, what can be ignored and how variables interact, which may introduce hidden biases.

As a result, synthetic data rarely replaces real-world data entirely. Instead, it needs to be continuously validated and calibrated against real observations, with careful testing to ensure that models trained in simulated environments generalize reliably when deployed in the real world.
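In its simplest form, that validation can be a statistical comparison between synthetic and real observations before the data is allowed into training. A minimal sketch, assuming a single numeric feature and SciPy available; the feature, the deliberately "off" synthetic distribution and the acceptance threshold are all illustrative assumptions:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)

# Illustrative stand-ins: a sensor feature as observed in the field
# versus the same feature as produced by a slightly miscalibrated generator.
real_measurements = rng.normal(loc=10.0, scale=2.0, size=2_000)
synthetic_measurements = rng.normal(loc=10.3, scale=1.6, size=20_000)

# Two-sample Kolmogorov-Smirnov test: how far apart are the two distributions?
statistic, p_value = ks_2samp(real_measurements, synthetic_measurements)
print(f"KS statistic = {statistic:.3f}, p-value = {p_value:.2e}")

# A simple gate in a data-generation pipeline: recalibrate the generator
# when the synthetic marginal drifts too far from real observations.
MAX_ALLOWED_DISTANCE = 0.05  # illustrative threshold
if statistic > MAX_ALLOWED_DISTANCE:
    print("Synthetic data deviates from reality: recalibrate before training.")
```

Checks like this only catch drift in individual distributions; the harder failures, wrong correlations and missing edge cases, still require testing model behavior on held-out real data.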
Leading companies in synthetic data and simulation have converged on a set of practical tactics to make artificial data useful in the real world. One of the most important is domain randomization: Instead of trying to perfectly replicate reality, they deliberately vary parameters such as lighting, textures, object positions and sensor noise across a wide range so that models learn to generalize rather than overfit to a single 'perfect' simulation. Closely related is the practice of iterative sim-to-real validation, where models trained on synthetic data are continuously tested and fine-tuned on small amounts of real-world data to reduce the so-called reality gap. Another key tactic is increasing fidelity where it matters, such as using high-quality 3D assets, physically based rendering and realistic sensor models, to close appearance and content gaps between simulation and reality. At the same time, leading players increasingly incorporate domain knowledge into the generation process, ensuring that synthetic scenarios reflect real-world constraints and edge cases rather than purely random variation. Finally, many combine synthetic and real data in a hybrid approach: Large-scale synthetic data is used for pre-training, while smaller real datasets are used for calibration and validation. Together, these tactics reflect a shift from simply generating data to engineering data generation as a core capability, where the goal isn't realism per se, but reliable transfer of learning into the real world.
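As a rough illustration of the domain-randomization idea, the sketch below samples scene parameters from deliberately wide ranges and produces a toy "sensor reading" for each draw. The parameter names, ranges and the synthesize_sample stand-in for a real renderer or physics simulator are assumptions made purely for illustration, not any vendor's API.

```python
import numpy as np

rng = np.random.default_rng(42)

def randomized_scene_params() -> dict:
    """Sample scene parameters from deliberately wide ranges (domain randomization),
    so a model sees many variations instead of one 'perfect' simulation."""
    return {
        "light_intensity": rng.uniform(0.2, 1.8),    # under- to over-exposed
        "sensor_noise_std": rng.uniform(0.0, 0.15),  # clean to very noisy sensor
        "object_offset": rng.uniform(-0.5, 0.5),     # jitter in object position
    }

def synthesize_sample(params: dict) -> np.ndarray:
    """Toy stand-in for a renderer or simulator: a 1-D 'sensor reading'
    whose appearance depends on the randomized scene parameters."""
    signal = params["light_intensity"] * np.sin(
        np.linspace(0, 2 * np.pi, 128) + params["object_offset"]
    )
    noise = rng.normal(0.0, params["sensor_noise_std"], size=signal.shape)
    return signal + noise

# A randomized synthetic training set: every sample comes from a different
# variation of the scene, which pushes the model to generalize.
dataset = np.stack([synthesize_sample(randomized_scene_params()) for _ in range(1_000)])
print(dataset.shape)  # (1000, 128)
```

The same pattern feeds the hybrid approach described above: pre-train on large randomized synthetic sets, then calibrate and validate on the small amount of real data that is available.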
The whole picture points to a broader shift in how we think about data and advantage. Historically, success in data-driven systems was closely tied to owning the world: instrumenting reality, collecting data at scale and building proprietary datasets. Increasingly, however, we see the emergence of an alternative approach: simulating the world. Instead of waiting for data to be generated, companies can create it. The most interesting organizations going forward will likely combine both approaches. They'll use real-world data to ground their models in reality, but rely on synthetic data and simulation to scale, explore edge cases and accelerate learning. In that sense, synthetic data isn't just a technical solution; it represents a shift from a passive to an active approach to data: from collecting what happens to creating what's needed. To return to Peter Drucker: "The best way to predict the future is to create it."

Want to read more like this? Sign up for my newsletter at jan@janbosch.com or follow me on janbosch.com/blog, LinkedIn (linkedin.com/in/janbosch) or X (@JanBosch).