It’s impossible to understand what’s going on in the enterprise technology space without first understanding data and how it drives innovation. One could argue the reverse is true: that innovation in technology has driven the evolution of data use. Data and technology have, after all, been a chicken-and-egg conundrum for a long time.
Recent developments in technology — Web3, edge computing, AI, machine learning, the metaverse — are all dependent upon data, requiring massive datasets to function effectively. But running parallel to that is the growing awareness by technology users of the importance of their personal data and protecting that data from rogue use. There are also regulations intended to stop enterprises from using personal data without explicit consent, which has led to some of the biggest tech companies falling afoul of regulators.
This creates a separate problem: Without large amounts of data — and accurate data at that — it is impossible to train AI or ML. It also renders edge computing, for example, pointless and severely limits the development of the metaverse. As this becomes more pressing, a possible solution to the problem is gaining traction: synthetic data.
What Is Synthetic Data?
Synthetic data refers to artificial information created to serve as substitute data for developers and businesses.
David Proctor, senior database manager at Corona, Calif.-based Everconnect, said that beyond being an ethical alternative, synthetic data is useful across industries, from manufacturing and finance to social media and machine learning, because it replicates information gathered from real-world events without compromising the confidentiality of the original data. It reproduces the characteristics the user needs without tying them to specific individuals.
“There’re two problems with original data: It’s often highly secure/takes time to be accessible, and it can’t be used for testing hypothetical scenarios that aren’t available in its information,” he said. “So, synthetic data has its uses in a digital workplace. It allows data to be gathered and distributed between teams easily without having to worry about complying with privacy laws.”
There is one other advantage: Companies that can generate customizable data can make projects move faster, since users no longer have to wait for access to the original information. That autonomy lets employees work more efficiently and focus on revenue-driving tasks.
“Even when working digitally, developers and testers can remain confident in the fact that the data they’re using is accurate and high quality,” Proctor said.
Technological Revolution?
The evolution of data use is creating major disruption within a workplace already being disrupted by digital transformation. Matthew Paxton, founder and owner of gaming site Hypernia, said we are in the midst of a technological revolution fueled by intelligent decision-making. Data-driven decisions are only as precise and useful as the data the underlying models learn from and find patterns in.
While some forms of data are abundant, others, such as individual financial documents and medical records, are difficult, if not impossible — and illegal — for third parties to collect and use for training purposes. Even Facebook, a company known for its massive amounts of data, resorted to the acquisition of AI Reverie, a synthetic data business, to improve its training capabilities. Gartner predicts that by 2024, 60 percent of the data used for the creation of AI and analytics projects will be synthetically generated.
Machine learning algorithms are used to create synthetic data, consuming real data to train on patterns of behavior and generate fake data that retains the statistical properties of the original dataset. In simpler terms, synthetic data is created to mimic real-world scenarios.
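As a rough illustration of that idea, here is a minimal sketch (a simple Gaussian approximation, not any vendor’s actual pipeline) that fits the column means and covariance of a numeric “real” dataset, then samples new rows that preserve those statistics. Real tools use much richer models for categorical and non-normal data:

```python
import numpy as np

def synthesize(real: np.ndarray, n_samples: int, seed: int = 0) -> np.ndarray:
    """Sample synthetic rows that preserve the column means and
    covariance (hence the correlations) of the real numeric data."""
    rng = np.random.default_rng(seed)
    mean = real.mean(axis=0)
    cov = np.cov(real, rowvar=False)
    return rng.multivariate_normal(mean, cov, size=n_samples)

# Toy "real" dataset: two correlated columns (say, age and income).
rng = np.random.default_rng(42)
age = rng.normal(40, 10, size=1_000)
income = 1_000 * age + rng.normal(0, 5_000, size=1_000)
real = np.column_stack([age, income])

fake = synthesize(real, n_samples=1_000)
print("real corr:", np.corrcoef(real, rowvar=False)[0, 1])
print("fake corr:", np.corrcoef(fake, rowvar=False)[0, 1])  # close to real
```

The synthetic rows carry the statistical signal of the original table, but no row corresponds to a real individual.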
This is in contrast to the usual anonymized datasets, which can often be re-identified by linking them back to other data sources. Because synthetic data is generated rather than collected, it is not exposed to the same re-identification risk as real data.
“To put it another way, rather than being collected or measured in the real world, synthetic data is created in digital realms,” Paxton said. “Synthetic data, while artificial, theoretically and statistically reflects real-world data. It can be as good as or better than data collected on actual objects, events or people for training an AI model, according to research.”
According to Paxton, as the industry shifts from big data to smart data, privacy policies will continue to tighten and dark data will become more dangerous. Synthetic data will undoubtedly play a larger part in AI model training in the future. Accuracy, diversity and variability are all important factors in AI training, regardless of the use case. That’s why, when it comes to synthetic data design and execution, a human consultative approach that focuses on understanding the model’s requirements and providing flexible, iterative solutions is critical.
Creating Synthetic Data
Synthetic data creation is a difficult and complicated process. This kind of data can be simulated using a variety of techniques, from simple and random algorithms to complex machine learning models, said Michelangiolo Mazzeschi, lead data evangelist at Australian machine learning company Relevance AI.
The problem is that for complex use cases, such as self-driving cars, there is an endless need for data. Other times, the right kind of data is simply unavailable, the rare situations Tesla refers to as edge cases. Some scenarios occur too rarely in the real world to yield enough data to train reliable models. While it’s not easy, the most obvious solution in all these cases is to generate the data. There are multiple ways of generating synthetic data, depending on the complexity of the use case (brief sketches of each approach follow the list):
- Probability Distributions: A simple way of creating synthetic data is to draw random numbers from a chosen probability distribution. A good example is Monte Carlo simulation, which can generate simulated stock returns to estimate the potential future fluctuations of a portfolio.
- Multi-agent Simulations: Multi-agent simulations recreate real-world environments, such as a simulated city that a self-driving car must navigate. Such simulations are not only realistic, they may also contain mistakes, glitches and bugs (the edge cases) that match rare real-world scenarios, which, in turn, improves the robustness of the neural networks trained on them.
- GANs: Generative adversarial networks (GANs) are machine learning models built on a deep learning architecture that generate data from scratch. The output can be realistic to the point of being indistinguishable from real data to the human eye. These models are widely used to create deepfakes and AI-generated images, often for classification and labeling purposes.
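To make the first technique concrete, here is a minimal Monte Carlo sketch. The drift and volatility figures are illustrative assumptions, and geometric Brownian motion is the textbook model, not a recommendation:

```python
import numpy as np

def simulate_portfolio(value: float, mu: float, sigma: float,
                       days: int, n_paths: int, seed: int = 0) -> np.ndarray:
    """Simulate portfolio value paths under geometric Brownian motion:
    each day's log return is drawn from a normal distribution with
    annualized drift `mu` and volatility `sigma`."""
    rng = np.random.default_rng(seed)
    dt = 1 / 252  # one trading day as a fraction of a year
    log_returns = rng.normal((mu - 0.5 * sigma**2) * dt,
                             sigma * np.sqrt(dt),
                             size=(n_paths, days))
    return value * np.exp(np.cumsum(log_returns, axis=1))

paths = simulate_portfolio(value=100_000, mu=0.07, sigma=0.2,
                           days=252, n_paths=10_000)
final = paths[:, -1]
print(f"median outcome: {np.median(final):,.0f}")
print(f"5th percentile (downside): {np.percentile(final, 5):,.0f}")
```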
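A real driving simulator is far beyond a few lines of code, but this toy multi-agent sketch shows the principle: independent agents act in a shared environment, and the rare events that emerge (here, two agents landing on the same grid cell) are logged as edge cases:

```python
import random

class Car:
    """A deliberately simple agent: a random walk on a wrap-around grid."""
    def __init__(self, grid_size: int):
        self.grid_size = grid_size
        self.pos = (random.randrange(grid_size), random.randrange(grid_size))

    def step(self) -> None:
        dx, dy = random.choice([(0, 1), (0, -1), (1, 0), (-1, 0)])
        self.pos = ((self.pos[0] + dx) % self.grid_size,
                    (self.pos[1] + dy) % self.grid_size)

def run(n_cars: int = 20, grid_size: int = 10, steps: int = 1_000,
        seed: int = 1) -> list:
    random.seed(seed)
    cars = [Car(grid_size) for _ in range(n_cars)]
    edge_cases = []
    for t in range(steps):
        for car in cars:
            car.step()
        seen = set()
        for car in cars:
            if car.pos in seen:  # rare "collision" event worth recording
                edge_cases.append((t, car.pos))
            seen.add(car.pos)
    return edge_cases

print(f"rare collision events captured: {len(run())}")
```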
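And here is a deliberately tiny GAN, trained on one-dimensional Gaussian data so it runs in seconds. PyTorch is our assumption (the article names no framework), but the adversarial structure, a generator trying to fool a discriminator, is the same one used at scale:

```python
import torch
import torch.nn as nn

# "Real" data: samples from N(3, 1). The generator must learn to turn
# uniform noise into samples that match this distribution.
def real_batch(n: int) -> torch.Tensor:
    return torch.randn(n, 1) + 3.0

G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))
loss_fn = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)

for step in range(3_000):
    # 1) Train the discriminator to tell real from generated samples.
    real = real_batch(64)
    fake = G(torch.rand(64, 8)).detach()
    d_loss = (loss_fn(D(real), torch.ones(64, 1)) +
              loss_fn(D(fake), torch.zeros(64, 1)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2) Train the generator to fool the discriminator.
    fake = G(torch.rand(64, 8))
    g_loss = loss_fn(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

with torch.no_grad():
    samples = G(torch.rand(1_000, 8))
print(f"synthetic mean={samples.mean().item():.2f}, "
      f"std={samples.std().item():.2f}")  # should approach 3 and 1
```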
Use Cases for Synthetic Data
There are a number of business use cases where one or more of these techniques apply, including:
- Software testing: Synthetic data can help test exceptions in software design and show how a system responds when scaling (a short sketch follows this list).
- User behavior: Private, non-shareable user data can be simulated to build vector-based recommendation systems and test how they respond to scaling.
- Marketing: Multi-agent systems make it possible to simulate individual user behavior and better estimate how marketing campaigns will perform in reaching customers.
- Art: By using GAN neural networks, AI is capable of generating art that is highly appreciated by the collector community.
- Simulate production data: Synthetic data can be used in a production environment for testing purposes, from the resilience of data pipelines to strict policy compliance, and can be modeled to the needs of each individual use case.
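For the software testing case, the open-source Faker library is one common way to generate realistic but entirely fictional records; the field set below is just an example:

```python
from faker import Faker  # pip install faker

fake = Faker()
Faker.seed(0)  # reproducible test data across runs

def make_user_record() -> dict:
    """One synthetic user: plausible-looking but entirely fictional."""
    return {
        "name": fake.name(),
        "email": fake.email(),
        "address": fake.address(),
        "signup": fake.date_time_this_decade().isoformat(),
    }

# A batch of test records: none corresponds to a real person, so it can
# be shared freely between development, QA and staging environments.
records = [make_user_record() for _ in range(1_000)]
print(records[0])
```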
Synthetic Data and Digital Twins
Gaurav Gupta, global head of digital engineering for Stamford, Conn.-based technology research and advisory firm ISG, said that although the use of simulated scenarios is only just getting started, there’s no doubt it’s here to stay, whether that means leveraging purely synthetic datasets or building a value stream-based digital twin: a virtual representation that serves as the real-time digital counterpart of a physical object or process.
In fact, ISG predicts that both synthetic data and digital twins based on real-world data will coexist and complement each other in specific cases. In essence, from an output perspective, both are simulated models, but the input for them is different. For the digital twin, the input is real-world data. For the synthetic model, it’s ML-generated data. Gupta said he believes the infrastructure needs for both will intertwine and aid each other as they evolve.
Complementary use cases can be deployed based on the context, catering to different sets of applications. Where data collection is constrained by cost, logistics or privacy, or where the real data is unpredictable or unavailable, synthetic data is the preferred choice. Digital twins, meanwhile, are best suited to applications that require a closed loop between the product in use and its digital counterpart, such as predictive maintenance. For open-loop development scenarios (e.g., simulating road accidents to test autonomous vehicles), synthetic data alone can suffice.
“The two can and will complement each other,” Gupta said. “Synthetic data can augment digital twin applications, for example where the model needs to be shared with different stakeholders. In such cases, a synthetic data model can act as a proxy for the digital twin models. Conversely, a synthetic data model, by definition, is supposed to be as close to the real-world scenario as possible. Hence, an existing digital twin model can feed into or accelerate the creation of synthetic models.”
Simulating all kinds of scenarios using these techniques will give enterprises greater choice and directly impact how risk is managed across business value streams and processes.
“The more we digitalize, the more secondary data (or synthetic data) will be created and used. Real-world data won’t go away and is still needed, but as ML technologies become cheaper, they will also be applied to synthetic data,” he said.