Synthetic Data: Big Data Industry Predictions for 2024

Our friends over at Mindtech have prepared a special set of compelling technology predictions for the year ahead surrounding the topic of synthetic data. The following Q&A style feature offers predictions from Steve Harris, CEO at Mindtech, the developer of the world’s leading end-to-end “synthetic” data creation platform for the training of AI vision systems. 

How will synthetic data impact AI development and deployment in 2024 and beyond?

The increasing demand for synthetic images and training data in 2024 is driven by restrictions on real-world images. With growing opportunities to monetise images and the decreasing availability of real-world images, there is a significant shift toward considering the lifespan of existing images and how to acquire additional ones.

Synthetic images and scenario videos are a crucial solution to this problem, providing the ability to create images free of rights restrictions and avoiding GDPR/CCPA privacy issues. We expect a steep increase in interest in synthetic images and training data in the coming year.

What are the key challenges and opportunities you see with synthetic data and its impact on AI applications in 2024?

A key challenge is the acceptance of synthetic data, but there has been a shift in the right direction as users witness real results. Synthetic data has improved massively; even in the last 18 months, there have been huge improvements in its realism. Convincing organisations and governments of its validity remains somewhat difficult and requires explaining exactly what it can do, and do well. Part of the difficulty is that some businesses have used older versions of synthetic data and are unaware of its newer capabilities.

Opportunities are widespread, as synthetic data replaces the need for massive amounts of real data while maintaining privacy. The challenge lies in persuading stakeholders to embrace new approaches instead of sticking to the status quo.

Can you provide your thoughts on the potential breakthroughs in synthetic data utilisation that we can expect to see in 2024?

Potential breakthroughs in the usage of synthetic data are expected to be widespread, driven not only by increased demand but also by government legislation such as the EU AI Act.

Government regulations may force a change toward alternative data sources. The continuous improvement in the quality of synthetic images is also a significant factor, with synthetic data becoming increasingly realistic over the years. I believe the current pace of development suggests that by 2024, some synthetic data may be indistinguishable from real-world images.

How will the adoption of synthetic data impact data privacy and security concerns in the new world of AI, and what solutions might emerge?

The adoption of synthetic data addresses concerns related to the rights and privacy of real-world data. Collecting real-world data poses challenges, especially when filming in public spaces, which requires model releases and approvals.

Legislative processes, like the EU AI Act or President Biden’s Executive Order, further complicate real-world data collection. Synthetic data offers a solution by being inherently privacy-compliant, enabling rapid and cost-effective data generation. Additionally, it plays a crucial role in testing models, especially for tasks like ID verification, where synthetic data allows testing against false information.

What industries or sectors do you think will benefit the most from the use of synthetic data in their AI initiatives in the coming year?

Industries that rely on foundation models, such as the models behind ChatGPT, will benefit significantly from synthetic data. With legal battles affecting the availability of real-world data, synthetic data becomes a powerful tool for tuning models for specific marketplaces.

Sectors such as Smart City initiatives face challenges in obtaining diverse and specific data, making synthetic data invaluable. There’s a significant demand for smart spaces and an emerging interest in dangerous use cases, such as identifying people floating in the water.

The automotive industry will benefit significantly, particularly when it comes to safe testing. This is where AI can help massively with scenarios that previously could not be tested in a controlled environment.

When will real and synthetic data be indistinguishable?

In specific use cases, like 2D faces for ID verification, real and synthetic data are already indistinguishable. I believe that general photo-realism will be achieved by 2025, with certain use cases achieving indistinguishability in 2024.

Animation, which involves predicting human actions, might take until 2025 due to the need for a more photorealistic environment. While individual items can already be indistinguishable, achieving overall scene complexity, where multiple synthetic visualisations are at play, may take another two to three years.

Will synthetic data and generative data get closer or further diverge?

Synthetic data and generative data are expected to develop different use cases. While generative data might not stand alone for training AI networks, it can still play a role in specific scenarios. Synthetic data and generative data may intertwine and cross over in certain situations, such as using synthetic data to train generative models or incorporating generative data as part of synthetic data. However, they are likely to remain distinct, each with unique benefits for specific use cases.
