Synthetic Data: The New Frontier of AI Training in the Post-Real Data Era
S.C.G.A. Team
April 9, 2026
Synthetic data is revolutionizing AI training by enabling privacy-preserving machine learning without exposing real data.
The Data Dilemma: Critical Challenges in 2026
In 2026, the AI industry faces an unprecedented challenge: the supply of usable real-world training data is shrinking at an alarming rate.
As global privacy regulations tighten, from the EU's GDPR to the CCPA and similar laws in various US states to Hong Kong's Personal Data (Privacy) Ordinance, businesses face increasingly strict restrictions on collecting and using real data. At the same time, data labeling costs continue to soar, especially for high-quality medical images, financial transaction records, and customer behavior data, all of which require specialized knowledge and significant human effort.
According to industry forecasts, by 2030, the proportion of synthetic data used in AI training is expected to grow from the current 5% to over 90%. This trend reflects a fundamental shift across the entire industry from “training AI with real data” to “training AI with synthetic data”.
Against this backdrop, businesses and developers are turning to a powerful alternative: Synthetic Data.
What is Synthetic Data?
Synthetic data is information generated artificially by AI models, simulation algorithms, or statistical methods. Well-constructed synthetic data closely reproduces the statistical properties, distribution patterns, and complex relationships of real-world data, enabling machine learning training without exposing any sensitive information.
Unlike real data, synthetic data is “created from scratch”—it doesn’t come from actual observations or measurements, but is generated through algorithms and models. This means that even if synthetic data looks identical to real data, it contains no identifiable personal information or commercially sensitive information.
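To make this concrete, here is a minimal sketch of the simplest statistical approach: fit a distribution to a (hypothetical) real table and sample new rows from it. Real-world generators are far more sophisticated, but the principle is the same; the numbers and column meanings below are purely illustrative.

```python
import numpy as np

# Hypothetical "real" data: 1,000 rows of two correlated numeric columns
# (say, income and monthly spending). In practice this would be a sensitive table.
rng = np.random.default_rng(42)
real = rng.multivariate_normal(mean=[50_000, 2_000],
                               cov=[[1e8, 5e5], [5e5, 1e5]],
                               size=1_000)

# Fit the simplest possible generative model: a multivariate normal
# with the empirical mean and covariance of the real data.
mu = real.mean(axis=0)
sigma = np.cov(real, rowvar=False)

# Sample as many synthetic rows as we like; none of them is a real record.
synthetic = rng.multivariate_normal(mu, sigma, size=10_000)
print(synthetic[:3])
```

Every sampled row is new: it matches the real data's mean and covariance without copying any actual record.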
Core Characteristics of Synthetic Data
Statistical Consistency: High-quality synthetic data retains the statistical characteristics of the original data, including distributions, correlations, and seasonal patterns.
Privacy Security: Because it does not originate from real records, properly generated synthetic data cannot be traced back to any specific individual or organization (though residual risks exist, as discussed later).
Scalability: Arbitrary amounts of synthetic records can be generated on demand, from hundreds to millions.
Flexibility: Specific edge cases or rare scenarios can be deliberately introduced—scenarios that might be difficult to obtain in real data.
Why Is 2026 a Turning Point for Synthetic Data?
Over the past few years, the development of synthetic data has gone through three important stages:
Stage 1 (2018-2022): Technology Exploration Phase. Primarily large technology companies and research institutions exploring the feasibility of synthetic data in laboratory environments.
Stage 2 (2022-2025): Initial Application Phase. Financial institutions and medical institutions began piloting the use of synthetic data for model training, particularly in risk assessment and medical image analysis.
Stage 3 (2026 onwards): Scaled Deployment Phase. With the maturation of generative AI technology and increasing regulatory pressure, synthetic data has shifted from an “option” to a “necessity”.
Core drivers of this transformation include:
Regulatory Pressure: Data protection regulations around the world are becoming increasingly strict, exposing businesses to massive fines and reputational risks.
Cost Considerations: The costs of collecting, labeling, and storing real data continue to rise, while the marginal cost of synthetic data approaches zero.
Technological Maturity: Advances in technologies such as Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and Large Language Models (LLMs) have made generating high-quality synthetic data easier and more affordable.
Key Advantages of Synthetic Data
1. Revolutionary Breakthrough in Privacy Protection
In traditional AI training workflows, data privacy and model performance are often in tension: more data improves model accuracy, but collecting it increases privacy risk. Synthetic data changes the terms of this dilemma.
GDPR Compliance: The EU's General Data Protection Regulation strictly restricts the use of personal data. Synthetic data that contains no information about identifiable individuals can fall outside its scope, because it is not personal data in the regulation's sense.
Medical Data Applications: In healthcare, patient privacy is paramount. Using synthetic data, medical institutions can conduct federated learning and cross-institutional model training without sharing any patient information.
Financial Data Sharing: Banks and financial institutions can share synthetic data to collaboratively train fraud detection models without exposing their respective customer transaction records.
2. Unlimited Scale of Data Generation
Real data collection is often limited by time, geography, and resources. Synthetic data can break through these limitations:
On-Demand Generation: Millions of records can be generated in minutes, whereas collecting the same amount of real data might take months or even years.
Rapid Iteration: AI teams can quickly generate new datasets to test different hypotheses, accelerating the model development cycle.
Cost Efficiency: Although establishing a synthetic data generation system requires upfront investment, its long-term marginal cost is extremely low.
3. Labels That Are Correct by Construction
Labeling real data is error-prone, especially for tasks involving subjective judgment or specialized knowledge. Because synthetic data is generated from a known specification, its labels can be correct by construction:
Automated Labeling: Labels are completed simultaneously during data generation, requiring no human intervention.
Eliminating Ambiguity: Each record’s labels can be explicitly defined, ensuring models learn correct patterns.
Reducing Bias: Real data often contains various historical biases; synthetic data can be deliberately designed to balance or eliminate these biases.
4. Complete Coverage of Edge Cases
In the real world, data on rare events—such as bank fraud, medical complications, or traffic accidents—is often extremely scarce, making it difficult for AI models to learn how to handle these critical scenarios. Synthetic data can:
Deliberately Generate Rare Scenarios: Simulate situations that rarely occur in the real world, such as floods that happen once every 50 years or financial crises occurring once a century.
Test Extreme Conditions: Test autonomous driving systems under various extreme weather and road conditions in virtual environments.
Fill Data Gaps: Where data for certain groups or scenarios is insufficient, synthetic data can be used for augmentation, as in the sketch below.
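As one illustration of this kind of augmentation, the following sketch interpolates between real minority-class samples to create new ones, a simplified, neighbour-free variant of SMOTE. The "fraud" array and its dimensions are hypothetical stand-ins.

```python
import numpy as np

def oversample_minority(X_minority, n_new, rng):
    """Create n_new synthetic minority samples by interpolating between
    randomly chosen pairs of real minority samples (a simplified,
    neighbour-free variant of SMOTE)."""
    i = rng.integers(0, len(X_minority), size=n_new)
    j = rng.integers(0, len(X_minority), size=n_new)
    t = rng.random((n_new, 1))  # interpolation weights in [0, 1]
    return X_minority[i] + t * (X_minority[j] - X_minority[i])

rng = np.random.default_rng(0)
fraud = rng.normal(size=(30, 5))          # hypothetical rare fraud cases, 5 features
synthetic_fraud = oversample_minority(fraud, n_new=300, rng=rng)
print(synthetic_fraud.shape)              # (300, 5)
```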
Hong Kong’s Synthetic Data Application Prospects
As Asia’s leading financial center and technology hub, Hong Kong has unique advantages and tremendous potential in adopting synthetic data technology.
Banking and Financial Services
Hong Kong is home to over 160 licensed banks and numerous fintech companies, all handling vast amounts of sensitive customer data. Applications of synthetic data in the financial sector include:
Fraud Detection Model Training: Banks can generate synthetic data simulating fraudulent transactions to train AI systems without sharing actual customer transaction records. This is particularly valuable for smaller banks that may not have enough historical fraud data to train effective models.
Credit Risk Assessment: Using synthetic data to balance loan application datasets, addressing the imbalance between good and bad customers and improving the model’s ability to identify risk.
Market Simulation: Generating data simulating market volatility to stress test investment portfolios and risk management systems.
The Hong Kong Monetary Authority (HKMA) has been actively promoting the adoption of responsible AI technology in the banking sector in recent years. As a privacy-preserving technology, synthetic data aligns well with this regulatory direction.
Healthcare and Biotechnology
Hong Kong’s healthcare system is renowned for its high efficiency and quality services, and is also an important base for biotechnology research. Applications of synthetic data in healthcare include:
Medical Image Enhancement: For rare diseases, medical images are often scarce because cases are difficult to collect. Synthetic generation and image augmentation techniques can produce additional training samples, improving the accuracy of AI diagnostic systems.
Cross-Institutional Research Cooperation: Hong Kong’s public hospitals and private medical institutions can conduct joint research by sharing synthetic data without worrying about patient privacy issues.
Drug Development: In the drug development process, synthetic data can be used to simulate the effects and side effects of candidate drugs, accelerating the screening process.
The Office of the Privacy Commissioner for Personal Data (PCPD) in Hong Kong has introduced an ethical framework for AI applications in recent years, providing clear guidelines for medical institutions using synthetic data.
Retail and E-commerce
Hong Kong is one of the most active retail markets globally, with rapid e-commerce development. Applications of synthetic data in retail include:
Customer Behavior Modeling: Generate data simulating purchasing behavior of different customer groups to help retailers optimize inventory management and marketing strategies.
Personalized Recommendations: Use synthetic data to train and test recommendation systems, ensuring systems can accurately predict customer preferences without invading privacy.
Price Optimization: Simulate consumer price sensitivity under different market conditions to help develop more precise pricing strategies.
Smart City Development
The Hong Kong SAR government has been vigorously promoting smart city development in recent years, with synthetic data playing an important role:
Traffic Simulation: Generate data simulating Hong Kong’s urban traffic patterns for optimizing traffic light timing and route planning.
Public Service Planning: Simulate public service demands under different demographic and social conditions to support government infrastructure planning.
Environmental Monitoring: Use synthetic data to supplement and validate environmental monitoring data such as air quality and noise levels.
Technical Methods for Generating Synthetic Data
Generative Adversarial Networks (GANs)
GANs are currently one of the most popular synthetic data generation technologies, first proposed by Ian Goodfellow in 2014. The core idea of GANs is to have two neural networks, a Generator and a Discriminator, compete against and learn from each other.
How It Works: The generator is responsible for creating fake data samples, while the discriminator is responsible for judging whether these samples come from the real data distribution. Through repeated adversarial training, the generator ultimately produces synthetic samples highly similar to real data.
Application Scenarios: GANs excel in image, video, and audio synthesis, particularly suitable for generating high-dimensional perceptual data.
Limitations: GAN training is unstable and prone to "mode collapse", where the generator ends up producing only a narrow set of near-identical samples.
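For readers who want to see the adversarial loop itself, below is a minimal PyTorch sketch on 1-D toy data. Network sizes, learning rates, and the target distribution are all illustrative choices, not a production recipe.

```python
import torch
import torch.nn as nn

# Toy "real" distribution: 1-D Gaussian. The generator maps noise to samples;
# the discriminator outputs the probability that a sample is real.
G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(2000):
    real = torch.randn(64, 1) * 2.0 + 5.0   # samples from the "real" distribution
    fake = G(torch.randn(64, 8))

    # Discriminator step: push real toward 1 and fake toward 0.
    d_loss = (bce(D(real), torch.ones(64, 1))
              + bce(D(fake.detach()), torch.zeros(64, 1)))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step: try to fool the discriminator (fake toward 1).
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

print(G(torch.randn(5, 8)).detach())  # synthetic samples, roughly N(5, 2)
```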
Variational Autoencoders (VAEs)
VAEs are another commonly used synthetic data generation technology, especially suitable for scenarios requiring precise control over data distribution.
How It Works: VAEs learn to encode data into a Latent Space, then sample from this space to generate new data points. This approach ensures that generated data has statistical characteristics similar to the original data.
Advantages: VAEs provide more precise control over the data generation process, making them suitable for applications that must preserve specific attributes (such as the distributional shape of certain variables).
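A compact PyTorch sketch of the VAE idea follows: encode, reparameterize, decode, and then generate by sampling the latent prior. Dimensions and training settings are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    """Minimal VAE: encode 4-D rows into a 2-D latent space and decode back."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(4, 16)
        self.mu, self.logvar = nn.Linear(16, 2), nn.Linear(16, 2)
        self.dec = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 4))

    def forward(self, x):
        h = torch.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return self.dec(z), mu, logvar

vae = TinyVAE()
opt = torch.optim.Adam(vae.parameters(), lr=1e-3)
data = torch.randn(256, 4)  # stand-in for a real table

for _ in range(1000):
    recon, mu, logvar = vae(data)
    # ELBO objective: reconstruction error plus KL divergence to the N(0, I) prior.
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    loss = nn.functional.mse_loss(recon, data) + kl
    opt.zero_grad()
    loss.backward()
    opt.step()

# Generating synthetic rows = sampling the latent prior and decoding.
synthetic = vae.dec(torch.randn(1000, 2)).detach()
```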
Large Language Models (LLMs) Generation
With the emergence of large language models such as GPT-4 and Claude, using LLMs to generate text-based synthetic data has become a new trend.
How It Works: Through carefully designed prompts, LLMs are guided to generate text data meeting specific requirements, such as customer service dialogues, product reviews, or medical records.
Advantages: Text data generated by LLMs is natural and fluent, suitable for application scenarios requiring large volumes of text records.
Considerations: Generated data needs careful review to ensure no accidental information leakage or bias amplification.
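In code, LLM-based generation usually amounts to a carefully constrained prompt plus validation of the output. The sketch below uses a hypothetical call_llm helper (wire it to whichever provider SDK you use); the prompt, output schema, and the deliberately simplistic PII check are all illustrative assumptions.

```python
import json

def call_llm(prompt: str) -> str:
    # Hypothetical helper: replace with a call to your LLM provider's SDK.
    raise NotImplementedError("Wire this to your LLM provider of choice.")

PROMPT = """Generate {n} fictional customer-service dialogues for a bank.
Requirements:
- No real names, account numbers, or addresses; invent everything.
- Output a JSON list of objects with keys "customer" and "agent".
"""

def generate_dialogues(n=5):
    raw = call_llm(PROMPT.format(n=n))
    records = json.loads(raw)  # validate the structure before using the data
    # Review step: reject anything resembling leaked PII. A real pipeline
    # would use a proper PII scanner; this email check is only a placeholder.
    return [r for r in records if "@" not in json.dumps(r)]
```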
Simulator Generation
In certain specific domains such as autonomous driving and robotics technology, using professional simulators to generate synthetic data is a more appropriate choice.
How It Works: Simulate real-world physical rules and conditions in a virtual environment to generate sensor data (such as LiDAR and camera images).
Advantages: Environment conditions can be completely controlled, quickly generating training data for various scenarios, including dangerous or costly scenarios.
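A toy example of the simulator approach: a 2-D range sensor facing a flat wall. Because we define the virtual world, the ground-truth label for every beam is known exactly; the geometry and noise model here are deliberately simplistic assumptions.

```python
import numpy as np

def simulate_range_sensor(n_scans, wall_distance=10.0, noise_std=0.05, rng=None):
    """Toy 2-D simulator: a range sensor facing a straight wall.
    The true distance along each beam angle is wall_distance / cos(angle);
    Gaussian noise models sensor error. Perfect labels come for free
    because we defined the world ourselves."""
    rng = rng or np.random.default_rng()
    angles = np.linspace(-np.pi / 4, np.pi / 4, 32)   # 32 beams per scan
    truth = wall_distance / np.cos(angles)            # exact ground-truth distances
    return truth + rng.normal(0.0, noise_std, size=(n_scans, angles.size))

scans = simulate_range_sensor(1000)   # 1,000 labeled synthetic scans, instantly
print(scans.shape)                    # (1000, 32)
```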
Challenges and Risks of Synthetic Data
Despite its many advantages, practitioners need to be aware of the following challenges when applying synthetic data:
Complexity of Quality Assessment
Problem: How to ensure synthetic data truly represents the real data distribution? How to detect anomalies or errors in synthetic data?
Solution: A complete quality assessment framework needs to be established, including statistical similarity metrics, utility testing, and privacy risk assessment. Commonly used evaluation metrics include Kolmogorov-Smirnov tests, Wasserstein distances, and privacy loss metrics.
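Both distributional metrics mentioned above are available in SciPy. The sketch below compares a single real column against its synthetic counterpart; the two Gaussian samples are stand-ins for real and synthetic data.

```python
import numpy as np
from scipy.stats import ks_2samp, wasserstein_distance

rng = np.random.default_rng(1)
real = rng.normal(0.0, 1.0, size=5_000)        # stand-in for a real column
synthetic = rng.normal(0.05, 1.1, size=5_000)  # stand-in for its synthetic twin

# Kolmogorov-Smirnov: a small statistic (large p-value) suggests the two
# samples could come from the same distribution.
ks = ks_2samp(real, synthetic)
print(f"KS statistic={ks.statistic:.3f}, p-value={ks.pvalue:.3f}")

# Wasserstein (earth mover's) distance: 0 means identical distributions.
print(f"Wasserstein distance={wasserstein_distance(real, synthetic):.3f}")
```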
Residual Privacy Risks
Problem: Although synthetic data doesn’t directly contain real records, attackers might infer whether certain individuals were in the original dataset through “Membership Inference Attacks.”
Solution: Adopt differential privacy techniques, introducing controlled noise during the data generation or release process so that the influence of any single individual on the output is mathematically bounded.
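As a minimal illustration of the Laplace mechanism that underlies much of differential privacy, the sketch below releases a noisy count. The sensitivity-1 assumption holds for counting queries; applying DP inside a full generative pipeline is considerably more involved.

```python
import numpy as np

def dp_count(true_count, epsilon, rng=None):
    """Release a count under epsilon-differential privacy via the Laplace
    mechanism. A counting query has sensitivity 1 (adding or removing one
    person changes the count by at most 1), so the noise scale is 1/epsilon."""
    rng = rng or np.random.default_rng()
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Smaller epsilon means more noise and a stronger privacy guarantee.
print(dp_count(1_234, epsilon=1.0))
print(dp_count(1_234, epsilon=0.1))
```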
Vulnerability to Adversarial Examples
Problem: If attackers know the target model was trained using synthetic data, they might design adversarial examples targeting specific patterns in synthetic data.
Solution: Introduce diversity and randomness when generating synthetic data to prevent models from overfitting to specific characteristics of synthetic data.
Legal and Regulatory Gray Areas
Problem: Different regions may have different legal definitions and regulatory requirements for synthetic data, bringing complexity to applications for multinational enterprises.
Solution: Closely monitor regulatory developments in various regions, especially the EU, mainland China, and Hong Kong’s relevant regulations, to ensure the use of synthetic data complies with local legal requirements.
Best Practices for Responsible Use of Synthetic Data
Establish Comprehensive Governance Frameworks
Transparency: Clearly document synthetic data generation methods, source data, and potential limitations.
Audit Trails: Maintain complete data lineage records, ensuring each synthetic record’s generation process can be traced.
Multi-Party Review: Introduce cross-functional teams (including data scientists, legal experts, and business representatives) to assess synthetic data’s applicability and risks.
Adopt Advanced Privacy Protection Technologies
Differential Privacy: Introduce mathematically meaningful privacy protection during data generation or release processes.
Data Minimization: Even when using synthetic data, follow minimization principles, generating and using only the data that is actually needed.
Access Control: Classify synthetic data by sensitivity levels, setting different access permissions based on sensitivity levels.
Continuous Monitoring and Validation
Regular Assessment: Regularly verify the statistical consistency between synthetic data and real data, ensuring quality doesn’t degrade over time.
Performance Monitoring: Track the performance of models trained on synthetic data in real environments, promptly identifying issues.
Feedback Loops: Establish user feedback mechanisms to collect opinions on synthetic data quality and continuously improve generation methods.
Future Development Trends of Synthetic Data
2026-2028: Technology Standardization
Trend: As more enterprises adopt synthetic data technology, the industry will begin forming unified quality standards, assessment methods, and regulatory guidelines.
Opportunity: Early adopters and standards-setting participants will gain competitive advantages.
Challenge: Existing enterprises need to update their data governance frameworks to accommodate new technologies.
2028-2030: Integration with Other Technologies
Trend: Synthetic data will deeply integrate with other privacy protection technologies such as Federated Learning, Differential Privacy, and Trusted Execution Environments (TEEs).
Opportunity: Enterprises can establish a more comprehensive technical system for data privacy and security.
Challenge: Increased system complexity requires more specialized talent and resource investment.
Post-2030: AI-Native Data Strategy
Trend: With the maturation of synthetic data technology, “AI-Native” data strategies will become mainstream, with enterprises designing data architectures with synthetic data as the core from the start.
Opportunity: This will greatly change the enterprise data value chain, creating entirely new business models and services.
Challenge: Higher requirements are imposed on enterprises’ organizational capabilities, technical architecture, and talent reserves.
SCGA and the Future of Synthetic Data
As Hong Kong’s leading AI innovation community, SCGA is committed to promoting the development and application of responsible AI technology. Synthetic data represents a significant turning point in the AI field, providing us with an effective way to balance innovation and privacy protection.
Looking ahead, SCGA will continue to monitor trends in synthetic data technology, providing members and industry with the latest technical information, training resources, and exchange platforms. We believe that through the responsible use of synthetic data, Hong Kong’s AI ecosystem can accelerate innovation while protecting individual privacy and commercial secrets, maintaining a leading position in the global AI competition.
The era of synthetic data has arrived. As AI practitioners, each of us has the responsibility to ensure this technology is used correctly and ethically, creating real value for society.
This article is written by SCGA (Hong Kong AI Innovation Community). Reproduction and sharing are welcome. For inquiries, please contact the SCGA team.
Tags: #SyntheticData #AITraining #DataPrivacy #MachineLearning #SCGA #HongKong #AI2026 #ResponsibleAI #SmartCity #FinTech