Data Science
Data Augmentation
Synthetic Data Generation
Creates artificial data that mimics the statistical properties of real data, for use in training and testing.
Intent & Description
Context
Real data may be limited, private, or imbalanced. Synthetic data generation creates realistic artificial data to augment training sets while preserving privacy.
Real-world Use Case
Data augmentation for limited datasets, privacy-sensitive applications, and testing ML systems with diverse data.
Advantages
- Helps preserve privacy, provided the generator does not reproduce real records
- Can generate as much data as needed
- Balances imbalanced datasets
- Enables rapid prototyping
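As a rough illustration of the "balances imbalanced datasets" point, the sketch below creates extra minority-class rows by interpolating between random pairs of existing minority rows. This is a simplified variant in the spirit of SMOTE (which interpolates toward nearest neighbours rather than random partners); the function name `oversample_minority` and the toy data are assumptions made for this example, not part of the pattern itself.

```python
import numpy as np

def oversample_minority(X, y, minority_label, n_new, seed=0):
    """Append n_new synthetic minority rows built by linear interpolation
    between randomly chosen pairs of existing minority rows."""
    rng = np.random.default_rng(seed)
    X_min = X[y == minority_label]
    # Pick two random minority rows for each new sample
    i = rng.integers(0, len(X_min), size=n_new)
    j = rng.integers(0, len(X_min), size=n_new)
    # One interpolation weight in [0, 1) per new sample
    t = rng.random((n_new, 1))
    X_new = X_min[i] + t * (X_min[j] - X_min[i])
    y_new = np.full(n_new, minority_label, dtype=y.dtype)
    return np.vstack([X, X_new]), np.concatenate([y, y_new])

# Toy imbalanced dataset (hypothetical): 90 rows of class 0, 10 rows of class 1
rng = np.random.default_rng(42)
X_toy = np.vstack([rng.normal(0.0, 1.0, (90, 3)), rng.normal(3.0, 1.0, (10, 3))])
y_toy = np.array([0] * 90 + [1] * 10)

# Add 80 synthetic class-1 rows so both classes have 90 rows
X_bal, y_bal = oversample_minority(X_toy, y_toy, minority_label=1, n_new=80)
print(np.bincount(y_bal))
```

Because every new row is a convex combination of two real minority rows, the synthetic points stay inside the range the minority class already covers; they add volume, not new information.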
Disadvantages
- May not capture all patterns
- Quality validation required
- Generation complexity
- Risk of unrealistic data
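One simple way to approach the "quality validation required" point is to compare per-feature summary statistics of the synthetic data against the real data. The sketch below is a minimal check under stated assumptions: the helper `compare_feature_stats` and the demo arrays are hypothetical, and matching means and standard deviations is only a first sanity check, not proof that the synthetic data is realistic.

```python
import numpy as np

def compare_feature_stats(X_real, X_synth):
    """Return, per feature, the mean gap in units of the real std
    and the ratio of synthetic std to real std.
    Assumes no real feature is constant (real std > 0)."""
    real_mean, synth_mean = X_real.mean(axis=0), X_synth.mean(axis=0)
    real_std, synth_std = X_real.std(axis=0), X_synth.std(axis=0)
    mean_gap = np.abs(real_mean - synth_mean) / real_std
    std_ratio = synth_std / real_std
    return mean_gap, std_ratio

rng = np.random.default_rng(0)
# Stand-in for real data (hypothetical): 4 features centred at 5
X_real_demo = rng.normal(loc=5.0, scale=2.0, size=(2000, 4))
# A synthetic set drawn from the same distribution, and one with shifted means
X_good = rng.normal(loc=5.0, scale=2.0, size=(2000, 4))
X_bad = rng.normal(loc=8.0, scale=2.0, size=(2000, 4))

gap_good, ratio_good = compare_feature_stats(X_real_demo, X_good)
gap_bad, ratio_bad = compare_feature_stats(X_real_demo, X_bad)
print("well-matched set, largest mean gap:", gap_good.max())
print("shifted set, smallest mean gap:", gap_bad.min())
```

A check like this catches gross mismatches only. Correlations between features and downstream model performance on held-out real data are worth checking as well.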
Implementation Example
# Synthetic Data Generation Pattern
import numpy as np
from sklearn.datasets import make_classification

# Generate synthetic classification data
X_synthetic, y_synthetic = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=10,
    n_redundant=5,
    random_state=42,
)

# Combine with real data for training.
# X_real and y_real are assumed to be existing arrays;
# X_real must have 20 columns to match n_features above.
X_combined = np.vstack([X_real, X_synthetic])
y_combined = np.hstack([y_real, y_synthetic])
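Note that `make_classification` draws from its own random process and is not fitted to any real dataset. If the goal is synthetic data that follows the statistics of the real data, one possible approach is to estimate a distribution from the real rows and sample from it. The sketch below fits one multivariate Gaussian per class; the helper `sample_like` and the demo arrays are assumptions made for illustration, and a Gaussian is only a reasonable model when the real features are roughly bell-shaped.

```python
import numpy as np

def sample_like(X_real, y_real, n_per_class, seed=0):
    """Draw n_per_class synthetic rows per label from a Gaussian fitted
    to that label's real rows. Assumes at least two feature columns."""
    rng = np.random.default_rng(seed)
    X_parts, y_parts = [], []
    for label in np.unique(y_real):
        X_cls = X_real[y_real == label]
        mean = X_cls.mean(axis=0)
        cov = np.cov(X_cls, rowvar=False)  # feature-by-feature covariance
        X_parts.append(rng.multivariate_normal(mean, cov, size=n_per_class))
        y_parts.append(np.full(n_per_class, label))
    return np.vstack(X_parts), np.concatenate(y_parts)

# Stand-in for a small real dataset (hypothetical): two classes, two features
rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(0.0, 1.0, (500, 2)), rng.normal(4.0, 1.0, (500, 2))])
y_demo = np.array([0] * 500 + [1] * 500)

X_fake, y_fake = sample_like(X_demo, y_demo, n_per_class=300)
print(X_fake.shape, np.bincount(y_fake))
```

The resulting arrays can be stacked with the real data in the same way as in the example above, since they share the real data's column count by construction.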