Synthetic Data SDK
SDK Documentation | Platform Documentation | Usage Examples
The Synthetic Data SDK is a Python toolkit for high-fidelity, privacy-safe Synthetic Data.
- Client mode connects to a remote MOSTLY AI platform for training & generating synthetic data there.
- Local mode trains and generates synthetic data locally on your own compute resources.
- Generators that were trained locally can easily be imported into a platform for further sharing.
Overview
The SDK allows you to programmatically create, browse and manage 3 key resources:
- Generators - Train a synthetic data generator on your existing tabular or language data assets
- Synthetic Datasets - Use a generator to create as many synthetic samples as you need
- Connectors - Connect to any data source within your organization, for reading and writing data
Intent | Primitive | Documentation |
---|---|---|
Train a Generator on tabular or language data | g = mostly.train(config) | see mostly.train |
Generate any number of synthetic data records | sd = mostly.generate(g, config) | see mostly.generate |
Live probe the generator on demand | df = mostly.probe(g, config) | see mostly.probe |
Connect to any data source within your org | c = mostly.connect(config) | see mostly.connect |
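As a rough illustration of how the primitives in the table chain together, the sketch below wraps train, probe and generate in one function. The helper name `train_probe_generate` and the placeholder names `"demo"` are hypothetical; the client is passed in as an argument, and only calls that appear in this README (`mostly.train`, `mostly.probe`, `mostly.generate`, `sd.data()`) are used.

```python
from typing import Any


def train_probe_generate(mostly: Any, data: Any, probe_size: int = 100, dataset_size: int = 1_000):
    """Chain train -> probe -> generate on an already initialized client.

    `mostly` is assumed to be a MostlyAI client instance; this wrapper is a
    hypothetical convenience and not part of the SDK.
    """
    # minimal single-table configuration; the names are placeholders
    config = {"name": "demo", "tables": [{"name": "demo", "data": data}]}
    # train a generator and block until training has finished
    g = mostly.train(config=config, start=True, wait=True)
    # fetch a small sample on demand
    df_probe = mostly.probe(g, size=probe_size)
    # create a synthetic dataset entity and fetch its data
    sd = mostly.generate(g, size=dataset_size)
    return g, df_probe, sd.data()
```

Passing the client in, rather than constructing it inside the function, keeps the same code usable in both local and client mode.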
Installation
Client mode only
pip install -U mostlyai
Client + Local mode
# for CPU on macOS
pip install -U 'mostlyai[local]'
# for CPU on Linux
# pip install -U 'mostlyai[local]' --extra-index-url https://download.pytorch.org/whl/cpu
# for GPU on Linux
# pip install -U 'mostlyai[local-gpu]'
Optional Connectors
Add any of the following extras for further data connector support: databricks, googlebigquery, hive, mssql, mysql, oracle, postgres, snowflake.
For example:
pip install -U 'mostlyai[local, databricks, snowflake]'
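If the install command is assembled in a script, a small helper can validate the extras against the names listed above and quote the requirement for the shell. This is only a sketch: `pip_install_command` and the two constant names are hypothetical, and the sets simply mirror the extras mentioned in this section.

```python
import shlex

# Extras taken from the install commands and the connector list above.
MODE_EXTRAS = {"local", "local-gpu"}
CONNECTOR_EXTRAS = {
    "databricks", "googlebigquery", "hive", "mssql",
    "mysql", "oracle", "postgres", "snowflake",
}


def pip_install_command(extras=()):
    """Return a shell-safe `pip install` command for mostlyai with the given extras."""
    extras = list(dict.fromkeys(extras))  # drop duplicates, keep order
    unknown = set(extras) - MODE_EXTRAS - CONNECTOR_EXTRAS
    if unknown:
        raise ValueError(f"unknown extras: {sorted(unknown)}")
    requirement = "mostlyai"
    if extras:
        requirement += "[" + ",".join(extras) + "]"
    # shlex.quote adds single quotes when the brackets would otherwise be
    # interpreted by the shell (e.g. as a glob pattern in zsh)
    return "pip install -U " + shlex.quote(requirement)


print(pip_install_command(["local", "databricks", "snowflake"]))
```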
Quick Start 
Generate your first samples based on your own trained generator with a few lines of code. For local mode, initialize the SDK with local=True. For client mode, initialize the SDK with base_url and api_key obtained from your account settings page.
import pandas as pd
from mostlyai.sdk import MostlyAI

# load original data
repo_url = "https://github.com/mostly-ai/public-demo-data/raw/refs/heads/dev"
df_original = pd.read_csv(f"{repo_url}/census/census.csv.gz").sample(n=5_000)

# initialize the SDK in local or client mode
mostly = MostlyAI(local=True)  # local mode
# mostly = MostlyAI(base_url='xxx', api_key='xxx')  # client mode

# train a synthetic data generator
g = mostly.train(
    config={
        "name": "US Census Income",
        "tables": [
            {
                "name": "census",
                "data": df_original,
                "tabular_model_configuration": {  # tabular model configuration (optional)
                    "max_training_time": 1,  # - limit training time (in minutes)
                    # model, max_epochs, ..  # further model configurations (optional)
                    # 'differential_privacy': {  # differential privacy configuration (optional)
                    #     'max_epsilon': 5.0,  # - max epsilon value, used as stopping criterion
                    #     'delta': 1e-5,  # - delta value
                    # }
                },
                # columns, keys, compute, ..  # further table configurations (optional)
            }
        ],
    },
    start=True,  # start training immediately (default: True)
    wait=True,  # wait for completion (default: True)
)
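The optional settings in the config above can also be assembled programmatically. The following is a minimal sketch, assuming only the keys that appear in that config (`tabular_model_configuration`, `max_training_time`, `differential_privacy`, `max_epsilon`, `delta`); the helper `build_train_config` itself is hypothetical and not part of the SDK.

```python
def build_train_config(name, table_name, data, max_training_time=None,
                       max_epsilon=None, delta=1e-5):
    """Build a single-table training config dict (hypothetical helper).

    Optional settings are only included when explicitly requested, so that
    everything else is left to the SDK's defaults.
    """
    model_config = {}
    if max_training_time is not None:
        model_config["max_training_time"] = max_training_time  # in minutes
    if max_epsilon is not None:
        # differential privacy: max epsilon serves as stopping criterion
        model_config["differential_privacy"] = {
            "max_epsilon": max_epsilon,
            "delta": delta,
        }
    table = {"name": table_name, "data": data}
    if model_config:
        table["tabular_model_configuration"] = model_config
    return {"name": name, "tables": [table]}
```

The resulting dict would then be passed on as `mostly.train(config=build_train_config("US Census Income", "census", df_original, max_training_time=1))`.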
Once the generator has been trained, you can use it to generate synthetic data samples, either via probing:
# probe for some representative synthetic samples
df_samples = mostly.probe(g, size=100)
df_samples
or by creating a synthetic dataset entity for larger data volumes:
# generate a large representative synthetic dataset
sd = mostly.generate(g, size=100_000)
df_synthetic = sd.data()
df_synthetic
or by conditionally probing / generating synthetic data:
# create 100 seed records of 24-year-olds with native country Mexico
df_seed = pd.DataFrame({
    'age': [24] * 100,
    'native_country': ['Mexico'] * 100,
})
# conditionally probe, based on provided seed
df_samples = mostly.probe(g, seed=df_seed)
df_samples
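When a seed should cover several segments at once, the columns can be built from a list of (fixed values, count) pairs. The sketch below uses only the standard library; `build_seed_columns` is a hypothetical helper, and the second segment in the example is made up for illustration.

```python
def build_seed_columns(segments):
    """Turn (fixed_values, count) pairs into column-oriented seed data.

    Every segment must fix the same set of columns, so the result can be
    passed straight to pd.DataFrame without introducing missing values.
    """
    segments = list(segments)
    if not segments:
        return {}
    columns = list(segments[0][0])
    seed = {col: [] for col in columns}
    for values, count in segments:
        if set(values) != set(columns):
            raise ValueError("all segments must fix the same columns")
        if count < 0:
            raise ValueError("count must be non-negative")
        for col in columns:
            # repeat the fixed value once per requested record
            seed[col].extend([values[col]] * count)
    return seed


seed = build_seed_columns([
    ({"age": 24, "native_country": "Mexico"}, 100),
    ({"age": 30, "native_country": "Mexico"}, 50),  # illustrative second segment
])
print({col: len(vals) for col, vals in seed.items()})
```

The result would be wrapped as `df_seed = pd.DataFrame(seed)` and passed on via `mostly.probe(g, seed=df_seed)`, as in the snippet above.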
Key Features
- Broad Data Support
  - Mixed-type data (categorical, numerical, geospatial, text, etc.)
  - Single-table, multi-table, and time-series
- Multiple Model Types
  - TabularARGN for SOTA tabular performance
  - Fine-tune HuggingFace-based language models
  - Efficient LSTM for text synthesis from scratch
- Advanced Training Options
  - GPU/CPU support
  - Differential Privacy
  - Progress Monitoring
- Automated Quality Assurance
  - Quality metrics for fidelity and privacy
  - In-depth HTML reports for visual analysis
- Flexible Sampling
  - Up-sample to any data volumes
  - Conditional generation by any columns
  - Re-balance underrepresented segments
  - Context-aware data imputation
  - Statistical fairness controls
  - Rule-adherence via temperature
- Seamless Integration
  - Connect to external data sources (DBs, cloud storages)
  - Fully permissive open-source license
Citation
Please consider citing our project if you find it useful:
@software{mostlyai,
    author = {{MOSTLY AI}},
    title = {{MOSTLY AI SDK}},
    url = {https://github.com/mostly-ai/mostlyai},
    year = {2025}
}