

Synthetic Data SDK ✨


SDK Documentation | Platform Documentation | Usage Examples

The Synthetic Data SDK is a Python toolkit for generating high-fidelity, privacy-safe synthetic data.

  • Client mode connects to a remote MOSTLY AI platform for training & generating synthetic data there.
  • Local mode trains and generates synthetic data locally on your own compute resources.
  • Generators trained locally can easily be imported to a platform for further sharing, as sketched below.
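
For that last workflow, here is a minimal sketch: a generator is trained locally and then moved to a platform instance. The export/import helpers used here (export_to_file, generators.import_from_file) are assumptions about the SDK surface rather than verified calls; check the SDK reference for the exact methods.

import pandas as pd
from mostlyai.sdk import MostlyAI

# train a small generator locally (see the Quick Start below for a full config)
mostly_local = MostlyAI(local=True)
df = pd.DataFrame({
    "age": [25, 32, 41, 57] * 25,
    "income": ["<=50K", ">50K", "<=50K", ">50K"] * 25,
})
g = mostly_local.train(config={"name": "demo", "tables": [{"name": "t", "data": df}]})

# move the generator to a platform instance
# NOTE: both helper names below are assumptions, not confirmed SDK methods
g.export_to_file("demo-generator.zip")
mostly_remote = MostlyAI(base_url="https://app.mostly.ai", api_key="xxx")  # your platform URL and API key
g_remote = mostly_remote.generators.import_from_file("demo-generator.zip")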

Overview

The SDK allows you to programmatically create, browse and manage 3 key resources:

  1. Generators - Train a synthetic data generator on your existing tabular or language data assets
  2. Synthetic Datasets - Use a generator to create any number of synthetic samples for your needs
  3. Connectors - Connect to any data source within your organization, for reading and writing data

Intent | Primitive | Documentation
Train a Generator on tabular or language data | g = mostly.train(config) | see mostly.train
Generate any number of synthetic data records | sd = mostly.generate(g, config) | see mostly.generate
Live probe the generator on demand | df = mostly.probe(g, config) | see mostly.probe
Connect to any data source within your org | c = mostly.connect(config) | see mostly.connect
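
Beyond creating these resources, existing ones can be browsed programmatically. The sketch below assumes resource clients named generators, synthetic_datasets, and connectors on the MostlyAI instance, each with a list() method and objects carrying an id and name; treat these names as assumptions and consult the SDK reference for the exact interface.

from mostlyai.sdk import MostlyAI

mostly = MostlyAI(local=True)  # or client mode with base_url / api_key

# browse existing resources
# NOTE: the resource clients and .list() iterators below are assumptions about the SDK surface
for g in mostly.generators.list():
    print(g.id, g.name)

for sd in mostly.synthetic_datasets.list():
    print(sd.id, sd.name)

for c in mostly.connectors.list():
    print(c.id, c.name)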

Installation

Client mode only

pip install -U mostlyai

Client + Local mode

# for CPU on macOS
pip install -U 'mostlyai[local]'
# for CPU on Linux
pip install -U 'mostlyai[local]' --extra-index-url https://download.pytorch.org/whl/cpu
# for GPU on Linux
pip install -U 'mostlyai[local-gpu]'

Optional Connectors

Add any of the following extras for support of additional data connectors: databricks, googlebigquery, hive, mssql, mysql, oracle, postgres, snowflake.

E.g.

pip install -U 'mostlyai[local, databricks, snowflake]'
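
To illustrate, the snippet below sketches what creating a connector with mostly.connect might look like once the corresponding extra is installed. The config field names (type, config, secrets) and their values are assumptions modeled on the platform's connector schema, not a verified example; see the mostly.connect documentation for the authoritative schema.

from mostlyai.sdk import MostlyAI

mostly = MostlyAI(base_url="https://app.mostly.ai", api_key="xxx")  # client mode

# sketch of a Snowflake connector; field names and values are assumptions
c = mostly.connect(
    config={
        "name": "Analytics Warehouse",
        "type": "SNOWFLAKE",
        "config": {
            "account": "my-account",
            "username": "analyst",
            "warehouse": "COMPUTE_WH",
            "database": "ANALYTICS",
        },
        "secrets": {"password": "***"},
    }
)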

Quick Start

Generate your first synthetic samples from your own trained generator with just a few lines of code. For local mode, initialize the SDK with local=True. For client mode, initialize it with the base_url and api_key obtained from your account settings page.

import pandas as pd
from mostlyai.sdk import MostlyAI

# load original data
repo_url = "https://github.com/mostly-ai/public-demo-data/raw/refs/heads/dev"
df_original = pd.read_csv(f"{repo_url}/census/census.csv.gz").sample(n=5_000)

# initialize the SDK in local or client mode
mostly = MostlyAI(local=True)                       # local mode
# mostly = MostlyAI(base_url='xxx', api_key='xxx')  # client mode

# train a synthetic data generator
g = mostly.train(
    config={
        "name": "US Census Income",
        "tables": [
            {
                "name": "census",
                "data": df_original,
                "tabular_model_configuration": {  # tabular model configuration (optional)
                    "max_training_time": 1,  # - limit training time (in minutes)
                    # model, max_epochs, ...       # further model configurations (optional)
                    # 'differential_privacy': {    # differential privacy configuration (optional)
                    #     'max_epsilon': 5.0,      # - max epsilon value, used as stopping criterion
                    #     'delta': 1e-5,           # - delta value
                    # }
                },
                # columns, keys, compute, ...    # further table configurations (optional)
            }
        ],
    },
    start=True,  # start training immediately (default: True)
    wait=True,  # wait for completion (default: True)
)

Once the generator has been trained, you can use it to generate synthetic data samples. Either via probing:

# probe for some representative synthetic samples
df_samples = mostly.probe(g, size=100)
df_samples

or by creating a synthetic dataset entity for larger data volumes:

# generate a large representative synthetic dataset
sd = mostly.generate(g, size=100_000)
df_synthetic = sd.data()
df_synthetic

or by conditionally probing / generating synthetic data:

# create 100 seed records: 24-year-olds with native country Mexico
df_seed = pd.DataFrame({
    'age': [24] * 100,
    'native_country': ['Mexico'] * 100,
})
# conditionally probe, based on provided seed
df_samples = mostly.probe(g, seed=df_seed)
df_samples
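
As a quick sanity check on the conditional output (assuming the census columns keep their original names, as in the seed above), the seeded columns should come back unchanged while all remaining columns are synthesized to match:

# the seeded columns are passed through unchanged
print(df_samples[["age", "native_country"]].value_counts())
# expected: a single combination (24, "Mexico") with count 100

# the remaining columns vary per record, conditioned on the seed
print(df_samples.drop(columns=["age", "native_country"]).head())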

Key Features

  • Broad Data Support
    • Mixed-type data (categorical, numerical, geospatial, text, etc.)
    • Single-table, multi-table, and time-series
  • Multiple Model Types
    • TabularARGN for SOTA tabular performance
    • Fine-tune HuggingFace-based language models
    • Efficient LSTM for text synthesis from scratch
  • Advanced Training Options
    • GPU/CPU support
    • Differential Privacy
    • Progress Monitoring
  • Automated Quality Assurance
    • Quality metrics for fidelity and privacy
    • In-depth HTML reports for visual analysis
  • Flexible Sampling (see the sketch after this list)
    • Up-sample to any data volume
    • Conditional generation by any columns
    • Re-balance underrepresented segments
    • Context-aware data imputation
    • Statistical fairness controls
    • Rule-adherence via temperature
  • Seamless Integration
    • Connect to external data sources (databases, cloud storage)
    • Fully permissive open-source license
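
The sketch below makes the Flexible Sampling bullets concrete by showing what a generation config with rebalancing, imputation, and a lower sampling temperature could look like. The field names (sample_size, sampling_temperature, rebalancing, imputation) are assumptions modeled on the platform's generation options rather than a verified schema; see mostly.generate for the exact reference.

# sketch only: configuration field names are assumptions, check the mostly.generate reference
sd = mostly.generate(
    g,
    config={
        "tables": [
            {
                "name": "census",
                "configuration": {
                    "sample_size": 50_000,                     # up-sample beyond the original data volume
                    "sampling_temperature": 0.9,               # tighter rule adherence
                    "rebalancing": {                           # re-balance an underrepresented segment
                        "column": "income",
                        "probabilities": {">50K": 0.5},
                    },
                    "imputation": {"columns": ["workclass"]},  # context-aware imputation of missing values
                },
            }
        ],
    },
)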

Citation

Please consider citing our project if you find it useful:

@software{mostlyai,
    author = {{MOSTLY AI}},
    title = {{MOSTLY AI SDK}},
    url = {https://github.com/mostly-ai/mostlyai},
    year = {2025}
}

Download files

Download the file for your platform.

Source Distribution

mostlyai-4.1.3.tar.gz (141.1 kB)

Built Distribution

mostlyai-4.1.3-py3-none-any.whl (206.3 kB)

File details

Details for the file mostlyai-4.1.3.tar.gz.

File metadata

  • Download URL: mostlyai-4.1.3.tar.gz
  • Size: 141.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.3

File hashes

Hashes for mostlyai-4.1.3.tar.gz

  • SHA256: a71c231090089fb71e143a72302d7557f0e6fc0904ecc6d353933b0e8dadb5e8
  • MD5: 580da7149c22b80dc318af2ec1f07be7
  • BLAKE2b-256: 64906ef8e6ffc3888a01d0b9800d7ea55b25d3a0a933b0e4fd806f1a34d01cbb

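To verify a downloaded file against the hashes above, the digest can be recomputed locally with the Python standard library; the sketch below checks the sdist against its listed SHA256 value.

import hashlib

# expected SHA256 for mostlyai-4.1.3.tar.gz, as listed above
expected = "a71c231090089fb71e143a72302d7557f0e6fc0904ecc6d353933b0e8dadb5e8"

with open("mostlyai-4.1.3.tar.gz", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()

assert digest == expected, "hash mismatch - do not install this file"
print("hash OK")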

File details

Details for the file mostlyai-4.1.3-py3-none-any.whl.

File metadata

  • Download URL: mostlyai-4.1.3-py3-none-any.whl
  • Size: 206.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.3

File hashes

Hashes for mostlyai-4.1.3-py3-none-any.whl

  • SHA256: b0e0af96798c7629b91795fc16053fcdfc8e4aa188e4aa00acf233b0dc5f97ee
  • MD5: 39d34c24242000cc811aa7f557dc163a
  • BLAKE2b-256: 901783c1c6da70fa952624595dec5c8fbba0804bb3cae7aaac68efe3a5705ea0

