Skip to main content

Synthetic Data Engine

Project description

Synthetic Data Engine 💎

GitHub Release Documentation stats license PyPI - Python Version

Documentation | Technical Paper | Free Cloud Service

Create high-fidelity privacy-safe synthetic data:

  1. prepare, analyze, and encode original data
  2. train a generative model on the encoded data
  3. generate synthetic data samples to your needs:
    • up-sample / down-sample
    • conditionally generate
    • rebalance categories
    • impute missings
    • incorporate fairness
    • adjust sampling temperature

...all within your safe compute environment, all with a few lines of Python code 💥.

Note: This library is the underlying model engine of the Synthetic Data SDK. Please refer to the latter, for an easy-to-use, higher-level software toolkit.

Installation

The latest release of mostlyai-engine can be installed via pip:

pip install -U mostlyai-engine

or alternatively for a GPU setup (needed for LLM finetuning and inference):

pip install -U 'mostlyai-engine[gpu]'

On Linux, one can explicitly install the CPU-only variant of torch together with mostlyai-engine:

pip install -U torch==2.6.0+cpu torchvision==0.21.0+cpu mostlyai-engine --extra-index-url https://download.pytorch.org/whl/cpu

Quick start

Tabular Model: flat data, without context

from pathlib import Path
import pandas as pd
from mostlyai import engine

# set up workspace and default logging
ws = Path("ws-tabular-flat")
engine.init_logging()

# load original data
url = "https://github.com/mostly-ai/public-demo-data/raw/refs/heads/dev/census"
trn_df = pd.read_csv(f"{url}/census.csv.gz")

# execute the engine steps
engine.split(                         # split data as PQT files for `trn` + `val` to `{ws}/OriginalData/tgt-data`
  workspace_dir=ws,
  tgt_data=trn_df,
  model_type="TABULAR",
)
engine.analyze(workspace_dir=ws)      # generate column-level statistics to `{ws}/ModelData/tgt-stats/stats.json`
engine.encode(workspace_dir=ws)       # encode training data to `{ws}/OriginalData/encoded-data`
engine.train(                         # train model and store to `{ws}/ModelStore/model-data`
    workspace_dir=ws,
    max_training_time=1,              # limit TRAIN to 1 minute for demo purposes
)
engine.generate(workspace_dir=ws)     # use model to generate synthetic samples to `{ws}/SyntheticData`
pd.read_parquet(ws / "SyntheticData") # load synthetic data

Tabular Model: sequential data, with context

from pathlib import Path
import pandas as pd
from mostlyai import engine

engine.init_logging()

# set up workspace and default logging
ws = Path("ws-tabular-sequential")
engine.init_logging()

# load original data
url = "https://github.com/mostly-ai/public-demo-data/raw/refs/heads/dev/baseball"
trn_ctx_df = pd.read_csv(f"{url}/players.csv.gz")  # context data
trn_tgt_df = pd.read_csv(f"{url}/batting.csv.gz")  # target data

# execute the engine steps
engine.split(                         # split data as PQT files for `trn` + `val` to `{ws}/OriginalData/(tgt|ctx)-data`
  workspace_dir=ws,
  tgt_data=trn_tgt_df,
  ctx_data=trn_ctx_df,
  tgt_context_key="players_id",
  ctx_primary_key="id",
  model_type="TABULAR",
)
engine.analyze(workspace_dir=ws)      # generate column-level statistics to `{ws}/ModelStore/(tgt|ctx)-data/stats.json`
engine.encode(workspace_dir=ws)       # encode training data to `{ws}/OriginalData/encoded-data`
engine.train(                         # train model and store to `{ws}/ModelStore/model-data`
    workspace_dir=ws,
    max_training_time=1,              # limit TRAIN to 1 minute for demo purposes
)
engine.generate(workspace_dir=ws)     # use model to generate synthetic samples to `{ws}/SyntheticData`
pd.read_parquet(ws / "SyntheticData") # load synthetic data

Language Model: flat data, without context

from pathlib import Path
import pandas as pd
from mostlyai import engine

# init workspace and logging
ws = Path("ws-language-flat")
engine.init_logging()

# load original data
trn_df = pd.read_parquet("https://github.com/mostly-ai/public-demo-data/raw/refs/heads/dev/headlines/headlines.parquet")
trn_df = trn_df.sample(n=10_000, random_state=42)

# execute the engine steps
engine.split(                         # split data as PQT files for `trn` + `val` to `{ws}/OriginalData/tgt-data`
    workspace_dir=ws,
    tgt_data=trn_df,
    tgt_encoding_types={
        'category': 'LANGUAGE_CATEGORICAL',
        'date': 'LANGUAGE_DATETIME',
        'headline': 'LANGUAGE_TEXT',
    }
)
engine.analyze(workspace_dir=ws)      # generate column-level statistics to `{ws}/ModelStore/tgt-stats/stats.json`
engine.encode(workspace_dir=ws)       # encode training data to `{ws}/OriginalData/encoded-data`
engine.train(                         # train model and store to `{ws}/ModelStore/model-data`
    workspace_dir=ws,
    max_training_time=2,                   # limit TRAIN to 2 minute for demo purposes
    model="MOSTLY_AI/LSTMFromScratch-3m",  # use a light-weight LSTM model, trained from scratch (GPU recommended)
    # model="microsoft/phi-1.5",           # alternatively use a pre-trained HF-hosted LLM model (GPU required)
)
engine.generate(                      # use model to generate synthetic samples to `{ws}/SyntheticData`
    workspace_dir=ws,
    sample_size=10,
)
pd.read_parquet(ws / "SyntheticData") # load synthetic data

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

mostlyai_engine-1.4.1.tar.gz (113.0 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

mostlyai_engine-1.4.1-py3-none-any.whl (152.7 kB view details)

Uploaded Python 3

File details

Details for the file mostlyai_engine-1.4.1.tar.gz.

File metadata

  • Download URL: mostlyai_engine-1.4.1.tar.gz
  • Upload date:
  • Size: 113.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.3

File hashes

Hashes for mostlyai_engine-1.4.1.tar.gz
Algorithm Hash digest
SHA256 2f77e31e64ffca81d30de4b97a30cbd7ca6262400e61c7a797daebef19af0e0a
MD5 5a6edc3ba4aafe3b5142d3b7c8ce5414
BLAKE2b-256 03c00b94965f0061d09045da28b690d32377d91a64bf71714b69591a2d3fc0ae

See more details on using hashes here.

File details

Details for the file mostlyai_engine-1.4.1-py3-none-any.whl.

File metadata

  • Download URL: mostlyai_engine-1.4.1-py3-none-any.whl
  • Upload date:
  • Size: 152.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.3

File hashes

Hashes for mostlyai_engine-1.4.1-py3-none-any.whl
Algorithm Hash digest
SHA256 d8f883a349dad37ee1eff7887b2c8f27b4516ea542d13f975fba69ff92aa831e
MD5 5cf602a3d7da6715555707a71019f115
BLAKE2b-256 de8e3f0ebfb02f97aaeda936dbfb5135e7bfe381bba3c6ec20aaf62d60f8c790

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page