Synthetic Data Engine 💎

Documentation | Technical Paper | Free Cloud Service

Create high-fidelity, privacy-safe synthetic data:

1. prepare, analyze, and encode original data
2. train a generative model on the encoded data
3. generate synthetic data samples to your needs:
   - up-sample / down-sample
   - conditionally generate
   - rebalance categories
   - impute missing values
   - incorporate fairness
   - adjust sampling temperature

...all within your safe compute environment, all with a few lines of Python code 💥.
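
To make the "adjust sampling temperature" option above more concrete, here is a minimal, generic sketch of how a temperature rescales a categorical distribution before sampling. It illustrates the general technique only and is not taken from the engine's code; the helper name `apply_temperature` and the toy probabilities are assumptions made for this example.

```python
import math


def apply_temperature(probs, temperature):
    """Rescale a categorical distribution by a sampling temperature.

    Hypothetical helper for illustration; not part of mostlyai-engine.
    Assumes at least one probability is greater than zero.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # p_i ** (1 / T) computed in log space, then renormalized to sum to 1.
    logits = [math.log(p) / temperature if p > 0 else float("-inf") for p in probs]
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    return [w / total for w in weights]


base = [0.7, 0.2, 0.1]                  # toy category probabilities
cooler = apply_temperature(base, 0.5)   # T < 1 sharpens: the top category gains mass
warmer = apply_temperature(base, 2.0)   # T > 1 flattens: rare categories gain mass
print([round(p, 3) for p in cooler])
print([round(p, 3) for p in warmer])
```

A temperature of 1.0 leaves the distribution unchanged. Lower values give more conservative samples, higher values give more diverse ones.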

Note: This library is the underlying model engine of the Synthetic Data SDK. Refer to the SDK for an easy-to-use, higher-level software toolkit.

Installation

The latest release of mostlyai-engine can be installed via pip:

pip install -U mostlyai-engine

or, for a GPU setup (needed for LLM fine-tuning and inference):

pip install -U 'mostlyai-engine[gpu]'

On Linux, one can explicitly install the CPU-only variant of torch together with mostlyai-engine:

pip install -U torch==2.6.0+cpu torchvision==0.21.0+cpu mostlyai-engine --extra-index-url https://download.pytorch.org/whl/cpu
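
After installing, it can help to confirm which variant actually landed in the environment. The following is a small sketch using only the standard library plus an optional `torch` import. The helper names `installed_version` and `gpu_available` are made up for this example and are not part of mostlyai-engine.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def gpu_available():
    """Report whether torch is importable and sees a CUDA device."""
    try:
        import torch  # only present if one of the install commands succeeded
    except ImportError:
        return False
    return bool(torch.cuda.is_available())


print("mostlyai-engine:", installed_version("mostlyai-engine") or "not installed")
print("CUDA GPU visible:", gpu_available())
```

With the CPU-only torch variant, the second line is expected to report `False` even on a machine that has a GPU.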

Quick start

Tabular Model: flat data, without context

from pathlib import Path
import pandas as pd
from mostlyai import engine

# set up workspace and default logging
ws = Path("ws-tabular-flat")
engine.init_logging()

# load original data
url = "https://github.com/mostly-ai/public-demo-data/raw/refs/heads/dev/census"
trn_df = pd.read_csv(f"{url}/census.csv.gz")

# execute the engine steps
engine.split(                         # split data as PQT files for `trn` + `val` to `{ws}/OriginalData/tgt-data`
    workspace_dir=ws,
    tgt_data=trn_df,
    model_type="TABULAR",
)
engine.analyze(workspace_dir=ws)      # generate column-level statistics to `{ws}/ModelStore/tgt-stats/stats.json`
engine.encode(workspace_dir=ws)       # encode training data to `{ws}/OriginalData/encoded-data`
engine.train(                         # train model and store to `{ws}/ModelStore/model-data`
    workspace_dir=ws,
    max_training_time=1,              # limit TRAIN to 1 minute for demo purposes
)
engine.generate(workspace_dir=ws)     # use model to generate synthetic samples to `{ws}/SyntheticData`
pd.read_parquet(ws / "SyntheticData") # load synthetic data
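
Each step writes its output to a location inside the workspace, so a quick way to see how far a run got is to check which of those locations exist. The sketch below uses the paths named in the quick-start comments, with the statistics file in its `ModelStore/tgt-stats/stats.json` form. The mapping `EXPECTED_ARTIFACTS` and the helper `workspace_status` are hypothetical names for this example, and the actual layout may differ between engine versions.

```python
from pathlib import Path

# Step name -> artifact path relative to the workspace, as named in the
# quick-start comments (an assumption; verify against your engine version).
EXPECTED_ARTIFACTS = {
    "split": "OriginalData/tgt-data",
    "analyze": "ModelStore/tgt-stats/stats.json",
    "encode": "OriginalData/encoded-data",
    "train": "ModelStore/model-data",
    "generate": "SyntheticData",
}


def workspace_status(ws):
    """Return {step: True/False} depending on whether its artifact exists."""
    ws = Path(ws)
    return {step: (ws / rel).exists() for step, rel in EXPECTED_ARTIFACTS.items()}


for step, done in workspace_status("ws-tabular-flat").items():
    print(f"{step:<9} {'ok' if done else 'missing'}")
```

This only checks for existence and says nothing about whether a step completed successfully.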

Tabular Model: sequential data, with context

from pathlib import Path
import pandas as pd
from mostlyai import engine

# set up workspace and default logging
ws = Path("ws-tabular-sequential")
engine.init_logging()

# load original data
url = "https://github.com/mostly-ai/public-demo-data/raw/refs/heads/dev/baseball"
trn_ctx_df = pd.read_csv(f"{url}/players.csv.gz")  # context data
trn_tgt_df = pd.read_csv(f"{url}/batting.csv.gz")  # target data

# execute the engine steps
engine.split(                         # split data as PQT files for `trn` + `val` to `{ws}/OriginalData/(tgt|ctx)-data`
    workspace_dir=ws,
    tgt_data=trn_tgt_df,
    ctx_data=trn_ctx_df,
    tgt_context_key="players_id",
    ctx_primary_key="id",
    model_type="TABULAR",
)
engine.analyze(workspace_dir=ws)      # generate column-level statistics to `{ws}/ModelStore/(tgt|ctx)-stats/stats.json`
engine.encode(workspace_dir=ws)       # encode training data to `{ws}/OriginalData/encoded-data`
engine.train(                         # train model and store to `{ws}/ModelStore/model-data`
    workspace_dir=ws,
    max_training_time=1,              # limit TRAIN to 1 minute for demo purposes
)
engine.generate(workspace_dir=ws)     # use model to generate synthetic samples to `{ws}/SyntheticData`
pd.read_parquet(ws / "SyntheticData") # load synthetic data
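
In the sequential setup, `tgt_context_key` in the target table points at `ctx_primary_key` in the context table. Before splitting, one might check that every target row refers to an existing context row. The sketch below does this on plain Python records. The helper `orphaned_context_keys` and the toy columns `name`, `year`, and `hits` are invented for illustration, and whether the engine itself requires or enforces this is not stated here.

```python
def orphaned_context_keys(ctx_rows, tgt_rows, ctx_primary_key, tgt_context_key):
    """Return the sorted target key values that have no matching context row."""
    known = {row[ctx_primary_key] for row in ctx_rows}
    referenced = {row[tgt_context_key] for row in tgt_rows}
    return sorted(referenced - known)


# Toy stand-ins for the players (context) and batting (target) tables.
players = [
    {"id": 1, "name": "A"},
    {"id": 2, "name": "B"},
]
batting = [
    {"players_id": 1, "year": 2001, "hits": 10},
    {"players_id": 1, "year": 2002, "hits": 12},
    {"players_id": 3, "year": 2001, "hits": 7},   # no player with id 3
]

orphans = orphaned_context_keys(players, batting, "id", "players_id")
print(orphans)  # [3]
```

With the DataFrames from the snippet above, the same check can be written in pandas as `trn_tgt_df[~trn_tgt_df["players_id"].isin(trn_ctx_df["id"])]`.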

Language Model: flat data, without context

from pathlib import Path
import pandas as pd
from mostlyai import engine

# init workspace and logging
ws = Path("ws-language-flat")
engine.init_logging()

# load original data
trn_df = pd.read_parquet("https://github.com/mostly-ai/public-demo-data/raw/refs/heads/dev/headlines/headlines.parquet")
trn_df = trn_df.sample(n=10_000, random_state=42)

# execute the engine steps
engine.split(                         # split data as PQT files for `trn` + `val` to `{ws}/OriginalData/tgt-data`
    workspace_dir=ws,
    tgt_data=trn_df,
    tgt_encoding_types={
        'category': 'LANGUAGE_CATEGORICAL',
        'date': 'LANGUAGE_DATETIME',
        'headline': 'LANGUAGE_TEXT',
    }
)
engine.analyze(workspace_dir=ws)      # generate column-level statistics to `{ws}/ModelStore/tgt-stats/stats.json`
engine.encode(workspace_dir=ws)       # encode training data to `{ws}/OriginalData/encoded-data`
engine.train(                         # train model and store to `{ws}/ModelStore/model-data`
    workspace_dir=ws,
    max_training_time=2,                   # limit TRAIN to 2 minutes for demo purposes
    model="MOSTLY_AI/LSTMFromScratch-3m",  # use a light-weight LSTM model, trained from scratch (GPU recommended)
    # model="microsoft/phi-1.5",           # alternatively use a pre-trained HF-hosted LLM model (GPU required)
)
engine.generate(                      # use model to generate synthetic samples to `{ws}/SyntheticData`
    workspace_dir=ws,
    sample_size=10,
)
pd.read_parquet(ws / "SyntheticData") # load synthetic data

mostlyai-engine 1.3.0
