Synthetic Data Engine 💎
Documentation | Technical Paper | Free Cloud Service
Create high-fidelity privacy-safe synthetic data:
- prepare, analyze, and encode original data
- train a generative model on the encoded data
- generate synthetic data samples tailored to your needs:
  - up-sample / down-sample
  - conditionally generate
  - rebalance categories
  - impute missing values
  - incorporate fairness
  - adjust sampling temperature
...all within your safe compute environment, all with a few lines of Python code 💥.
Note: This library is the underlying model engine of the Synthetic Data SDK. Please refer to the latter for an easy-to-use, higher-level software toolkit.
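To make the "adjust sampling temperature" option more concrete, here is a minimal, generic sketch of what a sampling temperature does to a categorical distribution. It is not taken from the engine's source code; the function name `apply_temperature` and the toy probabilities are made up for illustration.

```python
def apply_temperature(probs, temperature):
    """Rescale a categorical distribution by a sampling temperature.

    Generic illustration only (hypothetical helper, not the engine's code).
    Raising each probability to the power 1/T and renormalizing is
    equivalent to applying softmax to logits divided by T.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [p ** (1.0 / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]


# toy distribution over three categories (made-up numbers)
original = [0.7, 0.2, 0.1]
cooler = apply_temperature(original, 0.5)  # T < 1 sharpens: frequent values dominate more
warmer = apply_temperature(original, 2.0)  # T > 1 flattens: rare values become more likely
print(cooler, warmer)
```

Temperatures below 1 push samples toward the most likely values, while temperatures above 1 increase diversity at the cost of fidelity.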
Installation
The latest release of mostlyai-engine can be installed via pip:
pip install -U mostlyai-engine
or alternatively for a CPU-only setup:
pip install -U 'mostlyai-engine[cpu]' --extra-index-url https://download.pytorch.org/whl/cpu
or alternatively for a GPU setup (needed for LLM finetuning):
pip install -U 'mostlyai-engine[gpu]'
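After installing, a quick sanity check can confirm which release is present and whether a GPU is visible. The sketch below uses only the standard library plus an optional `torch` import; the helper name `describe_install` is hypothetical, and it assumes that "GPU setup" means a CUDA device visible to PyTorch.

```python
from importlib import metadata, util


def describe_install(dist_name="mostlyai-engine"):
    """Report the installed version of a distribution and CUDA visibility.

    Hypothetical helper for illustration; returns None for values that
    cannot be determined in the current environment.
    """
    try:
        version = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        version = None  # distribution is not installed

    cuda_available = None  # unknown unless torch can be imported
    if util.find_spec("torch") is not None:
        try:
            import torch

            cuda_available = torch.cuda.is_available()
        except ImportError:
            pass  # torch is present but not importable here

    return {"version": version, "cuda_available": cuda_available}


print(describe_install())
```

A `cuda_available` value of `False` after a GPU install usually points to a driver or PyTorch build mismatch rather than a problem with the engine itself.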
Quick start
Tabular Model: flat data, without context
from pathlib import Path
import pandas as pd
from mostlyai import engine
# set up workspace and default logging
ws = Path("ws-tabular-flat")
engine.init_logging()
# load original data
url = "https://github.com/mostly-ai/public-demo-data/raw/refs/heads/dev/census"
trn_df = pd.read_csv(f"{url}/census.csv.gz")
# execute the engine steps
engine.split( # split data as PQT files for `trn` + `val` to `{ws}/OriginalData/tgt-data`
    workspace_dir=ws,
    tgt_data=trn_df,
    model_type="TABULAR",
)
engine.analyze(workspace_dir=ws) # generate column-level statistics to `{ws}/ModelStore/tgt-stats/stats.json`
engine.encode(workspace_dir=ws) # encode training data to `{ws}/OriginalData/encoded-data`
engine.train( # train model and store to `{ws}/ModelStore/model-data`
    workspace_dir=ws,
    max_training_time=1, # limit TRAIN to 1 minute for demo purposes
)
engine.generate(workspace_dir=ws) # use model to generate synthetic samples to `{ws}/SyntheticData`
pd.read_parquet(ws / "SyntheticData") # load synthetic data
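Each step writes its output into the workspace, so it can help to inspect what has been produced so far. The following is a small sketch that relies only on the top-level folder names mentioned in the comments above (`OriginalData`, `ModelStore`, `SyntheticData`); the helper `summarize_workspace` is hypothetical and not part of the engine.

```python
from pathlib import Path


def summarize_workspace(ws):
    """Return {folder: (file_count, total_bytes)} for the workspace folders.

    Hypothetical helper; a folder that does not exist yet maps to None,
    which indicates that the corresponding step has not run.
    """
    ws = Path(ws)
    summary = {}
    for name in ("OriginalData", "ModelStore", "SyntheticData"):
        folder = ws / name
        if folder.is_dir():
            files = [p for p in folder.rglob("*") if p.is_file()]
            summary[name] = (len(files), sum(p.stat().st_size for p in files))
        else:
            summary[name] = None
    return summary


print(summarize_workspace("ws-tabular-flat"))
```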
Tabular Model: sequential data, with context
from pathlib import Path
import pandas as pd
from mostlyai import engine
# set up workspace and default logging
ws = Path("ws-tabular-sequential")
engine.init_logging()
# load original data
url = "https://github.com/mostly-ai/public-demo-data/raw/refs/heads/dev/baseball"
trn_ctx_df = pd.read_csv(f"{url}/players.csv.gz") # context data
trn_tgt_df = pd.read_csv(f"{url}/batting.csv.gz") # target data
# execute the engine steps
engine.split( # split data as PQT files for `trn` + `val` to `{ws}/OriginalData/(tgt|ctx)-data`
    workspace_dir=ws,
    tgt_data=trn_tgt_df,
    ctx_data=trn_ctx_df,
    tgt_context_key="players_id",
    ctx_primary_key="id",
    model_type="TABULAR",
)
engine.analyze(workspace_dir=ws) # generate column-level statistics to `{ws}/ModelStore/(tgt|ctx)-stats/stats.json`
engine.encode(workspace_dir=ws) # encode training data to `{ws}/OriginalData/encoded-data`
engine.train( # train model and store to `{ws}/ModelStore/model-data`
    workspace_dir=ws,
    max_training_time=1, # limit TRAIN to 1 minute for demo purposes
)
engine.generate(workspace_dir=ws) # use model to generate synthetic samples to `{ws}/SyntheticData`
pd.read_parquet(ws / "SyntheticData") # load synthetic data
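In the sequential setup, each target row points to one context row via `tgt_context_key` and `ctx_primary_key`. Before splitting, it can be worth checking that this relationship is clean. The sketch below is a plain-Python illustration on made-up rows; the helper `check_context_keys` is hypothetical, and whether the engine itself enforces these conditions is not stated here. Rows of a DataFrame can be obtained with `df.to_dict("records")`.

```python
from collections import Counter


def check_context_keys(ctx_rows, tgt_rows, ctx_primary_key, tgt_context_key):
    """Check the key relationship between context and target records.

    Hypothetical helper: reports duplicated primary keys, target keys
    without a matching context row, and the sequence length per context row.
    """
    ctx_ids = [row[ctx_primary_key] for row in ctx_rows]
    duplicates = sorted(k for k, n in Counter(ctx_ids).items() if n > 1)
    known = set(ctx_ids)
    orphans = sorted({row[tgt_context_key] for row in tgt_rows} - known)
    lengths = Counter(
        row[tgt_context_key] for row in tgt_rows if row[tgt_context_key] in known
    )
    return {
        "duplicate_primary_keys": duplicates,
        "orphan_context_keys": orphans,
        "sequence_lengths": dict(lengths),
    }


# toy records (made up), mirroring the players/batting column names used above
players = [{"id": "a"}, {"id": "b"}]
batting = [
    {"players_id": "a"},
    {"players_id": "a"},
    {"players_id": "b"},
    {"players_id": "z"},  # no matching player
]
report = check_context_keys(players, batting, "id", "players_id")
print(report)
```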
Language Model: flat data, without context
from pathlib import Path
import pandas as pd
from mostlyai import engine
# init workspace and logging
ws = Path("ws-language-flat")
engine.init_logging()
# load original data
trn_df = pd.read_parquet("https://github.com/mostly-ai/public-demo-data/raw/refs/heads/dev/headlines/headlines.parquet")
trn_df = trn_df.sample(n=10_000, random_state=42)
# execute the engine steps
engine.split( # split data as PQT files for `trn` + `val` to `{ws}/OriginalData/tgt-data`
    workspace_dir=ws,
    tgt_data=trn_df,
    tgt_encoding_types={
        'category': 'LANGUAGE_CATEGORICAL',
        'date': 'LANGUAGE_DATETIME',
        'headline': 'LANGUAGE_TEXT',
    }
)
engine.analyze(workspace_dir=ws) # generate column-level statistics to `{ws}/ModelStore/tgt-stats/stats.json`
engine.encode(workspace_dir=ws) # encode training data to `{ws}/OriginalData/encoded-data`
engine.train( # train model and store to `{ws}/ModelStore/model-data`
    workspace_dir=ws,
    max_training_time=2, # limit TRAIN to 2 minutes for demo purposes
    model="MOSTLY_AI/LSTMFromScratch-3m", # use a light-weight LSTM model, trained from scratch (GPU recommended)
    # model="microsoft/phi-1_5", # alternatively use a pre-trained, HF-hosted LLM (GPU required)
)
engine.generate( # use model to generate synthetic samples to `{ws}/SyntheticData`
    workspace_dir=ws,
    sample_size=10,
)
pd.read_parquet(ws / "SyntheticData") # load synthetic data
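Because `tgt_encoding_types` is keyed by column name, a typo in a key is easy to miss. The sketch below compares such a mapping against a list of column names, for example `list(trn_df.columns)`. The helper `check_encoding_types` is hypothetical, and how the engine treats columns without an explicit entry is an open assumption, not something documented here.

```python
def check_encoding_types(columns, encoding_types):
    """Compare an encoding-type mapping against the available columns.

    Hypothetical helper: returns mapping keys that match no column and
    columns that have no explicit encoding type.
    """
    column_set = set(columns)
    unknown = sorted(k for k in encoding_types if k not in column_set)
    unmapped = [c for c in columns if c not in encoding_types]
    return {"unknown_columns": unknown, "unmapped_columns": unmapped}


# column names are illustrative; "headlin" is a deliberate typo
columns = ["category", "date", "headline"]
encoding_types = {
    "category": "LANGUAGE_CATEGORICAL",
    "date": "LANGUAGE_DATETIME",
    "headlin": "LANGUAGE_TEXT",
}
result = check_encoding_types(columns, encoding_types)
print(result)
```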
Download files
File details
Details for the file mostlyai_engine-1.1.7.tar.gz.
File metadata
- Download URL: mostlyai_engine-1.1.7.tar.gz
- Upload date:
- Size: 107.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e64fd24cae69b39e770ad465898978e2b5efd3d89f6a87781906fc088a4c841a |
| MD5 | f6377a415b531c323073cdc566da6922 |
| BLAKE2b-256 | 2eb226237c49b97a02e99f5408c382bc61a8e651432c593968a194b7ecc60418 |
File details
Details for the file mostlyai_engine-1.1.7-py3-none-any.whl.
File metadata
- Download URL: mostlyai_engine-1.1.7-py3-none-any.whl
- Upload date:
- Size: 146.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5b0296031d590b39cb760ceb87726ed0f217b81cd3ea1de032186b6042427891 |
| MD5 | 8f28df21e157e1df4bcf1294a9af6960 |
| BLAKE2b-256 | 70701231470a6ffa3f844459994fdf623396876a304e92939955afa3b0c77cbf |