Synthetic Data Engine 💎
Documentation | Technical Paper | Free Cloud Service
Create high-fidelity privacy-safe synthetic data:
- train a generative model once:
  - train on flat or sequential data
  - control training time & params
  - monitor training progress
  - optionally enable differential privacy
  - optionally provide context data
- generate synthetic data samples to your needs:
  - up-sample / down-sample
  - conditionally generate
  - rebalance categories
  - impute missing values
  - incorporate fairness
  - adjust sampling temperature
  - predict / classify / regress
  - detect outliers / anomalies
  - and more
...all within your own compute environment, all with a few lines of Python code 💥.
Note: Models only need to be trained once and can then be flexibly reused for various downstream tasks — such as regression, classification, imputation, or sampling — without the need for retraining.
Two model classes with these methods are available:
- TabularARGN(): For structured, flat or sequential tabular data.
  - argn.fit(data): Train a TabularARGN model
  - argn.sample(n_samples): Generate samples
  - argn.predict(target, n_draws, agg_fn): Predict a feature
  - argn.predict_proba(target): Estimate probabilities
  - argn.log_prob(data): Compute log likelihood
  - argn.impute(data): Fill missing values
- LanguageModel(): For semi-structured, flat textual tabular data.
  - lm.fit(data): Train a language model
  - lm.sample(n_samples): Generate samples
This library serves as the core model engine for the Synthetic Data SDK. For an easy-to-use, higher-level toolkit, please refer to the SDK.
Installation
It is highly recommended to install the package within a dedicated virtual environment using uv.
The latest release of mostlyai-engine can be installed via uv:
uv pip install -U mostlyai-engine
or alternatively for a GPU setup (needed for LLM finetuning and inference):
uv pip install -U 'mostlyai-engine[gpu]'
On Linux, one can explicitly install the CPU-only variant of PyTorch together with mostlyai-engine:
uv pip install --index-strategy unsafe-first-match -U \
  torch==2.11.0+cpu torchvision==0.26.0+cpu torchaudio==2.11.0+cpu \
  mostlyai-engine \
  --extra-index-url https://download.pytorch.org/whl/cpu
TabularARGN for Flat Data
The TabularARGN class provides a scikit-learn-compatible interface for working with structured tabular data. It can be used for synthetic data generation, classification, regression, and imputation.
Model Training
Load your data and train the model:
import pandas as pd
from sklearn.model_selection import train_test_split
from mostlyai.engine import TabularARGN
# prepare data
data = pd.read_csv("https://github.com/user-attachments/files/23480587/census10k.csv.gz")
data_train, data_test = train_test_split(data, test_size=0.2)
# fit TabularARGN
argn = TabularARGN()
argn.fit(data_train)
Sampling / Synthetic Data Generation
Generate new synthetic samples:
# unconditional sampling
argn.sample(n_samples=1000)
Generate new synthetic samples conditionally:
# prepare seed
seed_data = pd.DataFrame({
    "age": [25, 50],
    "education": ["Bachelors", "HS-grad"],
})
# conditional sampling
argn.sample(seed_data=seed_data)
Imputation / Filling Gaps
Fill in missing values:
# prepare demo data with missings
data_with_missings = data_test.head(300).reset_index(drop=True)
data_with_missings.loc[0:299, "age"] = pd.NA
data_with_missings.loc[0:199, "race"] = pd.NA
data_with_missings.loc[100:299, "income"] = pd.NA
# impute missing values each with a random sample
data_imputed = argn.impute(data_with_missings)
# impute missing values each with their point estimates
data_imputed = argn.impute(data_with_missings, n_draws=100)
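The difference between a single random sample and a point estimate can be sketched in plain Python. This is only an illustration of the general idea of aggregating repeated draws per missing cell; the helper `aggregate_draws`, the draw values, and the mean/mode rule are assumptions for this example and are not taken from the engine's internals.

```python
from collections import Counter
from statistics import mean


def aggregate_draws(draws):
    """Collapse repeated draws for one missing cell into a single value.

    Numeric draws are averaged; anything else falls back to the most
    frequent value. Hypothetical helper, not part of mostlyai-engine.
    """
    is_numeric = all(
        isinstance(d, (int, float)) and not isinstance(d, bool) for d in draws
    )
    if is_numeric:
        return mean(draws)
    return Counter(draws).most_common(1)[0][0]


# hypothetical draws for one missing "age" cell and one missing "race" cell
age_draws = [31, 35, 29, 33, 32]
race_draws = ["White", "Black", "White", "White", "Asian-Pac-Islander"]

age_estimate = aggregate_draws(age_draws)
race_estimate = aggregate_draws(race_draws)
print(age_estimate, race_estimate)
```

With a single draw the imputed value keeps the natural variability of the data; with many aggregated draws it moves toward the most plausible value for that row.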
Predictions / Classification
Predict any categorical target column:
from sklearn.metrics import accuracy_score, roc_auc_score
# predict class labels for a categorical
predictions = argn.predict(data_test, target="income", n_draws=100, agg_fn="mode")
# model-conditional class probabilities (same inputs as predict; target column dropped from seed)
probabilities = argn.predict_proba(data_test, target="income")
# evaluate performance
accuracy = accuracy_score(data_test["income"], predictions["income"])
# AUC: sklearn needs binary 0/1 targets and scores for the "positive" class (here: second category)
pos_label = probabilities.columns[1]
y_true_bin = (data_test["income"] == pos_label).astype(int)
auc = roc_auc_score(y_true_bin, probabilities[pos_label])
print(f"Accuracy: {accuracy:.3f}, AUC: {auc:.3f}")
Predictions / Regression
Predict any numerical target column:
from sklearn.metrics import mean_absolute_error
# predict target values
predictions = argn.predict(data_test, target="age", n_draws=10, agg_fn="mean")
# evaluate performance
mae = mean_absolute_error(data_test["age"], predictions)
print(f"MAE: {mae:.1f} years")
Conditional Probabilities
Assess any marginal conditional probability, for one or more target columns:
# extract class probabilities for a categorical
argn.predict_proba(
    X=pd.DataFrame({
        "age": [25, 30, 35],
        "sex": ["Male", "Female", "Male"],
    }),
    target="income"
)
# extract bin probabilities for a numerical
argn.predict_proba(
    X=pd.DataFrame({
        # "age": [25, 30, 35],
        "sex": ["Male", "Female", "Male"],
        "occupation": ["Craft-repair", "Craft-repair", "Craft-repair"]
    }),
    target="capital_gain"
)
# extract two-way marginals
argn.predict_proba(
    X=data_test[["age", "race"]],
    target=["sex", "income"]
)
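To make the two-way case concrete, the following plain-Python sketch shows how a joint distribution over two targets relates to its one-way marginals and to a conditional. The probability values and the dictionary layout are hypothetical; the actual shape of the object returned by `predict_proba` for multiple targets is not shown in this README and is not assumed here.

```python
# hypothetical joint probabilities P(sex, income | age, race) for a single row
joint = {
    ("Male", "<=50K"): 0.40,
    ("Male", ">50K"): 0.20,
    ("Female", "<=50K"): 0.33,
    ("Female", ">50K"): 0.07,
}


def marginal(joint_probs, axis):
    """Sum the joint over the other target to get a one-way marginal."""
    out = {}
    for key, p in joint_probs.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out


p_sex = marginal(joint, 0)      # P(sex | age, race)
p_income = marginal(joint, 1)   # P(income | age, race)

# conditional P(income | sex="Female", age, race) = joint / marginal
p_income_given_female = {
    income: joint[("Female", income)] / p_sex["Female"] for income in p_income
}
print(p_sex, p_income, p_income_given_female)
```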
Log Probability
Compute log likelihood of observations:
# compute log probability for each observation
log_probs = argn.log_prob(data_test)
# list top 10 outliers
data_test.iloc[log_probs.argsort()[:10]]
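Why low log probability flags outliers can be sketched without the library. The example assumes an autoregressive factorization, in which a row's log likelihood is the sum of per-column conditional log probabilities; the row names and probability values are made up for illustration.

```python
import math

# hypothetical per-column conditional probabilities for three rows
rows = {
    "row_a": [0.50, 0.40, 0.30],
    "row_b": [0.05, 0.10, 0.02],
    "row_c": [0.60, 0.70, 0.20],
}

# row log likelihood = sum of the per-column log probabilities
log_probs = {
    name: sum(math.log(p) for p in probs) for name, probs in rows.items()
}

# lowest log likelihood first, i.e. the least plausible rows under the model
ranked = sorted(log_probs, key=log_probs.get)
print(ranked)
```

A row in which every column value is individually unlikely given the preceding columns accumulates a very negative sum, so it sorts to the front of the list.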
TabularARGN for Sequential Data
For sequential data (e.g., time series or event logs), specify the context key when constructing the model.
Model Training - With Context Data
import pandas as pd
from mostlyai.engine import TabularARGN
# load sequential data
tgt_data = pd.read_csv("https://github.com/user-attachments/files/23480787/batting.csv.gz")
ctx_data = pd.read_csv("https://github.com/user-attachments/files/23480786/players.csv.gz")
# fit TabularARGN with a context key column
argn = TabularARGN(
    tgt_context_key="players_id",
    ctx_primary_key="id",
    ctx_data=ctx_data,
    max_training_time=2,  # 2 minutes
    verbose=0,
)
argn.fit(tgt_data)
Sampling / Synthetic Data Generation
Generate new synthetic samples (using existing context):
argn.sample(n_samples=5)
Generate new synthetic samples conditionally (using custom context and seed):
ctx_data = pd.DataFrame({
    "id": ["Player1", "Player2"],
    "weight": [170, 160],
    "height": [70, 68],
    "bats": ["R", "L"],
    "throws": ["R", "L"],
})
argn.sample(ctx_data=ctx_data)
Basic Usage of LanguageModel
The LanguageModel class provides a scikit-learn-compatible interface for working with semi-structured textual data. It leverages pre-trained language models or trains lightweight LSTM models from scratch to generate synthetic text data.
Note: The default model is MOSTLY_AI/LSTMFromScratch-3m, a lightweight LSTM model trained from scratch (GPU strongly recommended). You can also use pretrained Hugging Face models (model="<hub/repo>"; GPU required). Verified checkpoints include HuggingFaceTB/SmolLM2-135M, HuggingFaceTB/SmolLM3-3B, Qwen/Qwen3-0.6B, and microsoft/phi-4.
Model Training
Load your data and train the model:
import pandas as pd
from mostlyai.engine import LanguageModel
# load data
data = pd.read_csv("https://github.com/user-attachments/files/23486562/airbnb20k.csv.gz")
# fit LanguageModel
lm = LanguageModel(
    model="MOSTLY_AI/LSTMFromScratch-3m",
    tgt_encoding_types={
        "neighbourhood": "LANGUAGE_CATEGORICAL",
        "title": "LANGUAGE_TEXT",
    },
    max_training_time=10,  # 10 minutes
    verbose=1,
)
lm.fit(data)
Sampling / Synthetic Text Generation
Generate new synthetic samples using the trained language model:
# unconditional sampling
lm.sample(
    n_samples=100,
    sampling_temperature=0.8,
)
# prepare seed
seed_data = pd.DataFrame({
    "neighbourhood": ["Westminster", "Hackney"],
})
# conditional sampling with seed values
lm.sample(
    seed_data=seed_data,
    sampling_temperature=0.8,
)
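The effect of a sampling temperature such as 0.8 can be illustrated with the common definition, in which each probability is raised to the power 1/T and the result is renormalized. Whether the engine applies temperature in exactly this way is an assumption; `apply_temperature` and the example probabilities are hypothetical.

```python
import math
import random


def apply_temperature(probs, temperature):
    """Rescale a categorical distribution: p_i ** (1 / T), renormalized.

    Assumes all probabilities are strictly positive.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    logits = [math.log(p) / temperature for p in probs]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    return [w / total for w in weights]


base = [0.6, 0.3, 0.1]                  # hypothetical next-token probabilities
cooler = apply_temperature(base, 0.8)   # T < 1 sharpens the distribution
hotter = apply_temperature(base, 1.5)   # T > 1 flattens the distribution

# draw one token from the sharpened distribution with a fixed seed
rng = random.Random(0)
token = rng.choices(["a", "b", "c"], weights=cooler, k=1)[0]
print(cooler, hotter, token)
```

Values below 1 concentrate probability mass on the most likely tokens, which tends to give more conservative text; values above 1 spread it out and increase diversity.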
Further Examples
Example notebooks demonstrating various use cases are available in the examples directory.