Synthetic Data Engine

Maintainers

michiplatzer mostlyai

Synthetic Data Engine 💎

Documentation | Technical Paper | Free Cloud Service

Create high-fidelity privacy-safe synthetic data:

- train a generative model once:
  - train on flat or sequential data
  - control training time & params
  - monitor training progress
  - optionally enable differential privacy
  - optionally provide context data
- generate synthetic data samples to your needs:
  - up-sample / down-sample
  - conditionally generate
  - rebalance categories
  - impute missing values
  - incorporate fairness
  - adjust sampling temperature
  - predict / classify / regress
  - detect outliers / anomalies
  - and more

...all within your own compute environment, all with a few lines of Python code 💥.

Note: Models need to be trained only once and can then be reused for various downstream tasks, such as regression, classification, imputation, or sampling, without retraining.

Two model classes with these methods are available:

- TabularARGN(): For structured, flat or sequential tabular data.
  - argn.fit(data): Train a TabularARGN model
  - argn.sample(n_samples): Generate samples
  - argn.predict(X, target, n_draws, agg_fn): Predict a feature
  - argn.predict_proba(X, target): Estimate probabilities
  - argn.log_prob(data): Compute log likelihood
  - argn.impute(data): Fill missing values
- LanguageModel(): For semi-structured, flat textual tabular data.
  - lm.fit(data): Train a language model
  - lm.sample(n_samples): Generate samples
This library serves as the core model engine for the Synthetic Data SDK. For an easy-to-use, higher-level toolkit, please refer to the SDK.

Installation

It is highly recommended to install the package within a dedicated virtual environment using uv.

The latest release of mostlyai-engine can be installed via uv:

uv pip install -U mostlyai-engine

or alternatively for a GPU setup (needed for LLM finetuning and inference):

uv pip install -U 'mostlyai-engine[gpu]'

On Linux, one can explicitly install the CPU-only variant of PyTorch together with mostlyai-engine:

uv pip install --index-strategy unsafe-first-match -U \
  torch==2.11.0+cpu torchvision==0.26.0+cpu torchaudio==2.11.0+cpu \
  mostlyai-engine \
  --extra-index-url https://download.pytorch.org/whl/cpu
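
After installing, it can be useful to confirm which release ended up in the active environment. The following is a minimal sketch that relies only on the standard library's importlib.metadata; the helper name installed_version is hypothetical and not part of mostlyai-engine.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str = "mostlyai-engine") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        # The distribution is not installed in the active environment.
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("mostlyai-engine is not installed in this environment")
    else:
        print(f"mostlyai-engine {found}")
```

Running this inside the virtual environment created with uv shows whether the upgrade took effect there rather than in a different interpreter.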

TabularARGN for Flat Data

The TabularARGN class provides a scikit-learn-compatible interface for working with structured tabular data. It can be used for synthetic data generation, classification, regression, and imputation.

Model Training

Load your data and train the model:

import pandas as pd
from sklearn.model_selection import train_test_split
from mostlyai.engine import TabularARGN

# prepare data
data = pd.read_csv("https://github.com/user-attachments/files/23480587/census10k.csv.gz")
data_train, data_test = train_test_split(data, test_size=0.2)

# fit TabularARGN
argn = TabularARGN()
argn.fit(data_train)

Sampling / Synthetic Data Generation

Generate new synthetic samples:

# unconditional sampling
argn.sample(n_samples=1000)

Generate new synthetic samples conditionally:

# prepare seed
seed_data = pd.DataFrame({
    "age": [25, 50],
    "education": ["Bachelors", "HS-grad"]
})

# conditional sampling
argn.sample(seed_data=seed_data)

Imputation / Filling Gaps

Fill in missing values:

# prepare demo data with missings
data_with_missings = data_test.head(300).reset_index(drop=True)
data_with_missings.loc[0:299, "age"] = pd.NA
data_with_missings.loc[0:199, "race"] = pd.NA
data_with_missings.loc[100:299, "income"] = pd.NA

# impute missing values each with a random sample
data_imputed = argn.impute(data_with_missings)

# impute missing values each with their point estimates
data_imputed = argn.impute(data_with_missings, n_draws=100)

Predictions / Classification

Predict any categorical target column:

from sklearn.metrics import accuracy_score, roc_auc_score

# predict class labels for a categorical
predictions = argn.predict(data_test, target="income", n_draws=100, agg_fn="mode")
# model-conditional class probabilities (same inputs as predict; target column dropped from seed)
probabilities = argn.predict_proba(data_test, target="income")

# evaluate performance
accuracy = accuracy_score(data_test["income"], predictions["income"])
# AUC: sklearn needs binary 0/1 targets and scores for the "positive" class (here: second category)
pos_label = probabilities.columns[1]
y_true_bin = (data_test["income"] == pos_label).astype(int)
auc = roc_auc_score(y_true_bin, probabilities[pos_label])
print(f"Accuracy: {accuracy:.3f}, AUC: {auc:.3f}")
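
The AUC step above turns the categorical target into 0/1 labels and scores each row with the probability of the positive class. As an illustration of what that number measures, the sketch below computes the same quantity by hand on toy data: the share of (positive, negative) pairs in which the positive row receives the higher score, with ties counted as half. The labels, scores, and helper names are made up for this example and are independent of the library.

```python
def binarize(labels, pos_label):
    """Map categorical labels to 1 (positive class) or 0 (everything else)."""
    return [1 if label == pos_label else 0 for label in labels]


def roc_auc(y_true_bin, scores):
    """Pairwise AUC: P(score of a positive > score of a negative), ties count 0.5."""
    pos = [s for y, s in zip(y_true_bin, scores) if y == 1]
    neg = [s for y, s in zip(y_true_bin, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy labels and positive-class scores (hypothetical values).
toy_labels = ["high", "low", "low", "high", "low"]
toy_scores = [0.9, 0.2, 0.8, 0.7, 0.1]

toy_auc = roc_auc(binarize(toy_labels, "high"), toy_scores)
print(f"AUC: {toy_auc:.3f}")
```

The quadratic pair loop is fine for a handful of rows; for real evaluation, roc_auc_score as used above is the better choice.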

Predictions / Regression

Predict any numerical target column:

from sklearn.metrics import mean_absolute_error

# predict target values
predictions = argn.predict(data_test, target="age", n_draws=10, agg_fn="mean")

# evaluate performance
mae = mean_absolute_error(data_test["age"], predictions)
print(f"MAE: {mae:.1f} years")
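
The n_draws and agg_fn arguments suggest that several values are drawn per row and then collapsed into a single prediction. The sketch below shows only that aggregation step, under the assumption that "mode" picks the most frequent draw and "mean" averages numeric draws. The helper aggregate_draws and the draw values are hypothetical; this is not the library's implementation.

```python
from collections import Counter
from statistics import fmean


def aggregate_draws(draws, agg_fn):
    """Collapse several sampled values for one row into a single prediction."""
    if not draws:
        raise ValueError("at least one draw is required")
    if agg_fn == "mode":
        # Most frequent value; suits categorical targets.
        return Counter(draws).most_common(1)[0][0]
    if agg_fn == "mean":
        # Arithmetic mean; suits numerical targets.
        return fmean(draws)
    raise ValueError(f"unsupported agg_fn: {agg_fn!r}")


# Hypothetical draws for a single row.
income_draws = ["low", "high", "low", "low", "high"]
age_draws = [31, 35, 33, 37]

print(aggregate_draws(income_draws, "mode"))
print(aggregate_draws(age_draws, "mean"))
```

More draws generally make the aggregated value more stable at the cost of proportionally more sampling time, which is one way to read the different n_draws values used in the classification and regression examples.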

Conditional Probabilities

Estimate conditional probabilities for one or more target columns, given values for any subset of the other columns:

# extract class probabilities for a categorical
argn.predict_proba(
    X=pd.DataFrame({
        "age": [25, 30, 35],
        "sex": ["Male", "Female", "Male"],
    }),
    target="income"
)

# extract bin probabilities for a numerical
argn.predict_proba(
    X=pd.DataFrame({
        # "age": [25, 30, 35],
        "sex": ["Male", "Female", "Male"],
        "occupation": ["Craft-repair", "Craft-repair", "Craft-repair"]
    }),
    target="capital_gain"
)

# extract two-way marginals
argn.predict_proba(
    X=data_test[["age", "race"]],
    target=["sex", "income"]
)
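
When two targets are requested, the result covers every combination of their values. One way to reason about such a joint distribution is the chain rule, P(a, b | x) = P(a | x) * P(b | a, x). The sketch below applies it to made-up numbers; the probability tables and the helper joint_probabilities are hypothetical, and treating the library's output this way is an assumption for illustration.

```python
def joint_probabilities(p_first, p_second_given_first):
    """Combine P(a | x) and P(b | a, x) into the joint P(a, b | x)."""
    joint = {}
    for a, p_a in p_first.items():
        for b, p_b in p_second_given_first[a].items():
            joint[(a, b)] = p_a * p_b
    return joint


# Hypothetical conditional probabilities for one input row x.
p_sex = {"Male": 0.6, "Female": 0.4}
p_income_given_sex = {
    "Male": {"low": 0.7, "high": 0.3},
    "Female": {"low": 0.8, "high": 0.2},
}

joint = joint_probabilities(p_sex, p_income_given_sex)
for combo, prob in joint.items():
    print(combo, round(prob, 2))

# Summing over the first target recovers the marginal of the second target.
p_low = sum(prob for (_, income), prob in joint.items() if income == "low")
```

A quick sanity check for any such result is that the probabilities of all combinations add up to 1 for each input row.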

Log Probability

Compute log likelihood of observations:

# compute log probability for each observation
log_probs = argn.log_prob(data_test)

# list the 10 observations with the lowest log probability (the most likely outliers)
data_test.iloc[log_probs.argsort()[:10]]
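
The expression argsort()[:10] selects the positions of the ten smallest log probabilities, that is, the rows the model considers least likely. The same ranking can be sketched in plain Python on made-up values; the helper lowest_log_prob_indices is hypothetical.

```python
import math


def lowest_log_prob_indices(log_probs, k):
    """Return the positions of the k smallest log probabilities, lowest first."""
    order = sorted(range(len(log_probs)), key=lambda i: log_probs[i])
    return order[:k]


# Hypothetical per-row log probabilities.
toy_log_probs = [-12.4, -3.1, -25.7, -4.8, -18.0]

outlier_positions = lowest_log_prob_indices(toy_log_probs, k=2)
print(outlier_positions)

# Log probabilities are on a log scale: a gap of 1.0 is a factor of e in likelihood.
ratio = math.exp(toy_log_probs[1] - toy_log_probs[3])
print(f"row 1 is about {ratio:.1f}x more likely than row 3")
```

Because the values are logarithms, small differences near the top of the ranking can still correspond to large differences in likelihood.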

TabularARGN for Sequential Data

For sequential data (e.g., time series or event logs), specify the context key:

Model Training - With Context Data

import pandas as pd
from mostlyai.engine import TabularARGN

# load sequential data
tgt_data = pd.read_csv("https://github.com/user-attachments/files/23480787/batting.csv.gz")
ctx_data = pd.read_csv("https://github.com/user-attachments/files/23480786/players.csv.gz")

# fit TabularARGN with a context key column
argn = TabularARGN(
    tgt_context_key="players_id",
    ctx_primary_key="id",
    ctx_data=ctx_data,
    max_training_time=2,  # 2 minutes
    verbose=0,
)
argn.fit(tgt_data)
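
The tgt_context_key and ctx_primary_key arguments link each target row to one context row. Before training, it may be worth checking that link. The sketch below does so on plain lists of dicts with made-up rows standing in for the batting and players tables; the helper names and the "year" column are hypothetical, and how the library itself treats target rows without a matching context row is not stated here.

```python
from collections import Counter


def find_orphan_keys(tgt_rows, ctx_rows, tgt_context_key, ctx_primary_key):
    """Return target keys that have no matching row in the context table."""
    ctx_keys = [row[ctx_primary_key] for row in ctx_rows]
    if len(ctx_keys) != len(set(ctx_keys)):
        raise ValueError("context primary key values must be unique")
    tgt_keys = {row[tgt_context_key] for row in tgt_rows}
    return sorted(tgt_keys - set(ctx_keys))


def sequence_lengths(tgt_rows, tgt_context_key):
    """Count how many target rows (sequence steps) belong to each context key."""
    return dict(Counter(row[tgt_context_key] for row in tgt_rows))


# Made-up stand-ins for the context (players) and target (batting) tables.
ctx_rows = [{"id": "p1", "bats": "R"}, {"id": "p2", "bats": "L"}]
tgt_rows = [
    {"players_id": "p1", "year": 2001},
    {"players_id": "p1", "year": 2002},
    {"players_id": "p2", "year": 2001},
    {"players_id": "p3", "year": 2001},  # no matching context row
]

print(find_orphan_keys(tgt_rows, ctx_rows, "players_id", "id"))
print(sequence_lengths(tgt_rows, "players_id"))
```

With pandas, the same checks map onto Series.isin and groupby(...).size().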

Sampling / Synthetic Data Generation

Generate new synthetic samples (using existing context):

argn.sample(n_samples=5)

Generate new synthetic samples conditionally (using custom context data):

ctx_data = pd.DataFrame({
    "id": ["Player1", "Player2"],
    "weight": [170, 160],
    "height": [70, 68],
    "bats": ["R", "L"],
    "throws": ["R", "L"],
})
argn.sample(ctx_data=ctx_data)

Basic Usage of LanguageModel

The LanguageModel class provides a scikit-learn-compatible interface for working with semi-structured textual data. It leverages pre-trained language models or trains lightweight LSTM models from scratch to generate synthetic text data.

Note: The default model is MOSTLY_AI/LSTMFromScratch-3m, a lightweight LSTM model trained from scratch (GPU strongly recommended). You can also use pretrained Hugging Face models (model="<hub/repo>"; GPU required). Verified checkpoints include HuggingFaceTB/SmolLM2-135M, HuggingFaceTB/SmolLM3-3B, Qwen/Qwen3-0.6B, and microsoft/phi-4.

Model Training

Load your data and train the model:

import pandas as pd
from mostlyai.engine import LanguageModel

# load data
data = pd.read_csv("https://github.com/user-attachments/files/23486562/airbnb20k.csv.gz")

# fit LanguageModel
lm = LanguageModel(
    model="MOSTLY_AI/LSTMFromScratch-3m",
    tgt_encoding_types={
        'neighbourhood': 'LANGUAGE_CATEGORICAL',
        'title': 'LANGUAGE_TEXT',
    },
    max_training_time=10,  # 10 minutes
    verbose=1,
)
lm.fit(data)

Sampling / Synthetic Text Generation

Generate new synthetic samples using the trained language model:

# unconditional sampling
lm.sample(
    n_samples=100,
    sampling_temperature=0.8,
)

# prepare seed
seed_data = pd.DataFrame({
    "neighbourhood": ["Westminster", "Hackney"],
})

# conditional sampling with seed values
lm.sample(
    seed_data=seed_data,
    sampling_temperature=0.8,
)
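
The sampling_temperature argument controls how strongly sampling favors likely values. A common formulation, assumed here for illustration and not confirmed as this library's implementation, divides the log probabilities by the temperature before renormalizing: values below 1 concentrate mass on the most likely outcome, values above 1 flatten the distribution. The helper apply_temperature and the example distribution are hypothetical.

```python
import math


def apply_temperature(probs, temperature):
    """Rescale a categorical distribution by a sampling temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if any(p <= 0 for p in probs):
        raise ValueError("this sketch expects strictly positive probabilities")
    logits = [math.log(p) / temperature for p in probs]
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(value - peak) for value in logits]
    total = sum(weights)
    return [w / total for w in weights]


base = [0.7, 0.2, 0.1]
sharper = apply_temperature(base, 0.8)  # below 1: more conservative samples
flatter = apply_temperature(base, 1.5)  # above 1: more diverse samples

print([round(p, 3) for p in sharper])
print([round(p, 3) for p in flatter])
```

Under this formulation a temperature of 1.0 leaves the distribution unchanged.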

Further Examples

Example notebooks demonstrating various use cases are available in the examples directory:

TabularARGN for flat tabular data
TabularARGN for sequential data
LanguageModel for textual data

Release history

Version	Release date
2.6.2	May 8, 2026
2.6.1	Apr 30, 2026 (this version)
2.6.0	Apr 29, 2026
2.5.0	Apr 24, 2026
2.4.0	Jan 7, 2026
2.3.3	Dec 6, 2025
2.3.2	Dec 5, 2025
2.3.1	Nov 28, 2025
2.3.0	Nov 28, 2025
2.2.0	Nov 26, 2025
2.1.0	Nov 21, 2025
2.0.1	Nov 17, 2025
2.0.0	Nov 13, 2025
1.7.1	Nov 11, 2025
1.7.0	Nov 3, 2025
1.6.1	Oct 24, 2025
1.6.0	Oct 23, 2025
1.5.8	Oct 3, 2025
1.5.7	Sep 26, 2025
1.5.6	Sep 25, 2025
1.5.5	Sep 18, 2025
1.5.4	Sep 17, 2025
1.5.3	Sep 15, 2025
1.5.2	Sep 5, 2025
1.5.1	Aug 25, 2025
1.5.0	Aug 22, 2025
1.4.8	Jul 8, 2025
1.4.7	Jul 1, 2025
1.4.6	Jun 23, 2025
1.4.5	Jun 20, 2025
1.4.4	Jun 20, 2025
1.4.3	Jun 3, 2025
1.4.2	May 23, 2025
1.4.1	May 22, 2025
1.4.0	May 9, 2025
1.3.3	Apr 27, 2025
1.3.2	Apr 18, 2025
1.3.1	Apr 17, 2025
1.3.0	Apr 14, 2025
1.2.4	Apr 8, 2025
1.2.3	Apr 8, 2025
1.2.2	Apr 8, 2025
1.2.1	Apr 8, 2025
1.2.0	Apr 8, 2025
1.1.12	Apr 4, 2025
1.1.11	Apr 2, 2025
1.1.10	Mar 27, 2025
1.1.9	Mar 27, 2025
1.1.8	Mar 20, 2025
1.1.7	Mar 18, 2025
1.1.6	Mar 13, 2025
1.1.5	Mar 6, 2025
1.1.4	Feb 19, 2025
1.1.3	Feb 19, 2025
1.1.2	Feb 19, 2025
1.1.1	Feb 18, 2025
1.1.0	Feb 18, 2025
1.0.4	Feb 6, 2025
1.0.3	Jan 31, 2025
1.0.2	Jan 27, 2025
1.0.1	Jan 22, 2025
1.0.0	Jan 20, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

mostlyai_engine-2.6.1.tar.gz (140.8 kB)

Uploaded Apr 30, 2026 Source

Built Distribution

mostlyai_engine-2.6.1-py3-none-any.whl (184.8 kB)

Uploaded Apr 30, 2026 Python 3

File details

Details for the file mostlyai_engine-2.6.1.tar.gz.

File metadata

Download URL: mostlyai_engine-2.6.1.tar.gz
Upload date: Apr 30, 2026
Size: 140.8 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for mostlyai_engine-2.6.1.tar.gz
Algorithm	Hash digest
SHA256	`f365f5712ae8198cd33e4cf204f08a702a24d9ef87b5ccc69c994f2313b7ac61`
MD5	`f4add37db3e5409454c760b0b9f63fd7`
BLAKE2b-256	`b0bfc0034fc622d3c829f53fcbe6e513ab0cd1d59105245939d51d8b0621933d`

See more details on using hashes here.
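
The published digests can be used to check a downloaded file before installing it. The following is a minimal sketch using the standard library's hashlib; the helper names and the local file path are hypothetical, and the expected value is the SHA256 digest listed above for mostlyai_engine-2.6.1.tar.gz.

```python
import hashlib
import hmac
from pathlib import Path

# SHA256 digest published for mostlyai_engine-2.6.1.tar.gz
EXPECTED_SHA256 = "f365f5712ae8198cd33e4cf204f08a702a24d9ef87b5ccc69c994f2313b7ac61"


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Return True if the file's SHA256 digest equals the expected hex string."""
    return hmac.compare_digest(sha256_of_file(path), expected_hex.lower())


if __name__ == "__main__":
    archive = Path("mostlyai_engine-2.6.1.tar.gz")  # hypothetical download location
    if archive.exists():
        print("OK" if matches_expected(archive, EXPECTED_SHA256) else "MISMATCH")
    else:
        print(f"{archive} not found")
```

Reading in chunks keeps memory use constant regardless of file size. pip can enforce the same check at install time through its hash-checking mode.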

Provenance

The following attestation bundles were made for mostlyai_engine-2.6.1.tar.gz:

Publisher: release-2-publish.yml on mostly-ai/mostlyai-engine

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: mostlyai_engine-2.6.1.tar.gz
- Subject digest: f365f5712ae8198cd33e4cf204f08a702a24d9ef87b5ccc69c994f2313b7ac61
- Sigstore transparency entry: 1409312679
- Sigstore integration time: Apr 30, 2026
Source repository:
- Permalink: mostly-ai/mostlyai-engine@7527166702b24c0472907e4f257eef84212aa015
- Branch / Tag: refs/heads/main
- Owner: https://github.com/mostly-ai
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release-2-publish.yml@7527166702b24c0472907e4f257eef84212aa015
- Trigger Event: workflow_run

File details

Details for the file mostlyai_engine-2.6.1-py3-none-any.whl.

File metadata

Download URL: mostlyai_engine-2.6.1-py3-none-any.whl
Upload date: Apr 30, 2026
Size: 184.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for mostlyai_engine-2.6.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`63147a5c93e7d96c2b73513b4cc272ad55257106c83c0670bae47bc035572df0`
MD5	`5739ddce9294fdb95e6ea13625ae8f12`
BLAKE2b-256	`d204cb5f72b04842e971691a4686dc0a29fe3aeb917141e75b0864a42c69838c`

See more details on using hashes here.

Provenance

The following attestation bundles were made for mostlyai_engine-2.6.1-py3-none-any.whl:

Publisher: release-2-publish.yml on mostly-ai/mostlyai-engine

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: mostlyai_engine-2.6.1-py3-none-any.whl
- Subject digest: 63147a5c93e7d96c2b73513b4cc272ad55257106c83c0670bae47bc035572df0
- Sigstore transparency entry: 1409312683
- Sigstore integration time: Apr 30, 2026
Source repository:
- Permalink: mostly-ai/mostlyai-engine@7527166702b24c0472907e4f257eef84212aa015
- Branch / Tag: refs/heads/main
- Owner: https://github.com/mostly-ai
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release-2-publish.yml@7527166702b24c0472907e4f257eef84212aa015
- Trigger Event: workflow_run

mostlyai-engine 2.6.1
