Synthetic Data Engine 💎

Documentation | Technical Paper | Free Cloud Service

Create high-fidelity privacy-safe synthetic data:

  1. train a generative model once:
    • train on flat or sequential data
    • control training time & params
    • monitor training progress
    • optionally enable differential privacy
    • optionally provide context data
  2. generate synthetic data samples to your needs:
    • up-sample / down-sample
    • conditionally generate
    • rebalance categories
    • impute missing values
    • incorporate fairness
    • adjust sampling temperature
    • predict / classify / regress
    • detect outliers / anomalies
    • and more

...all within your own compute environment, all with a few lines of Python code 💥.

Note: Models only need to be trained once and can then be flexibly reused for various downstream tasks — such as regression, classification, imputation, or sampling — without the need for retraining.

Two model classes with these methods are available:

  1. TabularARGN(): For structured, flat or sequential tabular data.
    • argn.fit(data): Train a TabularARGN model
    • argn.sample(n_samples): Generate samples
    • argn.predict(target, n_draws, agg_fn): Predict a feature
    • argn.predict_proba(target): Estimate probabilities
    • argn.log_prob(data): Compute log likelihood
    • argn.impute(data): Fill missing values
  2. LanguageModel(): For semi-structured, flat textual tabular data.
    • lm.fit(data): Train a LanguageModel
    • lm.sample(n_samples): Generate samples

This library serves as the core model engine for the Synthetic Data SDK. For an easy-to-use, higher-level toolkit, please refer to the SDK.

Installation

It is highly recommended to install the package within a dedicated virtual environment using uv.

The latest release of mostlyai-engine can be installed via uv:

uv pip install -U mostlyai-engine

or, alternatively, for a GPU setup (needed for LLM fine-tuning and inference):

uv pip install -U 'mostlyai-engine[gpu]'

On Linux, one can explicitly install the CPU-only variant of PyTorch together with mostlyai-engine:

uv pip install --index-strategy unsafe-first-match -U \
  torch==2.11.0+cpu torchvision==0.26.0+cpu torchaudio==2.11.0+cpu \
  mostlyai-engine \
  --extra-index-url https://download.pytorch.org/whl/cpu
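
After installing, a quick check that the distribution is visible to the current interpreter can rule out a wrong virtual environment. The sketch below uses only the Python standard library; the helper name `installed_version` is hypothetical and not part of mostlyai-engine.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str = "mostlyai-engine") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


found = installed_version()
print(f"mostlyai-engine: {found or 'not installed in this environment'}")
```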

TabularARGN for Flat Data

The TabularARGN class provides a scikit-learn-compatible interface for working with structured tabular data. It can be used for synthetic data generation, classification, regression, and imputation.

Model Training

Load your data and train the model:

import pandas as pd
from sklearn.model_selection import train_test_split
from mostlyai.engine import TabularARGN

# prepare data
data = pd.read_csv("https://github.com/user-attachments/files/23480587/census10k.csv.gz")
data_train, data_test = train_test_split(data, test_size=0.2)

# fit TabularARGN
argn = TabularARGN()
argn.fit(data_train)

Sampling / Synthetic Data Generation

Generate new synthetic samples:

# unconditional sampling
argn.sample(n_samples=1000)

Generate new synthetic samples conditionally:

# prepare seed
seed_data = pd.DataFrame({
    "age": [25, 50],
    "education": ["Bachelors", "HS-grad"]
})

# conditional sampling
argn.sample(seed_data=seed_data)

Imputation / Filling Gaps

Fill in missing values:

# prepare demo data with missings
data_with_missings = data_test.head(300).reset_index(drop=True)
data_with_missings.loc[0:299, "age"] = pd.NA
data_with_missings.loc[0:199, "race"] = pd.NA
data_with_missings.loc[100:299, "income"] = pd.NA

# impute each missing value with a single random draw
data_imputed = argn.impute(data_with_missings)

# impute each missing value with a point estimate based on 100 draws
data_imputed = argn.impute(data_with_missings, n_draws=100)

Predictions / Classification

Predict any categorical target column:

from sklearn.metrics import accuracy_score, roc_auc_score

# predict class labels for a categorical
predictions = argn.predict(data_test, target="income", n_draws=100, agg_fn="mode")
# model-conditional class probabilities (same inputs as predict; target column dropped from seed)
probabilities = argn.predict_proba(data_test, target="income")

# evaluate performance
accuracy = accuracy_score(data_test["income"], predictions["income"])
# AUC: sklearn needs binary 0/1 targets and scores for the "positive" class (here: second category)
pos_label = probabilities.columns[1]
y_true_bin = (data_test["income"] == pos_label).astype(int)
auc = roc_auc_score(y_true_bin, probabilities[pos_label])
print(f"Accuracy: {accuracy:.3f}, AUC: {auc:.3f}")

Predictions / Regression

Predict any numerical target column:

from sklearn.metrics import mean_absolute_error

# predict target values
predictions = argn.predict(data_test, target="age", n_draws=10, agg_fn="mean")

# evaluate performance
mae = mean_absolute_error(data_test["age"], predictions)
print(f"MAE: {mae:.1f} years")
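
Both prediction examples pass `n_draws` and `agg_fn`, so several draws per row are reduced to one value. The sketch below illustrates that reduction on made-up draws using plain Python. It is not the library's implementation; it only assumes that "mode" picks the most frequent draw and "mean" averages numeric draws. The helper `aggregate_draws` is hypothetical.

```python
from collections import Counter
from statistics import mean


def aggregate_draws(draws, agg_fn):
    """Collapse several sampled values for one row into a single prediction."""
    if agg_fn == "mode":
        # most frequent value among the draws
        return Counter(draws).most_common(1)[0][0]
    if agg_fn == "mean":
        # arithmetic mean of numeric draws
        return mean(draws)
    raise ValueError(f"unsupported agg_fn: {agg_fn}")


# made-up draws for a single row
income_draws = ["<=50K", ">50K", "<=50K", "<=50K", ">50K"]
age_draws = [31, 35, 29, 33, 32]

print(aggregate_draws(income_draws, "mode"))  # <=50K
print(aggregate_draws(age_draws, "mean"))     # 32
```

More draws generally make the aggregate less noisy, at the cost of more sampling time.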

Conditional Probabilities

Estimate conditional probabilities for one or more target columns, given any subset of the other columns:

# extract class probabilities for a categorical
argn.predict_proba(
    X=pd.DataFrame({
        "age": [25, 30, 35],
        "sex": ["Male", "Female", "Male"],
    }),
    target="income"
)

# extract bin probabilities for a numerical
argn.predict_proba(
    X=pd.DataFrame({
        # "age": [25, 30, 35],
        "sex": ["Male", "Female", "Male"],
        "occupation": ["Craft-repair", "Craft-repair", "Craft-repair"]
    }),
    target="capital_gain"
)

# extract two-way marginals
argn.predict_proba(
    X=data_test[["age", "race"]],
    target=["sex", "income"]
)
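
With two target columns, the probabilities refer to combinations of both targets. One way to reason about such a table is the chain rule, P(sex, income | X) = P(sex | X) · P(income | sex, X). The numbers below are invented for illustration, and the snippet makes no claim about how the engine computes its output.

```python
# made-up conditional distributions for one fixed set of input features X
p_sex = {"Male": 0.6, "Female": 0.4}
p_income_given_sex = {
    "Male": {"<=50K": 0.7, ">50K": 0.3},
    "Female": {"<=50K": 0.85, ">50K": 0.15},
}

# chain rule: multiply P(sex | X) by P(income | sex, X) for every combination
joint = {
    (sex, income): p_sex[sex] * p
    for sex in p_sex
    for income, p in p_income_given_sex[sex].items()
}

for combo, prob in joint.items():
    print(combo, round(prob, 3))
```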

Log Probability

Compute log likelihood of observations:

# compute log probability for each observation
log_probs = argn.log_prob(data_test)

# list the 10 observations with the lowest log probability (outlier candidates)
data_test.iloc[log_probs.argsort()[:10]]
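
To make the ranking idea concrete, the toy example below scores rows under a deliberately naive model in which columns are independent and probabilities are plain category frequencies. This is only an illustration of sorting by log-likelihood, not TabularARGN's model; the rows and the helper `log_prob` are made up.

```python
import math
from collections import Counter

rows = [
    ("Private", "HS-grad"),
    ("Private", "HS-grad"),
    ("Private", "Bachelors"),
    ("Private", "HS-grad"),
    ("Self-emp", "Doctorate"),
]

# estimate per-column category frequencies (naive independence assumption)
n_cols = len(rows[0])
freqs = [Counter(row[i] for row in rows) for i in range(n_cols)]


def log_prob(row):
    """Sum of log relative frequencies across columns."""
    return sum(math.log(freqs[i][value] / len(rows)) for i, value in enumerate(row))


scores = [log_prob(row) for row in rows]

# lowest log-probability first: the least likely rows are outlier candidates
ranked = sorted(range(len(rows)), key=lambda i: scores[i])
print(rows[ranked[0]])  # ('Self-emp', 'Doctorate')
```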

TabularARGN for Sequential Data

For sequential data (e.g., time series or event logs), specify the context key:

Model Training - With Context Data

import pandas as pd
from mostlyai.engine import TabularARGN

# load sequential data
tgt_data = pd.read_csv("https://github.com/user-attachments/files/23480787/batting.csv.gz")
ctx_data = pd.read_csv("https://github.com/user-attachments/files/23480786/players.csv.gz")

# fit TabularARGN with a context key column
argn = TabularARGN(
    tgt_context_key="players_id",
    ctx_primary_key="id",
    ctx_data=ctx_data,
    max_training_time=2,  # 2 minutes
    verbose=0,
)
argn.fit(tgt_data)
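
Before fitting with context data, it can be worth confirming that every `tgt_context_key` value has a matching `ctx_primary_key` row. The source does not say how unmatched keys are handled, so the following is only a sanity check on made-up records in plain Python. Column names mirror the example above; the values and the helper `orphaned_keys` are hypothetical.

```python
def orphaned_keys(tgt_rows, ctx_rows, tgt_context_key, ctx_primary_key):
    """Return target key values that have no matching context row."""
    known = {row[ctx_primary_key] for row in ctx_rows}
    referenced = {row[tgt_context_key] for row in tgt_rows}
    return sorted(referenced - known)


# made-up context (one row per player) and target (many rows per player)
ctx_rows = [{"id": "p1", "bats": "R"}, {"id": "p2", "bats": "L"}]
tgt_rows = [
    {"players_id": "p1", "year": 2001},
    {"players_id": "p1", "year": 2002},
    {"players_id": "p3", "year": 2001},
]

missing = orphaned_keys(tgt_rows, ctx_rows, "players_id", "id")
print(missing)  # ['p3']
```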

Sampling / Synthetic Data Generation

Generate new synthetic samples (using existing context):

argn.sample(n_samples=5)

Generate new synthetic samples conditionally (using custom context data):

ctx_data = pd.DataFrame({
    "id": ["Player1", "Player2"],
    "weight": [170, 160],
    "height": [70, 68],
    "bats": ["R", "L"],
    "throws": ["R", "L"],
})
argn.sample(ctx_data=ctx_data)

Basic Usage of LanguageModel

The LanguageModel class provides a scikit-learn-compatible interface for working with semi-structured textual data. It leverages pre-trained language models or trains lightweight LSTM models from scratch to generate synthetic text data.

Note: The default model is MOSTLY_AI/LSTMFromScratch-3m, a lightweight LSTM model trained from scratch (GPU strongly recommended). You can also use pretrained Hugging Face models (model="<hub/repo>"; GPU required). Verified checkpoints include HuggingFaceTB/SmolLM2-135M, HuggingFaceTB/SmolLM3-3B, Qwen/Qwen3-0.6B, and microsoft/phi-4.

Model Training

Load your data and train the model:

import pandas as pd
from mostlyai.engine import LanguageModel

# load data
data = pd.read_csv("https://github.com/user-attachments/files/23486562/airbnb20k.csv.gz")

# fit LanguageModel
lm = LanguageModel(
    model="MOSTLY_AI/LSTMFromScratch-3m",
    tgt_encoding_types={
        'neighbourhood': 'LANGUAGE_CATEGORICAL',
        'title': 'LANGUAGE_TEXT',
    },
    max_training_time=10,  # 10 minutes
    verbose=1,
)
lm.fit(data)

Sampling / Synthetic Text Generation

Generate new synthetic samples using the trained language model:

# unconditional sampling
lm.sample(
    n_samples=100,
    sampling_temperature=0.8,
)

# prepare seed
seed_data = pd.DataFrame({
    "neighbourhood": ["Westminster", "Hackney"],
})

# conditional sampling with seed values
lm.sample(
    seed_data=seed_data,
    sampling_temperature=0.8,
)
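
The effect of `sampling_temperature` is easiest to see on a toy distribution. The sketch assumes the common definition, in which logits are divided by the temperature before the softmax. The source does not spell out how the engine applies the parameter, so treat this as background rather than a description of the implementation. The helper name and logits are made up.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after scaling them by 1/temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in (0.5, 0.8, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Under this definition, values below 1.0 concentrate probability on the most likely candidates, while values above 1.0 flatten the distribution and increase variety.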

Further Examples

Example notebooks demonstrating various use cases are available in the examples directory:

  • TabularARGN for flat tabular data
  • TabularARGN for sequential data
  • LanguageModel for textual data
