Skip to main content

Synthetic Data Engine

Project description

Synthetic Data Engine 💎

GitHub Release Documentation stats license PyPI - Python Version

Documentation | Technical Paper | Free Cloud Service

Create high-fidelity privacy-safe synthetic data:

  1. train a generative model once:
    • train on flat or sequential data
    • control training time & params
    • monitor training progress
    • optionally enable differential privacy
    • optionally provide context data
  2. generate synthetic data samples to your needs:
    • up-sample / down-sample
    • conditionally generate
    • rebalance categories
    • impute missing values
    • incorporate fairness
    • adjust sampling temperature
    • predict / classify / regress
    • detect outliers / anomalies
    • and more

...all within your own compute environment, all with a few lines of Python code 💥.

Note: Models only need to be trained once and can then be flexibly reused for various downstream tasks — such as regression, classification, imputation, or sampling — without the need for retraining.

Two model classes with these methods are available:

  1. TabularARGN(): For structured, flat or sequential tabular data.
    • argn.fit(data): Train a TabularARGN model
    • argn.sample(n_samples): Generate samples
    • argn.predict(target, n_draws, agg_fn): Predict a feature
    • argn.predict_proba(target, n_draws): Estimate probabilities
    • argn.impute(data): Fill missing values
  2. LanguageModel(): For semi-structured, flat textual tabular data.
    • .fit(data): Train a Language model
    • .sample(n_samples): Generate samples

This library serves as the core model engine for the Synthetic Data SDK. For an easy-to-use, higher-level toolkit, please refer to the SDK.

Installation

It is highly recommended to install the package within a dedicated virtual environment using uv.

The latest release of mostlyai-engine can be installed via uv:

uv pip install -U mostlyai-engine

or alternatively for a GPU setup (needed for LLM finetuning and inference):

uv pip install -U 'mostlyai-engine[gpu]'

On Linux, one can explicitly install the CPU-only variant of torch together with mostlyai-engine:

uv pip install -U torch==2.8.0+cpu torchvision==0.23.0+cpu mostlyai-engine --extra-index-url https://download.pytorch.org/whl/cpu

TabularARGN for Flat Data

The TabularARGN class provides a scikit-learn-compatible interface for working with structured tabular data. It can be used for synthetic data generation, classification, regression, and imputation.

Model Training

Load your data and train the model:

import pandas as pd
from sklearn.model_selection import train_test_split
from mostlyai.engine import TabularARGN

# prepare data
data = pd.read_csv("https://github.com/user-attachments/files/23480587/census10k.csv.gz")
data_train, data_test = train_test_split(data, test_size=0.2)

# fit TabularARGN
argn = TabularARGN()
argn.fit(data_train)

Sampling / Synthetic Data Generation

Generate new synthetic samples:

# unconditional sampling
argn.sample(n_samples=1000)

Generate new synthetic samples conditionally:

# prepare seed
seed_data = pd.DataFrame({
    "age": [25, 50],
    "education": ["Bachelors", "HS-grad"]
})

# conditional sampling
argn.sample(seed_data=seed_data)

Imputation / Filling Gaps

Fill in missing values:

# prepare demo data with missings
data_with_missings = data_test.head(300).reset_index(drop=True)
data_with_missings.loc[0:299, "age"] = pd.NA
data_with_missings.loc[0:199, "race"] = pd.NA
data_with_missings.loc[100:299, "income"] = pd.NA

# impute missing values
argn.impute(data_with_missings)

Predictions / Classification

Predict any categorical target column:

from sklearn.metrics import accuracy_score, roc_auc_score

# predict class labels for a categorical
predictions = argn.predict(data_test, target="income", n_draws=10, agg_fn="mode")

# predict class probabilities for a categorical
probabilities = argn.predict_proba(data_test, target="income", n_draws=10)

# evaluate performance
accuracy = accuracy_score(data_test["income"], predictions)
auc = roc_auc_score(data_test["income"], probabilities[:, 1])
print(f"Accuracy: {accuracy:.3f}, AUC: {auc:.3f}")

Predictions / Regression

Predict any numerical target column:

from sklearn.metrics import mean_absolute_error

# predict target values
predictions = argn.predict(data_test, target="age", n_draws=10, agg_fn="mean")

# evaluate performance
mae = mean_absolute_error(data_test["age"], predictions)
print(f"MAE: {mae:.1f} years")

TabularARGN for Sequential Data

For sequential data (e.g., time series or event logs), specify the context key:

Model Training - With Context Data

import pandas as pd
from mostlyai.engine import TabularARGN

# load sequential data
tgt_data = pd.read_csv("https://github.com/user-attachments/files/23480787/batting.csv.gz")
ctx_data = pd.read_csv("https://github.com/user-attachments/files/23480786/players.csv.gz")

# fit TabularARGN with a context key column
argn = TabularARGN(
    tgt_context_key="players_id",
    ctx_primary_key="id",
    ctx_data=ctx_data,
    max_training_time=2,  # 2 minutes
    verbose=0,
)
argn.fit(tgt_data)

Sampling / Synthetic Data Generation

Generate new synthetic samples (using existing context):

argn.sample(n_samples=5)

Generate new synthetic samples conditionally (using custom context and seed):

ctx_data = pd.DataFrame({
    "id": ["Player1", "Player2"],
    "weight": [170, 160],
    "height": [70, 68],
    "bats": ["R", "L"],
    "throws": ["R", "L"],
})
argn.sample(ctx_data=ctx_data)

Basic Usage of LanguageModel

The LanguageModel class provides a scikit-learn-compatible interface for working with semi-structured textual data. It leverages pre-trained language models or trains lightweight LSTM models from scratch to generate synthetic text data.

Note: The default model is MOSTLY_AI/LSTMFromScratch-3m, a lightweight LSTM model trained from scratch (GPU strongly recommended). You can also use pre-trained HuggingFace models by setting model to e.g. microsoft/phi-1.5 (GPU required).

Model Training

Load your data and train the model:

import pandas as pd
from mostlyai.engine import LanguageModel

# load data
data = pd.read_csv("https://github.com/user-attachments/files/23486562/airbnb20k.csv.gz")

# fit LanguageModel
lm = LanguageModel(
    model="MOSTLY_AI/LSTMFromScratch-3m",
    tgt_encoding_types={
        'neighbourhood': 'LANGUAGE_CATEGORICAL',
        'title': 'LANGUAGE_TEXT',
    },
    max_training_time=10,  # 10 minutes
    verbose=1,
)
lm.fit(data)

Sampling / Synthetic Text Generation

Generate new synthetic samples using the trained language model:

# unconditional sampling
lm.sample(
    n_samples=100,
    sampling_temperature=0.8,
)
# prepare seed
seed_data = pd.DataFrame({
    "neighbourhood": ["Westminster", "Hackney"],
})

# conditional sampling with seed values
lm.sample(
    seed_data=seed_data,
    sampling_temperature=0.8,
)

Further Examples

Example notebooks demonstrating various use cases are available in the examples directory:

  • TabularARGN for flat tabular data Run on Colab
  • TabularARGN for sequential data Run on Colab
  • LanguageModel for textual data Run on Colab

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

mostlyai_engine-2.0.0.tar.gz (129.5 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

mostlyai_engine-2.0.0-py3-none-any.whl (171.0 kB view details)

Uploaded Python 3

File details

Details for the file mostlyai_engine-2.0.0.tar.gz.

File metadata

  • Download URL: mostlyai_engine-2.0.0.tar.gz
  • Upload date:
  • Size: 129.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for mostlyai_engine-2.0.0.tar.gz
Algorithm Hash digest
SHA256 154d322ebb1f9066c9deac97f2e3e44256ff089943f3bce225c4f57244eb9000
MD5 bb0b0a65703f82810e887b7442226014
BLAKE2b-256 00da9a3bd5336eed9ae8081acf1b775c270546df61fc3bfc57a07f3819186733

See more details on using hashes here.

File details

Details for the file mostlyai_engine-2.0.0-py3-none-any.whl.

File metadata

  • Download URL: mostlyai_engine-2.0.0-py3-none-any.whl
  • Upload date:
  • Size: 171.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for mostlyai_engine-2.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 8f6e8d8589d7ae6cfcd60f8bb00566cab4dd202306485a875e16e12bb75ac626
MD5 a3ca619726cc67120ed8b393be08cc8a
BLAKE2b-256 e3e416b881abf1b9d3eb3ef13719b41b17a1d1994546447505ea81584ff23636

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page