Skip to main content

LLM-based synthetic dataset generation

Project description

makeitup

PyPI version

Generate synthetic datasets for ML training using LLM. Describe your columns in plain English and get realistic data back.

from makeitup import make

df = make(
    columns={
        "name": "Person's full name",
        "age": "Age between 25 and 55",
        "email": "Work email address",
    },
    num_rows=100
)

Features

  • Plain English columns - Describe what you want, get realistic data back
  • ML-ready datasets - Add target columns for classification or regression
  • Data quality testing - Inject nulls, outliers, typos, or duplicates to test your pipelines
  • Multiple formats - Export to CSV, JSON, Parquet, or Excel
  • Local model support - Works with OpenAI and any OpenAI-compatible API that supports structured output

Installation

pip install makeitup

Set your OpenAI API key:

export OPENAI_API_KEY=your-api-key

Or create a .env file in your project with OPENAI_API_KEY=your-api-key.

Using a Local Model

makeitup uses structured output to ensure reliable data generation. Local models must support OpenAI-compatible structured output (JSON schema enforcement).

Supported local setups:

  • llama.cpp with function calling enabled (llama-server, LM Studio)
  • vLLM with --enable-auto-tool-choice
  • Ollama (version 0.3.0+) - newer models like llama3.1, qwen2.5
  • Any OpenAI-compatible API that implements structured output

Example configuration:

export LLM_BASE_URL=http://localhost:11434/v1  # Ollama
export LLM_MODEL=llama3.1
export LLM_API_KEY=not-needed  # Required by some local servers

Note: Not all local models support structured output. If you encounter errors, verify your model and server support JSON schema enforcement.

Examples

Basic Data

from makeitup import make

# Customer data
df = make(
    columns={
        "customer_id": "Unique customer identifier",
        "name": "Customer full name",
        "email": "Email address",
        "signup_date": "Date when customer signed up, 2020-2024",
    },
    num_rows=100
)

ML Dataset with Target Column

df = make(
    columns={
        "tenure_months": "Months as customer, 1-60",
        "monthly_spend": "Monthly spending in USD, 10-500",
        "support_tickets": "Number of support tickets, 0-10",
    },
    target={
        "name": "churned",
        "prompt": "Boolean indicating if customer churned"
    },
    num_rows=500
)

Data Quality Degradation

# Generate dataset with intentional quality issues for testing data pipelines
df = make(
    columns={
        "name": "Person's full name",
        "age": "Age between 20 and 60",
        "salary": "Annual salary in USD, 30000-150000",
    },
    num_rows=100,
    quality_issues=["nulls", "outliers"],  # Options: nulls, outliers, typos, duplicates
)

Save to File

# CSV, JSON, Parquet, or Excel - format detected from extension
df = make(
    columns={"name": "Product name", "price": "Price in USD, 10-1000"},
    num_rows=200,
    output_path="products.csv"
)

Output Formats

Format Extension
CSV .csv
JSON .json
Parquet .parquet
Excel .xlsx

Requirements

  • Python >= 3.12
  • OpenAI API key or a local model that supports structured output (see "Using a Local Model" above)

Documentation

See DEVELOPER.md for technical details, API reference, and development setup.

License

See LICENSE file for details.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

makeitup-0.2.0.tar.gz (13.3 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

makeitup-0.2.0-py3-none-any.whl (10.7 kB view details)

Uploaded Python 3

File details

Details for the file makeitup-0.2.0.tar.gz.

File metadata

  • Download URL: makeitup-0.2.0.tar.gz
  • Upload date:
  • Size: 13.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for makeitup-0.2.0.tar.gz
Algorithm Hash digest
SHA256 8601b1ac4f9c5af415942b5b6a2430feb044be3add42f540e52f843659b3675a
MD5 75649171a0596becee0dc5bbf54fd743
BLAKE2b-256 70bd3bf735e1613e7d151a5ba82832f2d903044766bc95f784b01fcc8bc8a7f4

See more details on using hashes here.

Provenance

The following attestation bundles were made for makeitup-0.2.0.tar.gz:

Publisher: publish.yml on tkopczynski/makeitup

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file makeitup-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: makeitup-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 10.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for makeitup-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 5b89f1ac3faf1654b473322d4118650c4b846168ad262f0150280082547acc29
MD5 6bbb66a8dd13220f62bc36594c8bed6a
BLAKE2b-256 50fca8c72f0fc5e55dd7c2919f0527af86ae80f97976cdfdd570e2827a3c1477

See more details on using hashes here.

Provenance

The following attestation bundles were made for makeitup-0.2.0-py3-none-any.whl:

Publisher: publish.yml on tkopczynski/makeitup

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page