LLM-based synthetic dataset generation

makeitup

Generate synthetic datasets for ML training using an LLM. Describe your columns in plain English and get realistic data back.

from makeitup import make

df = make(
    columns={
        "name": "Person's full name",
        "age": "Age between 25 and 55",
        "email": "Work email address",
    },
    num_rows=100
)

Features

  • Plain English columns - Describe what you want, get realistic data back
  • ML-ready datasets - Add target columns for classification or regression
  • Data quality testing - Inject nulls, outliers, typos, or duplicates to test your pipelines
  • Multiple formats - Export to CSV, JSON, Parquet, or Excel
  • Local model support - Works with OpenAI as well as local servers (Ollama, vLLM, LMStudio) and any other OpenAI-compatible API

Installation

pip install makeitup

Set your OpenAI API key:

export OPENAI_API_KEY=your-api-key

Alternatively, create a .env file in your project directory containing the line OPENAI_API_KEY=your-api-key.
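
For readers unfamiliar with .env files, the sketch below shows roughly what loading one amounts to: read KEY=value lines and copy them into the process environment without overriding variables that are already set. The helpers `parse_env_text` and `load_env_file` are hypothetical and not part of makeitup; this README does not say how the library reads the file, so treat this as an illustration only.

```python
import os
from pathlib import Path


def parse_env_text(text: str) -> dict[str, str]:
    """Parse KEY=value lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and one pair of matching quotes.
        value = value.strip()
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        values[key.strip()] = value
    return values


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Hypothetical helper: copy .env entries into os.environ if not already set."""
    env_path = Path(path)
    if not env_path.is_file():
        return {}
    values = parse_env_text(env_path.read_text(encoding="utf-8"))
    for key, value in values.items():
        # setdefault keeps any value already exported in the shell.
        os.environ.setdefault(key, value)
    return values
```

Keeping already-exported variables means a value set with `export` takes priority over the file, which is the usual convention for .env loaders.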

Using a Local Model

To use a locally deployed model (Ollama, vLLM, LMStudio, etc.), set the base URL of the server's OpenAI-compatible endpoint and the model name:

export LLM_BASE_URL=http://localhost:11434/v1
export LLM_MODEL=llama3
export LLM_API_KEY=not-needed  # Required by some local servers
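
As a rough picture of how these three variables could fit together, here is a minimal sketch that collects them into one settings object. The `LLMSettings` class, the `resolve_llm_settings` function, and the fallback from LLM_API_KEY to OPENAI_API_KEY are assumptions made for illustration; makeitup's actual precedence rules are documented elsewhere, if at all.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class LLMSettings:
    base_url: Optional[str]  # None: let the client use its default endpoint
    model: Optional[str]     # None: let the library pick its default model
    api_key: Optional[str]


def resolve_llm_settings(env: Mapping[str, str] = os.environ) -> LLMSettings:
    """Assumed precedence for the key: LLM_API_KEY first, then OPENAI_API_KEY."""
    return LLMSettings(
        # Empty strings are treated the same as unset variables.
        base_url=env.get("LLM_BASE_URL") or None,
        model=env.get("LLM_MODEL") or None,
        api_key=env.get("LLM_API_KEY") or env.get("OPENAI_API_KEY") or None,
    )
```

Passing a plain dict instead of `os.environ` makes it easy to check a configuration before exporting it in the shell.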

Examples

Basic Data

from makeitup import make

# Customer data
df = make(
    columns={
        "customer_id": "Unique customer identifier",
        "name": "Customer full name",
        "email": "Email address",
        "signup_date": "Date when customer signed up, 2020-2024",
    },
    num_rows=100
)
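
Because the values come from an LLM, a description such as "Unique customer identifier" is a request rather than a guarantee, so a quick sanity check on the result can be worthwhile. The sketch below works on a list of row dicts; `find_row_problems` is a hypothetical helper, not part of makeitup. It assumes `df` is a pandas DataFrame (this README does not state the return type), in which case `df.to_dict("records")` would produce such a list.

```python
def find_row_problems(rows: list[dict]) -> list[str]:
    """Return human-readable problems found in customer rows."""
    problems: list[str] = []
    seen_ids: set = set()
    for index, row in enumerate(rows):
        customer_id = row.get("customer_id")
        if customer_id in seen_ids:
            problems.append(f"row {index}: duplicate customer_id {customer_id!r}")
        seen_ids.add(customer_id)

        email = str(row.get("email", ""))
        # Very loose shape check (something@domain.tld), not a full validator.
        local, at, domain = email.partition("@")
        if not (local and at and "." in domain):
            problems.append(f"row {index}: suspicious email {email!r}")
    return problems


# Assumption: df is a pandas DataFrame returned by make().
# problems = find_row_problems(df.to_dict("records"))
```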

ML Dataset with Target Column

df = make(
    columns={
        "tenure_months": "Months as customer, 1-60",
        "monthly_spend": "Monthly spending in USD, 10-500",
        "support_tickets": "Number of support tickets, 0-10",
    },
    target={
        "name": "churned",
        "prompt": "Boolean indicating if customer churned"
    },
    num_rows=500
)
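
Before training on a generated dataset, it usually helps to separate the feature columns from the target and to check how balanced the labels are. The following is a minimal sketch using plain Python rows; `split_features_target` and `class_balance` are hypothetical helpers, and the commented usage assumes `df` is a pandas DataFrame, which this README does not confirm.

```python
from collections import Counter


def split_features_target(rows: list[dict], target_name: str):
    """Separate each row into a feature dict and its target value."""
    features = [{k: v for k, v in row.items() if k != target_name} for row in rows]
    targets = [row[target_name] for row in rows]
    return features, targets


def class_balance(targets: list) -> dict:
    """Fraction of rows per target value, useful for spotting skewed labels."""
    counts = Counter(targets)
    total = len(targets)
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


# Assumption: df is a pandas DataFrame returned by make().
# X, y = split_features_target(df.to_dict("records"), "churned")
# print(class_balance(y))
```

A heavily skewed balance is a hint to adjust the target prompt or to resample before fitting a classifier.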

Data Quality Degradation

# Generate dataset with intentional quality issues for testing data pipelines
df = make(
    columns={
        "name": "Person's full name",
        "age": "Age between 20 and 60",
        "salary": "Annual salary in USD, 30000-150000",
    },
    num_rows=100,
    quality_issues=["nulls", "outliers"],  # Options: nulls, outliers, typos, duplicates
)
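
To make the "nulls" option concrete, here is a sketch of what null injection can look like in principle: pick a fraction of cells at random and blank them out. This is not makeitup's implementation; `inject_nulls`, its `fraction` parameter, and its `seed` parameter are hypothetical names chosen for the illustration.

```python
import random


def inject_nulls(rows: list[dict], fraction: float, seed: int = 0) -> list[dict]:
    """Return copies of rows with roughly `fraction` of all cells set to None.

    The number of affected cells is len(cells) * fraction, rounded.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)  # seeded so a failing pipeline test can be replayed
    damaged = [dict(row) for row in rows]  # copy so the input stays untouched
    cells = [(i, key) for i, row in enumerate(damaged) for key in row]
    for i, key in rng.sample(cells, k=round(len(cells) * fraction)):
        damaged[i][key] = None
    return damaged
```

Sampling distinct cells, rather than flipping a coin per cell, gives an exact null count, which makes assertions in pipeline tests simpler.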

Save to File

# CSV, JSON, Parquet, or Excel - format detected from extension
df = make(
    columns={"name": "Product name", "price": "Price in USD, 10-1000"},
    num_rows=200,
    output_path="products.csv"
)

Output Formats

Format    Extension
CSV       .csv
JSON      .json
Parquet   .parquet
Excel     .xlsx
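
Detecting the format from the file extension can be sketched as a simple lookup. The function `detect_format`, the table `FORMATS_BY_EXTENSION`, and the error raised for unknown extensions are assumptions for illustration; how makeitup handles an unsupported extension is not stated here.

```python
from pathlib import Path

# Extension-to-format lookup, mirroring the Output Formats list.
FORMATS_BY_EXTENSION = {
    ".csv": "csv",
    ".json": "json",
    ".parquet": "parquet",
    ".xlsx": "excel",
}


def detect_format(output_path: str) -> str:
    """Map a file path to a format name based on its extension (case-insensitive)."""
    suffix = Path(output_path).suffix.lower()
    try:
        return FORMATS_BY_EXTENSION[suffix]
    except KeyError:
        raise ValueError(f"Unsupported output extension: {suffix!r}") from None
```

For example, `detect_format("products.csv")` returns `"csv"`, while a path ending in `.txt` raises `ValueError` in this sketch.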

Requirements

  • Python >= 3.12
  • OpenAI API key or a local model (Ollama, vLLM, etc.)

Documentation

See DEVELOPER.md for technical details, API reference, and development setup.

License

See LICENSE file for details.

