LLM-based synthetic dataset generation

makeitup

Generate synthetic datasets with an LLM. Describe your columns in plain English and get realistic data back.

from makeitup import make

df = make(
    columns={
        "name": "Person's full name",
        "age": "Age between 25 and 55",
        "email": "Work email address",
    },
    num_rows=100
)

Quick Start

# Install from a source checkout (editable mode)
uv venv && source .venv/bin/activate
uv pip install -e .

# Configure
cp .env.example .env
# Add your OpenAI API key to .env
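Before generating data, it can help to confirm that the key is actually visible. The following is a minimal sketch, not part of makeitup: it assumes the variable is named OPENAI_API_KEY (the usual OpenAI convention, not confirmed by this README) and that .env holds plain KEY=VALUE lines. The helper names read_env_file and api_key_configured are hypothetical.

```python
import os
from pathlib import Path


def read_env_file(path=".env"):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw in env_path.read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def api_key_configured(path=".env", name="OPENAI_API_KEY"):
    """True if `name` is set in the process environment or in the .env file.

    The default variable name is an assumption, not taken from makeitup.
    """
    return bool(os.environ.get(name) or read_env_file(path).get(name))
```

Calling api_key_configured() from the project root returns False when neither source defines the key, which is cheaper to discover than a failed API request.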

Examples

Basic Data

from makeitup import make

# Customer data
df = make(
    columns={
        "customer_id": "Unique customer identifier",
        "name": "Customer full name",
        "email": "Email address",
        "signup_date": "Date when customer signed up, 2020-2024",
    },
    num_rows=100
)

ML Dataset with Target Column

from makeitup import make

df = make(
    columns={
        "tenure_months": "Months as customer, 1-60",
        "monthly_spend": "Monthly spending in USD, 10-500",
        "support_tickets": "Number of support tickets, 0-10",
    },
    target={
        "name": "churned",
        "prompt": "Boolean indicating if customer churned"
    },
    num_rows=500
)
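Because the target values come from an LLM, it may be worth checking how the classes are distributed before training on them. A minimal sketch using only the standard library; class_balance is a hypothetical helper, and the comment about extracting labels assumes make returns a pandas DataFrame, which this README does not state explicitly.

```python
from collections import Counter


def class_balance(labels):
    """Return the fraction of rows for each distinct label value."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


# Assumption: if df is a pandas DataFrame, labels could be taken with
#   labels = df["churned"].tolist()
# Hand-written labels are used here so the sketch runs without an API key.
labels = [True, False, False, False]
balance = class_balance(labels)
```

A heavily skewed result suggests making the target prompt more specific about the expected churn rate.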

Data Quality Degradation

from makeitup import make

# Generate a dataset with intentional quality issues for testing data pipelines
df = make(
    columns={
        "name": "Person's full name",
        "age": "Age between 20 and 60",
        "salary": "Annual salary in USD, 30000-150000",
    },
    num_rows=100,
    quality_issues=["nulls", "outliers"],  # Options: nulls, outliers, typos, duplicates
)
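A pipeline test would typically assert that the injected issues get detected. The sketch below shows one way to count nulls and find out-of-range values in a list of row dictionaries. The helper names and sample rows are illustrative assumptions, not makeitup output, and the note on converting df assumes a pandas DataFrame.

```python
import math


def is_null(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def count_nulls(rows, column):
    """Number of rows where `column` is missing."""
    return sum(1 for row in rows if is_null(row.get(column)))


def out_of_range(rows, column, low, high):
    """Non-null values of `column` that fall outside [low, high]."""
    return [
        row[column]
        for row in rows
        if not is_null(row.get(column)) and not (low <= row[column] <= high)
    ]


# Assumption: with a pandas DataFrame, rows = df.to_dict("records").
# Hand-written rows stand in for generated data here.
rows = [
    {"name": "A", "age": 30, "salary": 50_000},
    {"name": "B", "age": None, "salary": 9_000_000},
    {"name": "C", "age": float("nan"), "salary": 40_000},
]
null_ages = count_nulls(rows, "age")
bad_salaries = out_of_range(rows, "salary", 30_000, 150_000)
```

The range bounds mirror the column descriptions above, so any value the check flags was either injected as an outlier or generated outside the requested range.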

Save to File

from makeitup import make

# CSV, JSON, Parquet, or Excel - format detected from extension
df = make(
    columns={"name": "Product name", "price": "Price in USD, 10-1000"},
    num_rows=200,
    output_path="products.csv"
)

Output Formats

Format   Extension
CSV      .csv
JSON     .json
Parquet  .parquet
Excel    .xlsx
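Detecting the format from the extension can be pictured as a lookup on the path suffix. This is only a sketch of the idea using the table above; it is not makeitup's actual implementation, and detect_format and FORMATS are hypothetical names.

```python
from pathlib import Path

# Extension-to-format mapping copied from the table above.
FORMATS = {
    ".csv": "CSV",
    ".json": "JSON",
    ".parquet": "Parquet",
    ".xlsx": "Excel",
}


def detect_format(output_path):
    """Return the format name for a path, based on its (case-insensitive) suffix."""
    suffix = Path(output_path).suffix.lower()
    if suffix not in FORMATS:
        raise ValueError(
            f"Unsupported extension {suffix!r}; expected one of {sorted(FORMATS)}"
        )
    return FORMATS[suffix]
```

Validating the path this way before calling make avoids spending API calls on a dataset that cannot be saved.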

Requirements

  • Python >= 3.12
  • OpenAI API key

Documentation

See DEVELOPER.md for technical details, API reference, and development setup.

License

See LICENSE file for details.
