LLM-based synthetic dataset generation
Project description
makeitup
Generate synthetic datasets using an LLM. Describe your columns in plain English and get realistic data back.
from makeitup import make
df = make(
    columns={
        "name": "Person's full name",
        "age": "Age between 25 and 55",
        "email": "Work email address",
    },
    num_rows=100,
)
Quick Start
# Install
uv venv && source .venv/bin/activate
uv pip install -e .
# Configure
cp .env.example .env
# Add your OpenAI API key to .env
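Before calling `make`, it can help to confirm that a key is actually visible to the process. The sketch below assumes the key is exposed as the `OPENAI_API_KEY` environment variable, which is the name the OpenAI Python SDK reads by default; the name makeitup expects is whatever `.env.example` defines and may differ. `api_key_configured` is a hypothetical helper, not part of makeitup.

```python
import os


def api_key_configured(env=None, var_name="OPENAI_API_KEY"):
    """Return True if `var_name` holds a non-blank value in `env`.

    `var_name` is an assumption; check .env.example for the real name.
    """
    # Default to the real process environment when no mapping is passed.
    env = os.environ if env is None else env
    return bool(env.get(var_name, "").strip())


if __name__ == "__main__":
    if not api_key_configured():
        print("No API key found; check your .env file.")
```

Passing a plain dict as `env` makes the check easy to exercise in tests without touching the real environment.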
Examples
Basic Data
from makeitup import make
# Customer data
df = make(
    columns={
        "customer_id": "Unique customer identifier",
        "name": "Customer full name",
        "email": "Email address",
        "signup_date": "Date when customer signed up, 2020-2024",
    },
    num_rows=100,
)
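Column descriptions are plain-English hints, so generated values may not always respect them. The following is a minimal sketch of a post-generation check. It assumes the rows are available as a list of dicts (for example via `df.to_dict("records")` if `df` is a pandas DataFrame, which this README does not state) and that dates arrive as ISO `YYYY-MM-DD` strings. `check_rows` and `EXPECTED_COLUMNS` are hypothetical names used for illustration.

```python
from datetime import date

# Column names taken from the customer example above.
EXPECTED_COLUMNS = {"customer_id", "name", "email", "signup_date"}


def check_rows(rows, first_year=2020, last_year=2024):
    """Return (row_index, problem) tuples for rows that look wrong."""
    problems = []
    for i, row in enumerate(rows):
        missing = EXPECTED_COLUMNS - row.keys()
        if missing:
            problems.append((i, f"missing columns: {sorted(missing)}"))
            continue
        try:
            signup = date.fromisoformat(str(row["signup_date"]))
        except ValueError:
            problems.append((i, "signup_date is not an ISO date"))
            continue
        if not first_year <= signup.year <= last_year:
            problems.append(
                (i, f"signup_date outside {first_year}-{last_year}")
            )
    return problems
```

An empty result means every row has the expected columns and a sign-up date in range; anything else points at the rows worth inspecting or regenerating.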
ML Dataset with Target Column
from makeitup import make
df = make(
    columns={
        "tenure_months": "Months as customer, 1-60",
        "monthly_spend": "Monthly spending in USD, 10-500",
        "support_tickets": "Number of support tickets, 0-10",
    },
    target={
        "name": "churned",
        "prompt": "Boolean indicating if customer churned",
    },
    num_rows=500,
)
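For model training, the target column usually has to be separated from the features. The sketch below shows one way to do that on plain row dicts, assuming the rows have been extracted from the result (the README does not say what type `make` returns) and that the target column is named `churned` as in the example above. `split_features_target` is a hypothetical helper.

```python
def split_features_target(rows, target_name="churned"):
    """Split row dicts into a feature list X and a target list y."""
    # Every key except the target becomes a feature.
    X = [{k: v for k, v in row.items() if k != target_name} for row in rows]
    y = [row[target_name] for row in rows]
    return X, y


# Stand-in rows shaped like the churn example; not real generated output.
sample = [
    {"tenure_months": 12, "monthly_spend": 80.0, "support_tickets": 1, "churned": False},
    {"tenure_months": 3, "monthly_spend": 25.0, "support_tickets": 7, "churned": True},
]
X, y = split_features_target(sample)
```

If `df` turns out to be a pandas DataFrame, `df.drop(columns=["churned"])` and `df["churned"]` would give the same split directly.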
Data Quality Degradation
from makeitup import make
# Generate a dataset with intentional quality issues for testing data pipelines
df = make(
    columns={
        "name": "Person's full name",
        "age": "Age between 20 and 60",
        "salary": "Annual salary in USD, 30000-150000",
    },
    num_rows=100,
    quality_issues=["nulls", "outliers"],  # Options: nulls, outliers, typos, duplicates
)
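When a degraded dataset is fed into a pipeline under test, a quick summary of what was actually injected gives the test something to assert against. The sketch below counts null cells and repeated rows. It assumes rows are plain dicts and that nulls appear as `None` or float NaN; how makeitup represents them is not documented here. `summarize_quality` is a hypothetical helper.

```python
import math
from collections import Counter


def _is_null(value):
    """Treat None and float NaN as null (an assumption about the data)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def summarize_quality(rows):
    """Count null cells per column and rows that repeat an earlier row."""
    null_counts = Counter()
    for row in rows:
        for column, value in row.items():
            if _is_null(value):
                null_counts[column] += 1
    # Sort items by column name so key order does not affect comparison.
    seen = Counter(
        tuple(sorted(row.items(), key=lambda kv: kv[0])) for row in rows
    )
    duplicate_rows = sum(count - 1 for count in seen.values())
    return {"nulls": dict(null_counts), "duplicate_rows": duplicate_rows}
```

Outliers and typos need domain-specific rules (for instance a salary far outside 30000-150000), so they are left out of this sketch.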
Save to File
from makeitup import make
# CSV, JSON, Parquet, or Excel - format detected from extension
df = make(
    columns={"name": "Product name", "price": "Price in USD, 10-1000"},
    num_rows=200,
    output_path="products.csv",
)
Output Formats
| Format | Extension |
|---|---|
| CSV | .csv |
| JSON | .json |
| Parquet | .parquet |
| Excel | .xlsx |
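The table above implies a lookup from file extension to format. The sketch below illustrates how such detection could work with `pathlib`; it is not makeitup's actual implementation. `detect_format` and `FORMATS` are hypothetical names, and the case-insensitive matching is an assumption.

```python
from pathlib import Path

# Extension-to-format mapping copied from the table above.
FORMATS = {
    ".csv": "CSV",
    ".json": "JSON",
    ".parquet": "Parquet",
    ".xlsx": "Excel",
}


def detect_format(output_path):
    """Return the format name for `output_path`, based on its extension."""
    suffix = Path(output_path).suffix.lower()
    try:
        return FORMATS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported extension: {suffix!r}") from None
```

Running a check like this before generation avoids spending API calls on a dataset that cannot be written to the requested path.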
Requirements
- Python >= 3.12
- OpenAI API key
Documentation
See DEVELOPER.md for technical details, API reference, and development setup.
License
See LICENSE file for details.
Download files
Source Distribution
makeitup-0.1.0.tar.gz (12.4 kB)
Built Distribution
makeitup-0.1.0-py3-none-any.whl (10.2 kB)
File details
Details for the file makeitup-0.1.0.tar.gz.
File metadata
- Download URL: makeitup-0.1.0.tar.gz
- Upload date:
- Size: 12.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | df6d6f0fd180ae1fbb3a9e0a1e6e56e14b6ed442378553309cba5f8841cc55d1 |
| MD5 | 4209bf39f9954c52fbe78f30c52e8a2a |
| BLAKE2b-256 | 5404aa8c6f55d6471570c09aac6633a0c9035ff9575fc3c777365ed852e865b8 |
File details
Details for the file makeitup-0.1.0-py3-none-any.whl.
File metadata
- Download URL: makeitup-0.1.0-py3-none-any.whl
- Upload date:
- Size: 10.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 06225a28b22595181188700f06ea2a686e2a242a846412c892d7dfa1623b7b3a |
| MD5 | 85d4fec903cfce9fedc335325a9da557 |
| BLAKE2b-256 | 4e6c09bb3d880add2b5823c6913df0652e9bc1fcac07c51154aba736fed33a96 |