LLM-based synthetic dataset generation
makeitup
Generate synthetic datasets for ML training using an LLM. Describe your columns in plain English and get realistic data back.
from makeitup import make

df = make(
    columns={
        "name": "Person's full name",
        "age": "Age between 25 and 55",
        "email": "Work email address",
    },
    num_rows=100
)
Features
- Plain English columns - Describe what you want, get realistic data back
- ML-ready datasets - Add target columns for classification or regression
- Data quality testing - Inject nulls, outliers, typos, or duplicates to test your pipelines
- Multiple formats - Export to CSV, JSON, Parquet, or Excel
- Local model support - Works with OpenAI, Ollama, vLLM, LM Studio, and any OpenAI-compatible API
Installation
pip install makeitup
Set your OpenAI API key:
export OPENAI_API_KEY=your-api-key
Or create a .env file in your project with OPENAI_API_KEY=your-api-key.
Using a Local Model
To use a locally deployed model (Ollama, vLLM, LM Studio, etc.), set the base URL, the model name, and, if your server requires one, an API key:
export LLM_BASE_URL=http://localhost:11434/v1
export LLM_MODEL=llama3
export LLM_API_KEY=not-needed # Required by some local servers
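The same settings can also be applied from Python, which is convenient in notebooks where exporting shell variables is awkward. The sketch below is illustrative only: the variable names come from the shell example above, while the helper `configure_local_model` is hypothetical and not part of makeitup. It assumes the library reads these variables from the process environment, so it is safest to run it before importing makeitup.

```python
import os

# Variable names are taken from the shell example; the values are placeholders.
LOCAL_SETTINGS = {
    "LLM_BASE_URL": "http://localhost:11434/v1",
    "LLM_MODEL": "llama3",
    "LLM_API_KEY": "not-needed",  # some local servers require a non-empty key
}


def configure_local_model(settings: dict[str, str]) -> None:
    """Set each variable unless the surrounding shell already defines it."""
    for key, value in settings.items():
        # setdefault keeps any value that was exported before Python started.
        os.environ.setdefault(key, value)


configure_local_model(LOCAL_SETTINGS)
```

Using `setdefault` means a value exported in the shell takes precedence over the defaults in the script.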
Examples
Basic Data
from makeitup import make

# Customer data
df = make(
    columns={
        "customer_id": "Unique customer identifier",
        "name": "Customer full name",
        "email": "Email address",
        "signup_date": "Date when customer signed up, 2020-2024",
    },
    num_rows=100
)
ML Dataset with Target Column
from makeitup import make

# Feature columns plus a target column to predict
df = make(
    columns={
        "tenure_months": "Months as customer, 1-60",
        "monthly_spend": "Monthly spending in USD, 10-500",
        "support_tickets": "Number of support tickets, 0-10",
    },
    target={
        "name": "churned",
        "prompt": "Boolean indicating if customer churned"
    },
    num_rows=500
)
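Once a dataset with a target column exists, the usual next step is to separate features from labels. The sketch below shows one way to do that on plain Python records. It uses hand-written stand-in rows rather than calling `make()`, and the helper `split_features_target` is hypothetical. If `make()` returns a pandas DataFrame (an assumption; the README only shows the result assigned to `df`), the rows could be obtained with `df.to_dict("records")`.

```python
from collections import Counter

# Stand-in rows shaped like the churn example; not real makeitup output.
rows = [
    {"tenure_months": 3, "monthly_spend": 42.5, "support_tickets": 7, "churned": True},
    {"tenure_months": 48, "monthly_spend": 310.0, "support_tickets": 0, "churned": False},
    {"tenure_months": 12, "monthly_spend": 95.0, "support_tickets": 2, "churned": False},
]


def split_features_target(rows, target):
    """Return (features, labels), with the target column removed from features."""
    features = [{k: v for k, v in row.items() if k != target} for row in rows]
    labels = [row[target] for row in rows]
    return features, labels


X, y = split_features_target(rows, "churned")

# A quick class-balance check before training.
balance = Counter(y)
```

Checking the class balance early is worthwhile with generated data, since nothing in the column description guarantees a particular ratio of positive to negative labels.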
Data Quality Degradation
from makeitup import make

# Generate a dataset with intentional quality issues for testing data pipelines
df = make(
    columns={
        "name": "Person's full name",
        "age": "Age between 20 and 60",
        "salary": "Annual salary in USD, 30000-150000",
    },
    num_rows=100,
    quality_issues=["nulls", "outliers"],  # Options: nulls, outliers, typos, duplicates
)
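A pipeline test typically asserts that injected issues are detected. The sketch below counts missing values, out-of-range values, and exact duplicate rows in plain Python records. The rows are hand-written stand-ins with one issue of each kind, not makeitup output, and `quality_report` is a hypothetical helper. The ranges mirror the column descriptions in the example above.

```python
import math

# Stand-in rows with one injected issue of each kind.
rows = [
    {"name": "Ana Ruiz", "age": 34, "salary": 72000},
    {"name": None, "age": 41, "salary": 88000},           # null
    {"name": "Tom Lee", "age": 29, "salary": 9_500_000},  # outlier
    {"name": "Ana Ruiz", "age": 34, "salary": 72000},     # duplicate of row 1
]


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def quality_report(rows, ranges):
    """Count nulls, out-of-range values, and exact duplicate rows."""
    nulls = sum(is_missing(v) for row in rows for v in row.values())
    out_of_range = sum(
        1
        for row in rows
        for col, (lo, hi) in ranges.items()
        if not is_missing(row[col]) and not lo <= row[col] <= hi
    )
    seen, duplicates = set(), 0
    for row in rows:
        # Sort by column name so the key does not depend on dict ordering.
        key = tuple(sorted(row.items(), key=lambda kv: kv[0]))
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return {"nulls": nulls, "out_of_range": out_of_range, "duplicates": duplicates}


report = quality_report(rows, {"age": (20, 60), "salary": (30000, 150000)})
```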
Save to File
from makeitup import make

# CSV, JSON, Parquet, or Excel - format detected from the file extension
df = make(
    columns={"name": "Product name", "price": "Price in USD, 10-1000"},
    num_rows=200,
    output_path="products.csv"
)
Output Formats
| Format | Extension |
|---|---|
| CSV | .csv |
| JSON | .json |
| Parquet | .parquet |
| Excel | .xlsx |
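To illustrate what extension-based detection means in practice, here is a minimal sketch of a lookup built from the table above. It is not makeitup's actual implementation; `detect_format` and `FORMATS` are hypothetical names, and the case-insensitive matching is an assumption.

```python
from pathlib import Path

# Extension-to-format mapping copied from the table above.
FORMATS = {".csv": "CSV", ".json": "JSON", ".parquet": "Parquet", ".xlsx": "Excel"}


def detect_format(output_path: str) -> str:
    """Return the format name for a path, based only on its final extension."""
    suffix = Path(output_path).suffix.lower()
    if suffix not in FORMATS:
        raise ValueError(
            f"Unsupported extension {suffix!r}; expected one of {sorted(FORMATS)}"
        )
    return FORMATS[suffix]
```

Validating a path this way before a long generation run avoids spending LLM calls on output that cannot be saved.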
Requirements
- Python >= 3.12
- OpenAI API key or a local model (Ollama, vLLM, etc.)
Documentation
See DEVELOPER.md for technical details, API reference, and development setup.
License
See LICENSE file for details.
File details
Details for the file makeitup-0.1.1.tar.gz.
File metadata
- Download URL: makeitup-0.1.1.tar.gz
- Upload date:
- Size: 12.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 918afd4dfa999d1fe1838e91bf232644275219ac9276e51cc4f74ea3f5a5968e |
| MD5 | a8c2b118cc52ba657c59c515942b2834 |
| BLAKE2b-256 | ef436f204d6e537265b8209d848fd6d7b1793c8fd04e8174a8a5411668d881e3 |
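The SHA256 digest listed for the source distribution can be checked against a downloaded copy with the standard library. The sketch below is one way to do that; the helper names are hypothetical, and the file path in the usage note is an assumption about where the download was saved.

```python
import hashlib

# SHA256 digest published for makeitup-0.1.1.tar.gz.
EXPECTED_SHA256 = "918afd4dfa999d1fe1838e91bf232644275219ac9276e51cc4f74ea3f5a5968e"


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large files are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: str, expected: str = EXPECTED_SHA256) -> bool:
    """Compare the file's digest with the published one, ignoring hex case."""
    return sha256_of(path) == expected.lower()
```

For example, `matches_expected("makeitup-0.1.1.tar.gz")` would return `True` only if the file in the current directory has the published digest.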
Provenance
The following attestation bundles were made for makeitup-0.1.1.tar.gz:
Publisher: publish.yml on tkopczynski/makeitup
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: makeitup-0.1.1.tar.gz
- Subject digest: 918afd4dfa999d1fe1838e91bf232644275219ac9276e51cc4f74ea3f5a5968e
- Sigstore transparency entry: 715622547
- Sigstore integration time:
- Permalink: tkopczynski/makeitup@3601064353476fdefb1239f7ff65d8e2460adae0
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/tkopczynski
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@3601064353476fdefb1239f7ff65d8e2460adae0
- Trigger Event: release
File details
Details for the file makeitup-0.1.1-py3-none-any.whl.
File metadata
- Download URL: makeitup-0.1.1-py3-none-any.whl
- Upload date:
- Size: 10.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d3699b83b30879bd12db323dc382882224817108b2ba49e93fc1e11d0283899b |
| MD5 | e903809d03a38bfc52aee811a59e774d |
| BLAKE2b-256 | 0d2866ab9561be0e9cc8575d3cddcccc566860fd2974fce2219b608c9a685489 |
Provenance
The following attestation bundles were made for makeitup-0.1.1-py3-none-any.whl:
Publisher: publish.yml on tkopczynski/makeitup
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: makeitup-0.1.1-py3-none-any.whl
- Subject digest: d3699b83b30879bd12db323dc382882224817108b2ba49e93fc1e11d0283899b
- Sigstore transparency entry: 715622550
- Sigstore integration time:
- Permalink: tkopczynski/makeitup@3601064353476fdefb1239f7ff65d8e2460adae0
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/tkopczynski
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@3601064353476fdefb1239f7ff65d8e2460adae0
- Trigger Event: release