Datacrafter
AI-based, schema-driven synthetic data generator with a plugin architecture.
Design datasets in YAML, generate realistic CSV / JSON / JSONL / XML / Parquet files, and extend functionality with custom providers or writers — no core changes required.
✨ Features
- Schema-driven – Define structure, constraints, and output using YAML
- Deterministic – Use `seed` for reproducible datasets
- Rich providers – `uuid`, `integer`, `float`, `boolean`, `categorical`, `datetime`, `person.*`, `text.*`, `geo`
- Advanced controls – `unique`, `null_rate`, `regex`, distributions
- Templating – `${first}.${last}@domain.com`
- Multiple formats – CSV, JSON, JSONL, XML, Parquet
- Plugin architecture – Extend without modifying core
- CLI + Python API
📦 Installation
pip install datacrafter-ai
Requirements: Python 3.9+
🚀 Quickstart (CLI)
1. Create a schema (`examples/simple.yaml`):

```yaml
version: 1
seed: 42
rows: 20

fields:
  id:
    type: uuid
  name:
    type: person.name
  age:
    type: integer
    params:
      min: 18
      max: 60

output:
  format: csv
  path: ./output/simple.csv
```

2. Generate data:

```shell
datacrafter generate --schema examples/simple.yaml
```

3. The generated file appears at `./output/simple.csv`.
🧠 Quickstart (Python)
```python
from datacrafter.schema_loader import load_schema
from datacrafter.generator import Generator

schema = load_schema("examples/simple.yaml")
gen = Generator(schema)

rows = gen.generate()  # generate the rows in memory
gen.write()            # write them to the configured output path
```
🧾 YAML Schema (v1)
| Key | Type | Required | Description |
|---|---|---|---|
| version | int | Yes | Schema version (use 1) |
| seed | int | No | Deterministic output seed |
| rows | int | Yes* | Number of rows |
| fields | map | Yes* | Column definitions |
| output | map | Yes* | Output configuration |
| datasets | list | No | Multi-dataset support |
\*Required when `datasets` is not used.
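The table above lists a `datasets` key for multi-dataset support. As a sketch of what a multi-dataset schema could look like (the exact layout under `datasets` is an assumption based on the table, not a documented example):

```yaml
version: 1
seed: 7

# Hypothetical layout: each entry carries its own rows/fields/output,
# so the top-level rows/fields/output keys are omitted.
datasets:
  - name: customers
    rows: 100
    fields:
      id: { type: uuid }
    output: { format: csv, path: ./out/customers.csv }
  - name: orders
    rows: 500
    fields:
      order_id: { type: uuid }
    output: { format: jsonl, path: ./out/orders.jsonl }
```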
📌 Field Definition
```yaml
<column_name>:
  type: <provider.name>
  params: {}
  unique: false
  null_rate: 0.0
  regex: null
  distribution:
    name: normal
    mean: 35
    std: 10
    min: 18
    max: 75
  categorical:
    values: [IN, US, DE]
    weights: [0.6, 0.3, 0.1]
  template: "${first}.${last}@${domain}"
  depends_on: ["first", "last", "domain"]
  transform: ["lower", "strip"]
```
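The `template`, `depends_on`, and `transform` keys compose one field from others: the fields named in `depends_on` are generated first, substituted into the `${...}` placeholders, then each transform is applied in order. A minimal pure-Python sketch of that idea (illustrative only; `string.Template` stands in for whatever Datacrafter does internally):

```python
from string import Template

# A generated row supplying the dependencies named in depends_on.
row = {"first": "Ada", "last": "Lovelace", "domain": "Example.COM"}

# Resolve the template against the row.
email = Template("${first}.${last}@${domain}").substitute(row)

# Apply the transforms from the example above: lower, then strip.
for transform in ("lower", "strip"):
    email = getattr(str, transform)(email)

print(email)  # ada.lovelace@example.com
```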
📤 Output Configuration
```yaml
output:
  format: csv
  path: ./out/customers.csv
  options:
    delimiter: ","
    header: true
    encoding: "utf-8"
```
🧩 Built-in Providers
- IDs → `uuid`, `id.incremental`
- Numeric → `integer`, `float`
- Boolean → `boolean`
- Text → `text.lorem`, `text.short`, `text.word`, `string.regex`
- Person → `person.*`
- Datetime → `datetime`
- Categorical → `categorical`
- Geo → `geo.country`
🎛️ Constraints & Validation
- `unique` → Enforces uniqueness
- `null_rate` → Probability of null values
- `regex` → Validation
- `distribution` → Statistical control
- `template` → Field composition
- `depends_on` → Dependency ordering
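As a rough illustration of how `seed`, `null_rate`, and `unique` interact (a pure-Python sketch, not Datacrafter's actual implementation):

```python
import random

rng = random.Random(42)  # a fixed seed makes every run identical

def gen_age():
    return rng.randint(18, 60)

seen = set()
rows = []
for _ in range(10):
    # null_rate: emit None with the configured probability (0.2 here)
    if rng.random() < 0.2:
        rows.append(None)
        continue
    # unique: retry until an unseen value appears; this stalls if the
    # value domain is too small relative to the row count
    value = gen_age()
    while value in seen:
        value = gen_age()
    seen.add(value)
    rows.append(value)

print(rows)
```

This also shows why "Unique errors → Increase domain size" appears in Troubleshooting: the retry loop cannot terminate once every value in the domain has been used.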
🖥️ CLI Reference
```shell
# Generate data
datacrafter generate --schema schema.yaml

# Validate schema
datacrafter validate --schema schema.yaml

# List providers & writers
datacrafter list providers
datacrafter list writers

# Create starter schema
datacrafter init --template minimal
```
🔌 Plugins
Install external plugins:

```shell
pip install datacrafter-healthcare
pip install datacrafter-parquet-writer
```

Example plugin registration (in the plugin's `pyproject.toml`):

```toml
[project.entry-points."datacrafter.providers"]
health = "dc_health.providers:register"

[project.entry-points."datacrafter.writers"]
parquet = "dc_parquet.writer:register"
```
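Each entry point resolves to a `register` callable. The sketch below shows the general shape of such a plugin; the registry object (a plain dict here) and the `health.icd10` provider name are hypothetical, since the real registration interface isn't shown above:

```python
import random

def icd10_stub(rng: random.Random) -> str:
    """Toy healthcare provider: return a fake diagnosis-style code."""
    return f"{rng.choice('ABC')}{rng.randint(10, 99)}"

def register(providers: dict) -> None:
    # Called by the host once the entry point is loaded; the plugin
    # adds its providers under namespaced keys.
    providers["health.icd10"] = icd10_stub

# What the host side might do after resolving the entry point:
providers = {}
register(providers)
print(providers["health.icd10"](random.Random(0)))
```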
🧪 Example Schemas
Customers (CSV)

```yaml
version: 1
rows: 5000

fields:
  id: { type: uuid, unique: true }
  first: { type: person.first_name }
  last: { type: person.last_name }

output:
  format: csv
  path: ./out/customers.csv
```

Events (JSONL)

```yaml
version: 1
rows: 10000

fields:
  event_id: { type: uuid, unique: true }
  user_id: { type: id.incremental }

output:
  format: jsonl
  path: ./out/events.jsonl
```

Articles (XML)

```yaml
version: 1
rows: 200

fields:
  uid: { type: uuid }

output:
  format: xml
  path: ./out/articles.xml
```
🛠️ Troubleshooting
- PyPI name conflict → Change the project name
- Determinism issues → Set `seed`
- Unique errors → Increase the value domain size
- Performance issues → Use chunking
📦 Development
```shell
python -m pip install --upgrade build twine
python -m build
twine check dist/*
```

Publish:

```shell
twine upload dist/*
```
🔒 License
MIT © 2026 Mahalakshmi Shanmuga Sundaram
🏢 About
Datacrafter is developed and maintained by DHS Tech Services.
🙌 Acknowledgements
Inspired by modern synthetic data generation and schema-driven design.
File details
Details for the file datacrafter_ai-0.1.0.tar.gz.
File metadata
- Download URL: datacrafter_ai-0.1.0.tar.gz
- Upload date:
- Size: 23.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `594e676f6bc453080163e709b6fbdabfa87387de2b8ea14a8a9cf6c373508cc9` |
| MD5 | `3c309bc0f97971d079ef4f25e4f6dccf` |
| BLAKE2b-256 | `7667b355391e6cbf62adb8aba94a00b3d16d662b8d6ba5b30dd6e7fd58bfd49f` |
File details
Details for the file datacrafter_ai-0.1.0-py3-none-any.whl.
File metadata
- Download URL: datacrafter_ai-0.1.0-py3-none-any.whl
- Upload date:
- Size: 30.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `cc44ca50b2d200b8e414f6799ed848b54c574c77b8b13f519faeeacb07034b1b` |
| MD5 | `06d5c169eb1b9b84ddc9ea5be9d74b1d` |
| BLAKE2b-256 | `ef0c9b1d0198b2d643c0e80ba51c2c34884eb6fc39f0449163bd8377145268e4` |