
Datacrafter

AI-based, schema-driven synthetic data generator with a plugin architecture.

Design datasets in YAML, generate realistic CSV / JSON / JSONL / XML / Parquet files, and extend functionality with custom providers or writers — no core changes required.


✨ Features

  • Schema-driven – Define structure, constraints, and output using YAML
  • Deterministic – Use seed for reproducible datasets
  • Rich providers – uuid, integer, float, boolean, categorical, datetime, person.*, text.*, geo.*
  • Advanced controls – unique, null_rate, regex, distributions
  • Templating – ${first}.${last}@domain.com
  • Multiple formats – CSV, JSON, JSONL, XML, Parquet
  • Plugin architecture – Extend without modifying core
  • CLI + Python API

📦 Installation

pip install datacrafter-ai

Requirements: Python 3.9+


🚀 Quickstart (CLI)

1. Create a schema (examples/simple.yaml)

version: 1
seed: 42
rows: 20

fields:
  id:
    type: uuid

  name:
    type: person.name

  age:
    type: integer
    params:
      min: 18
      max: 60

output:
  format: csv
  path: ./output/simple.csv

2. Generate data

datacrafter generate --schema examples/simple.yaml

3. Output

./output/simple.csv

🧠 Quickstart (Python)

from datacrafter.schema_loader import load_schema
from datacrafter.generator import Generator

schema = load_schema("examples/simple.yaml")

gen = Generator(schema)
rows = gen.generate()
gen.write()

🧾 YAML Schema (v1)

Key        Type  Required  Description
version    int   Yes       Schema version (use 1)
seed       int   No        Deterministic output seed
rows       int   Yes*      Number of rows
fields     map   Yes*      Column definitions
output     map   Yes*      Output configuration
datasets   list  No        Multi-dataset support

*Required when datasets is not used
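
When `datasets` is used, the per-dataset layout below is an assumption extrapolated from the table above (each entry carrying its own rows, fields, and output), not confirmed syntax:

```yaml
version: 1
seed: 7

datasets:
  - name: customers
    rows: 1000
    fields:
      id: { type: uuid, unique: true }
    output:
      format: csv
      path: ./out/customers.csv

  - name: orders
    rows: 5000
    fields:
      order_id: { type: uuid }
    output:
      format: jsonl
      path: ./out/orders.jsonl
```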


📌 Field Definition

<column_name>:
  type: <provider.name>
  params: {}
  unique: false
  null_rate: 0.0
  regex: null

  distribution:
    name: normal
    mean: 35
    std: 10
    min: 18
    max: 75

  categorical:
    values: [IN, US, DE]
    weights: [0.6, 0.3, 0.1]

  template: "${first}.${last}@${domain}"
  depends_on: ["first", "last", "domain"]
  transform: ["lower", "strip"]
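
To make the template, depends_on, and transform semantics concrete, here is a short Python sketch of how such a field could be composed; this illustrates the idea only and is not datacrafter's internal implementation:

```python
import re

def render_template(template, row):
    """Substitute ${name} placeholders with values from fields listed in depends_on."""
    return re.sub(r"\$\{(\w+)\}", lambda m: str(row[m.group(1)]), template)

def apply_transforms(value, transforms):
    """Apply the listed string transforms in order (illustrative subset)."""
    for name in transforms:
        if name == "lower":
            value = value.lower()
        elif name == "strip":
            value = value.strip()
    return value

# Compose an email field from already-generated dependencies.
row = {"first": "Ada", "last": "Lovelace", "domain": "example.com"}
email = apply_transforms(
    render_template("${first}.${last}@${domain}", row),
    ["lower", "strip"],
)
```

With the sample row above, this yields "ada.lovelace@example.com": dependencies are rendered first, then transforms run in the order listed.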

📤 Output Configuration

output:
  format: csv
  path: ./out/customers.csv
  options:
    delimiter: ","
    header: true
    encoding: "utf-8"
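
The options map passes format-specific settings to the writer. As a library-independent sanity check of what delimiter and header mean for CSV, here is a round-trip using Python's standard csv module:

```python
import csv
import io

# Write one row the way the options above describe: "," delimiter,
# header included (encoding applies when writing to a real file).
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "name"], delimiter=",")
writer.writeheader()
writer.writerow({"id": 1, "name": "Ada"})

# Read it back; the header row supplies the dict keys.
rows = list(csv.DictReader(io.StringIO(buf.getvalue()), delimiter=","))
```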

🧩 Built-in Providers

  • IDs – uuid, id.incremental
  • Numeric – integer, float
  • Boolean – boolean
  • Text – text.lorem, text.short, text.word, string.regex
  • Person – person.*
  • Datetime – datetime
  • Categorical – categorical
  • Geo – geo.country

🎛️ Constraints & Validation

  • unique → Enforces uniqueness
  • null_rate → Probability of null values
  • regex → Validation
  • distribution → Statistical control
  • template → Field composition
  • depends_on → Dependency ordering
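
A hedged sketch of how unique and null_rate could behave (illustrative semantics only, not datacrafter's internals); it also shows why a too-small value domain makes uniqueness fail:

```python
import random

def with_null_rate(generate, null_rate, rng):
    """Return None with probability null_rate, otherwise a generated value."""
    return None if rng.random() < null_rate else generate()

def generate_unique(generate, count, max_attempts=10_000):
    """Rejection-sample until `count` distinct values are collected."""
    seen = set()
    for _ in range(max_attempts):
        seen.add(generate())
        if len(seen) == count:
            return list(seen)
    raise ValueError("value domain too small for the requested unique count")

rng = random.Random(42)
ids = generate_unique(lambda: rng.randint(1, 1000), 10)
maybe = with_null_rate(lambda: rng.randint(1, 1000), 0.25, rng)
```

Asking for 2000 unique values from a domain of 1000 would exhaust the attempt budget, which is the failure mode behind the "Unique errors" troubleshooting tip below.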

🖥️ CLI Reference

# Generate data
datacrafter generate --schema schema.yaml

# Validate schema
datacrafter validate --schema schema.yaml

# List providers & writers
datacrafter list providers
datacrafter list writers

# Create starter schema
datacrafter init --template minimal

🔌 Plugins

Install external plugins:

pip install datacrafter-healthcare
pip install datacrafter-parquet-writer

Example plugin registration

[project.entry-points."datacrafter.providers"]
health = "dc_health.providers:register"

[project.entry-points."datacrafter.writers"]
parquet = "dc_parquet.writer:register"
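
The entry points above name a register callable. Here is a minimal, hypothetical sketch of such a hook; the registry type, provider signature, and all names are assumptions for illustration, not the actual datacrafter API:

```python
import random

# Stand-in for the core provider registry; the real registry object
# passed by datacrafter may differ.
PROVIDERS = {}

def blood_type(params, rng):
    """Hypothetical domain provider returning a random blood type."""
    return rng.choice(["O+", "A+", "B+", "AB+", "O-", "A-", "B-", "AB-"])

def register(registry):
    """Entry-point target: add this plugin's providers to the registry."""
    registry["health.blood_type"] = blood_type

register(PROVIDERS)
sample = PROVIDERS["health.blood_type"]({}, random.Random(0))
```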

🧪 Example Schemas

Customers (CSV)

version: 1
rows: 5000

fields:
  id: { type: uuid, unique: true }
  first: { type: person.first_name }
  last:  { type: person.last_name }

output:
  format: csv
  path: ./out/customers.csv

Events (JSONL)

version: 1
rows: 10000

fields:
  event_id: { type: uuid, unique: true }
  user_id:  { type: id.incremental }

output:
  format: jsonl
  path: ./out/events.jsonl

Articles (XML)

version: 1
rows: 200

fields:
  uid: { type: uuid }

output:
  format: xml
  path: ./out/articles.xml

🛠️ Troubleshooting

  • PyPI name conflict → Change project name
  • Determinism issues → Set seed
  • Unique errors → Increase domain size
  • Performance issues → Use chunking
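
For the performance tip, "chunking" means streaming rows to disk in fixed-size batches instead of materializing the full dataset in memory. A generic sketch with the standard csv module (not a datacrafter feature):

```python
import csv

def write_in_chunks(path, total_rows, chunk_size, make_row):
    """Stream rows to disk in fixed-size batches to bound memory usage."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["id"])
        writer.writeheader()
        for start in range(0, total_rows, chunk_size):
            stop = min(start + chunk_size, total_rows)
            writer.writerows(make_row(i) for i in range(start, stop))

write_in_chunks("/tmp/chunked.csv", total_rows=10, chunk_size=4,
                make_row=lambda i: {"id": i})
line_count = sum(1 for _ in open("/tmp/chunked.csv", encoding="utf-8"))
```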

📦 Development

python -m pip install --upgrade build twine
python -m build
twine check dist/*

Publish

twine upload dist/*

🔒 License

MIT © 2026 Mahalakshmi Shanmuga Sundaram


🏢 About

Datacrafter is developed and maintained by DHS Tech Services.


🙌 Acknowledgements

Inspired by modern synthetic data generation and schema-driven design.
