
Datacrafter

AI-based, schema-driven synthetic data generator with a plugin architecture.

Design datasets in YAML, generate realistic CSV / JSON / JSONL / XML / Parquet files, and extend functionality with custom providers or writers — no core changes required.


✨ Features

  • Schema-driven – Define structure, constraints, and output using YAML
  • Deterministic – Use seed for reproducible datasets
  • Rich providers – uuid, integer, float, boolean, categorical, datetime, person.*, text.*, geo.*
  • Advanced controls – unique, null_rate, regex, distributions
  • Templating – ${first}.${last}@domain.com
  • Multiple formats – CSV, JSON, JSONL, XML, Parquet
  • Plugin architecture – Extend without modifying core
  • CLI + Python API

📦 Installation

pip install datacrafter-ai

Requirements: Python 3.9+


🚀 Quickstart (CLI)

1. Create a schema (examples/simple.yaml)

version: 1
seed: 42
rows: 20

fields:
  id:
    type: uuid

  name:
    type: person.name

  age:
    type: integer
    params:
      min: 18
      max: 60

output:
  format: csv
  path: ./output/simple.csv

2. Generate data

datacrafter generate --schema examples/simple.yaml

3. Output

./output/simple.csv

🧠 Quickstart (Python)

from datacrafter.schema_loader import load_schema
from datacrafter.generator import Generator

schema = load_schema("examples/simple.yaml")

gen = Generator(schema)
rows = gen.generate()
gen.write()

🧾 YAML Schema (v1)

Key        Type  Required  Description
version    int   Yes       Schema version (use 1)
seed       int   No        Deterministic output seed
rows       int   Yes*      Number of rows
fields     map   Yes*      Column definitions
output     map   Yes*      Output configuration
datasets   list  No        Multi-dataset support

*Required when datasets is not used
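
When `datasets` is used, the per-dataset layout below is an assumption extrapolated from the table above (each entry carrying its own rows, fields, and output), not confirmed syntax:

```yaml
version: 1
seed: 7

datasets:
  - name: customers
    rows: 1000
    fields:
      id: { type: uuid, unique: true }
    output:
      format: csv
      path: ./out/customers.csv

  - name: orders
    rows: 5000
    fields:
      order_id: { type: uuid }
    output:
      format: jsonl
      path: ./out/orders.jsonl
```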


📌 Field Definition

<column_name>:
  type: <provider.name>
  params: {}
  unique: false
  null_rate: 0.0
  regex: null

  distribution:
    name: normal
    mean: 35
    std: 10
    min: 18
    max: 75

  categorical:
    values: [IN, US, DE]
    weights: [0.6, 0.3, 0.1]

  template: "${first}.${last}@${domain}"
  depends_on: ["first", "last", "domain"]
  transform: ["lower", "strip"]
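
To make the template, depends_on, and transform semantics concrete, here is a short Python sketch of how such a field could be composed; this illustrates the idea only and is not datacrafter's internal implementation:

```python
import re

def render_template(template, row):
    """Substitute ${name} placeholders with values from fields listed in depends_on."""
    return re.sub(r"\$\{(\w+)\}", lambda m: str(row[m.group(1)]), template)

def apply_transforms(value, transforms):
    """Apply the listed string transforms in order (illustrative subset)."""
    for name in transforms:
        if name == "lower":
            value = value.lower()
        elif name == "strip":
            value = value.strip()
    return value

# Compose an email field from already-generated dependencies.
row = {"first": "Ada", "last": "Lovelace", "domain": "example.com"}
email = apply_transforms(
    render_template("${first}.${last}@${domain}", row),
    ["lower", "strip"],
)
```

With the sample row above, this yields "ada.lovelace@example.com": dependencies are rendered first, then transforms run in the order listed.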

📤 Output Configuration

output:
  format: csv
  path: ./out/customers.csv
  options:
    delimiter: ","
    header: true
    encoding: "utf-8"
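
The options map passes format-specific settings to the writer. As a library-independent sanity check of what delimiter and header mean for CSV, here is a round-trip using Python's standard csv module:

```python
import csv
import io

# Write one row the way the options above describe: "," delimiter,
# header included (encoding applies when writing to a real file).
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "name"], delimiter=",")
writer.writeheader()
writer.writerow({"id": 1, "name": "Ada"})

# Read it back; the header row supplies the dict keys.
rows = list(csv.DictReader(io.StringIO(buf.getvalue()), delimiter=","))
```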

🧩 Built-in Providers

  • IDs – uuid, id.incremental
  • Numeric – integer, float
  • Boolean – boolean
  • Text – text.lorem, text.short, text.word, string.regex
  • Person – person.*
  • Datetime – datetime
  • Categorical – categorical
  • Geo – geo.country

🎛️ Constraints & Validation

  • unique → Enforces uniqueness
  • null_rate → Probability of null values
  • regex → Validation
  • distribution → Statistical control
  • template → Field composition
  • depends_on → Dependency ordering
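
A hedged sketch of how unique and null_rate could behave (illustrative semantics only, not datacrafter's internals); it also shows why a too-small value domain makes uniqueness fail:

```python
import random

def with_null_rate(generate, null_rate, rng):
    """Return None with probability null_rate, otherwise a generated value."""
    return None if rng.random() < null_rate else generate()

def generate_unique(generate, count, max_attempts=10_000):
    """Rejection-sample until `count` distinct values are collected."""
    seen = set()
    for _ in range(max_attempts):
        seen.add(generate())
        if len(seen) == count:
            return list(seen)
    raise ValueError("value domain too small for the requested unique count")

rng = random.Random(42)
ids = generate_unique(lambda: rng.randint(1, 1000), 10)
maybe = with_null_rate(lambda: rng.randint(1, 1000), 0.25, rng)
```

Asking for 2000 unique values from a domain of 1000 would exhaust the attempt budget, which is the failure mode behind the "Unique errors" troubleshooting tip below.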

🖥️ CLI Reference

# Generate data
datacrafter generate --schema schema.yaml

# Validate schema
datacrafter validate --schema schema.yaml

# List providers & writers
datacrafter list providers
datacrafter list writers

# Create starter schema
datacrafter init --template minimal

🔌 Plugins

Install external plugins:

pip install datacrafter-healthcare
pip install datacrafter-parquet-writer

Example plugin registration

[project.entry-points."datacrafter.providers"]
health = "dc_health.providers:register"

[project.entry-points."datacrafter.writers"]
parquet = "dc_parquet.writer:register"
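
The entry points above name a register callable. Here is a minimal, hypothetical sketch of such a hook; the registry type, provider signature, and all names are assumptions for illustration, not the actual datacrafter API:

```python
import random

# Stand-in for the core provider registry; the real registry object
# passed by datacrafter may differ.
PROVIDERS = {}

def blood_type(params, rng):
    """Hypothetical domain provider returning a random blood type."""
    return rng.choice(["O+", "A+", "B+", "AB+", "O-", "A-", "B-", "AB-"])

def register(registry):
    """Entry-point target: add this plugin's providers to the registry."""
    registry["health.blood_type"] = blood_type

register(PROVIDERS)
sample = PROVIDERS["health.blood_type"]({}, random.Random(0))
```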

🧪 Example Schemas

Customers (CSV)

version: 1
rows: 5000

fields:
  id: { type: uuid, unique: true }
  first: { type: person.first_name }
  last:  { type: person.last_name }

output:
  format: csv
  path: ./out/customers.csv

Events (JSONL)

version: 1
rows: 10000

fields:
  event_id: { type: uuid, unique: true }
  user_id:  { type: id.incremental }

output:
  format: jsonl
  path: ./out/events.jsonl

Articles (XML)

version: 1
rows: 200

fields:
  uid: { type: uuid }

output:
  format: xml
  path: ./out/articles.xml

🛠️ Troubleshooting

  • PyPI name conflict → Change project name
  • Determinism issues → Set seed
  • Unique errors → Increase domain size
  • Performance issues → Use chunking
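
For the performance tip, "chunking" means streaming rows to disk in fixed-size batches instead of materializing the full dataset in memory. A generic sketch with the standard csv module (not a datacrafter feature):

```python
import csv

def write_in_chunks(path, total_rows, chunk_size, make_row):
    """Stream rows to disk in fixed-size batches to bound memory usage."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["id"])
        writer.writeheader()
        for start in range(0, total_rows, chunk_size):
            stop = min(start + chunk_size, total_rows)
            writer.writerows(make_row(i) for i in range(start, stop))

write_in_chunks("/tmp/chunked.csv", total_rows=10, chunk_size=4,
                make_row=lambda i: {"id": i})
line_count = sum(1 for _ in open("/tmp/chunked.csv", encoding="utf-8"))
```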

📦 Development

python -m pip install --upgrade build twine
python -m build
twine check dist/*

Publish

twine upload dist/*

🔒 License

MIT © 2026 Mahalakshmi Shanmuga Sundaram


🏢 About

Datacrafter is developed and maintained by DHS Tech Services.


🙌 Acknowledgements

Inspired by modern synthetic data generation and schema-driven design.
