Datacrafter — AI-based, schema-driven synthetic data generator with a plugin architecture.

Datacrafter

AI-powered, schema-driven synthetic data generation platform.

Design datasets using YAML or generate them using natural language, and produce realistic data in CSV / JSON / JSONL / XML / Parquet formats.


✨ Key Highlights

  • Schema-driven generation using YAML
  • AI-powered schema creation from natural language prompts
  • Formula engine for dynamic, cross-field computations
  • Deterministic output using seed control
  • Multiple formats: CSV, JSON, JSONL, XML, Parquet
  • Plugin architecture for extensibility
  • CLI + Python API
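
To illustrate what seed-controlled, deterministic output means in practice, here is a minimal sketch in plain Python. It does not use Datacrafter's API; `generate_rows` and its field names are hypothetical and only demonstrate the general technique of drawing every value from one seeded random number generator.

```python
import random
import uuid


def generate_rows(n, seed):
    """Generate n rows from a private RNG so the same seed yields the same rows."""
    rng = random.Random(seed)  # isolated RNG; global random state is untouched
    rows = []
    for _ in range(n):
        rows.append({
            # Build the UUID from seeded bits; uuid.uuid4() would ignore the seed.
            "id": str(uuid.UUID(int=rng.getrandbits(128), version=4)),
            "price": round(rng.uniform(1.0, 100.0), 2),
            "quantity": rng.randint(1, 10),
        })
    return rows


first = generate_rows(3, seed=42)
second = generate_rows(3, seed=42)
print(first == second)  # same seed, same rows
```

The key design point is that every random value, including identifiers, comes from the seeded generator; any call that reads system entropy would break reproducibility.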

🤖 AI Schema Generation

Generate a dataset schema directly from a natural-language prompt and save it to a file:

datacrafter ai --prompt "
Generate a banking transactions dataset with:

transaction_id (uuid),
account_number (integer between 100000000000 and 999999999999),
transaction_type (categorical: Debit, Credit, Withdrawal, Deposit, Transfer),
amount (float between 50 and 50000),
currency (categorical: USD, EUR, GBP, INR),
timestamp (datetime between 2023-01-01 and 2025-12-31 in '%Y-%m-%d %H:%M:%S'),
merchant_name (categorical: Amazon, Walmart, Starbucks, Uber, Apple Store, Shell Fuel Station, Best Buy, ATM Withdrawal, Bank Transfer).

Output format: xml.
Return ONLY valid Datacrafter YAML schema.
" --out examples/banking_transactions.yaml

✔ The AI-generated schema is saved to the file given by --out.


🧮 Formula Engine

Create dynamic fields using expressions:

total_price:
  type: formula
  expr: "price * quantity"

Supports:

  • Arithmetic operations
  • Comparisons and boolean logic
  • Ternary expressions
  • String concatenation
  • Cross-field access
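
The capabilities listed above can be pictured with a small expression evaluator. This is a sketch only, not Datacrafter's formula engine: it assumes Python expression syntax (including `a if cond else b` for ternaries), and `evaluate` is a hypothetical helper. It walks the parsed syntax tree and allows only a fixed set of node types, so arbitrary code in an expression is rejected.

```python
import ast
import operator

_BIN_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Mod: operator.mod,
}
_CMP_OPS = {
    ast.Eq: operator.eq, ast.NotEq: operator.ne,
    ast.Lt: operator.lt, ast.LtE: operator.le,
    ast.Gt: operator.gt, ast.GtE: operator.ge,
}


def evaluate(expr, row):
    """Evaluate a restricted expression against one row (a dict of field values)."""
    def ev(node):
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.Name):
            return row[node.id]  # cross-field access by field name
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            # Arithmetic; "+" also concatenates strings.
            return _BIN_OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -ev(node.operand)
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.Not):
            return not ev(node.operand)
        if (isinstance(node, ast.Compare) and len(node.ops) == 1
                and type(node.ops[0]) in _CMP_OPS):
            return _CMP_OPS[type(node.ops[0])](ev(node.left), ev(node.comparators[0]))
        if isinstance(node, ast.BoolOp):
            # Simplified boolean logic: always returns True or False.
            values = [ev(v) for v in node.values]
            return all(values) if isinstance(node.op, ast.And) else any(values)
        if isinstance(node, ast.IfExp):
            return ev(node.body) if ev(node.test) else ev(node.orelse)
        raise ValueError(f"unsupported expression element: {type(node).__name__}")

    return ev(ast.parse(expr, mode="eval").body)


row = {"price": 2.5, "quantity": 4, "first": "Ada", "last": "Lovelace"}
print(evaluate("price * quantity", row))
print(evaluate("'bulk' if quantity >= 4 and price < 5 else 'single'", row))
print(evaluate("first + ' ' + last", row))
```

Anything outside the allow-list, such as a function call, raises `ValueError` instead of being executed.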

📦 Installation

pip install datacrafter-ai

Requirements: Python 3.9+


🚀 Quickstart

1. Create Schema

version: 1
rows: 10

fields:
  price:
    type: float

  quantity:
    type: integer

  total:
    type: formula
    expr: "price * quantity"

output:
  format: csv
  path: ./output/data.csv

2. Generate Data

datacrafter generate --schema schema.yaml
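
After generating, it can be useful to confirm that the formula column matches its inputs. The sketch below is an independent check written in plain Python, not part of Datacrafter. It assumes the CSV has a header row with the field names from the schema above (`price`, `quantity`, `total`); `check_totals` is a hypothetical helper.

```python
import csv
import io
import math


def check_totals(csv_text):
    """Return the 0-based data-row indices where total != price * quantity."""
    bad = []
    for i, rec in enumerate(csv.DictReader(io.StringIO(csv_text))):
        expected = float(rec["price"]) * float(rec["quantity"])
        # Compare with a tolerance to allow for float formatting in the file.
        if not math.isclose(float(rec["total"]), expected, rel_tol=1e-9, abs_tol=1e-9):
            bad.append(i)
    return bad


sample = "price,quantity,total\n2.5,4,10.0\n3.0,2,7.0\n"
print(check_totals(sample))  # → [1]
```

To check a generated file, read it first, for example `check_totals(open("./output/data.csv").read())`.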

🧩 Built-in Capabilities

Providers

  • uuid, id.incremental
  • integer, float, boolean
  • person.*, text.*, string.regex
  • datetime, categorical, geo.country
  • formula

Features

  • Unique constraints
  • Null handling
  • Regex validation
  • Distributions
  • Templating & dependencies
  • Cross-field computation

🖥️ CLI Commands

datacrafter generate --schema schema.yaml
datacrafter validate --schema schema.yaml
datacrafter list providers
datacrafter list writers
datacrafter init --template minimal
datacrafter ai --prompt "..." --out schema.yaml

🔌 Extensibility

Datacrafter supports plugins for:

  • Custom providers
  • Custom writers

No core modification required.
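
This page does not document Datacrafter's plugin interface, so the following is only a generic sketch of how a registry lets new providers be added without touching core code. Every name here (`PROVIDERS`, `register_provider`, `generate_value`, `color.hex`) is hypothetical and not taken from the library.

```python
import random

PROVIDERS = {}  # hypothetical registry: provider name -> callable(rng) -> value


def register_provider(name):
    """Decorator that adds a provider function to the registry under `name`."""
    def decorator(func):
        if name in PROVIDERS:
            raise ValueError(f"provider already registered: {name}")
        PROVIDERS[name] = func
        return func
    return decorator


@register_provider("color.hex")
def hex_color(rng):
    """Custom provider: a random 24-bit color such as '#1a2b3c'."""
    return f"#{rng.randrange(0x1000000):06x}"


def generate_value(provider_name, rng):
    """Core lookup: knows nothing about individual providers."""
    try:
        provider = PROVIDERS[provider_name]
    except KeyError:
        raise KeyError(f"unknown provider: {provider_name}") from None
    return provider(rng)


print(generate_value("color.hex", random.Random(0)))
```

The core function only looks names up in the registry, which is what makes new providers possible without core changes. Run `datacrafter list providers` to see what the installed package actually exposes.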


⚙️ AI Configuration

Datacrafter’s AI features support multiple LLM providers and require API credentials.

1. Create a .env file

Copy the example configuration:

cp .env.example .env

2. Configure your provider and model

Edit .env and choose one provider:

# Choose one provider: openrouter / openai / gemini / groq / deepseek
LLM_PROVIDER=openai

# Choose the model supported by the provider
LLM_MODEL=gpt-4

3. Add the corresponding API key

Provide ONLY the API key for your selected provider:

OPENAI_API_KEY=your_api_key_here

Examples for other providers:

OPENROUTER_API_KEY=your_key
GEMINI_API_KEY=your_key
GROQ_API_KEY=your_key
DEEPSEEK_API_KEY=your_key

4. Run AI schema generation

datacrafter ai --prompt "..." --out schema.yaml

⚠️ Important:

  • AI features will NOT work without valid API credentials
  • Only one provider needs to be configured
  • Ensure the selected model is supported by the chosen provider
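
The checklist above can be verified before running the `ai` command. The sketch below is a standalone pre-flight check, not a Datacrafter function. It uses the variable names shown in the steps above and assumes the values from `.env` are already present in the process environment (exported in the shell or loaded by a dotenv tool). `check_ai_config` is a hypothetical helper.

```python
import os

# Provider name -> API key variable, as listed in the configuration steps.
API_KEY_VARS = {
    "openrouter": "OPENROUTER_API_KEY",
    "openai": "OPENAI_API_KEY",
    "gemini": "GEMINI_API_KEY",
    "groq": "GROQ_API_KEY",
    "deepseek": "DEEPSEEK_API_KEY",
}


def check_ai_config(env=None):
    """Return a list of problems found in the AI settings; empty means OK."""
    env = os.environ if env is None else env
    problems = []
    provider = env.get("LLM_PROVIDER", "").strip().lower()
    if provider not in API_KEY_VARS:
        problems.append(f"LLM_PROVIDER must be one of: {', '.join(sorted(API_KEY_VARS))}")
        return problems  # cannot check the key without a known provider
    if not env.get("LLM_MODEL", "").strip():
        problems.append("LLM_MODEL is not set")
    key_var = API_KEY_VARS[provider]
    if not env.get(key_var, "").strip():
        problems.append(f"{key_var} is not set for provider {provider!r}")
    return problems


for problem in check_ai_config():
    print("config problem:", problem)
```

This only checks that the variables are present. It cannot tell whether the key is valid or whether the provider supports the chosen model.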

📦 Development

python -m build
twine check dist/*
twine upload dist/*

🔒 License

MIT © 2026 Mahalakshmi Shanmuga Sundaram


🏢 About

Datacrafter is developed and maintained by DHS Tech Services.
