Krita (कृत): Create synthetic datasets using LLMs from schemas

Krita

Generate synthetic datasets from schemas using LLMs, and upload them to Hugging Face.

Quick Start

Install and run from the command line:

pip install krita
krita generate schema.yaml --output dataset.json

Or use the Python API:

from krita import SyntheticDataGenerator, DataSchema, FieldType, HuggingFaceUploader

schema = DataSchema(
    name="reviews",
    num_samples=100,
    fields=[
        {"name": "product", "type": FieldType.TITLE, "required": True},
        {"name": "rating", "type": FieldType.NUMBER, "constraints": {"min": 1, "max": 5}},
        {"name": "review", "type": FieldType.REVIEW, "required": True}
    ]
)

# Generate data
generator = SyntheticDataGenerator(llm_provider="openai")
data = generator.generate(schema)

# Upload to Hugging Face
uploader = HuggingFaceUploader()
uploader.upload_dataset(data, "username/product-reviews")
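
The return type of `generator.generate(schema)` is not documented here. If it is a list of dictionaries (an assumption), a local copy could be kept before uploading with something like the sketch below. `save_records` is a hypothetical helper built on the standard library, not part of Krita.

```python
import csv
import json
from pathlib import Path


def save_records(records, path):
    """Write a list of dicts to JSON or CSV, chosen by file extension.

    Hypothetical helper; assumes `records` is a list of flat dicts.
    """
    path = Path(path)
    if path.suffix == ".json":
        text = json.dumps(records, indent=2, ensure_ascii=False)
        path.write_text(text, encoding="utf-8")
    elif path.suffix == ".csv":
        # Union of keys in first-seen order, so optional fields are kept.
        fieldnames = list(dict.fromkeys(k for row in records for k in row))
        with path.open("w", newline="", encoding="utf-8") as fh:
            writer = csv.DictWriter(fh, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(records)  # missing keys are written as empty cells
    else:
        raise ValueError(f"unsupported extension: {path.suffix!r}")
    return path
```

Calling `save_records(data, "reviews.json")` before `upload_dataset` would leave a copy on disk that can be inspected by hand.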

Features

  • Schema-driven: Define data structure with types, constraints, examples
  • Multiple LLMs: OpenAI, Anthropic, custom OpenAI-compatible endpoints
  • Custom endpoints: Ollama, vLLM, enterprise deployments
  • Validation: Ensures data matches schema
  • Hugging Face: Direct upload with metadata
  • Multiple formats: JSON, CSV, Parquet output
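
To make the validation idea concrete, here is a minimal sketch of checking one record against `required` and `min`/`max` constraints. It is not Krita's validator; `validate_record` is a hypothetical function that only reuses the field keys shown in this README.

```python
def validate_record(record, fields):
    """Return a list of problems found in one record (empty if it passes).

    `fields` uses the keys from the schema examples: name, required,
    constraints (min/max). Illustration only, not Krita's own logic.
    """
    problems = []
    for field in fields:
        name = field["name"]
        value = record.get(name)
        if value is None:
            if field.get("required", False):
                problems.append(f"{name}: missing required field")
            continue  # nothing to range-check on an absent value
        constraints = field.get("constraints", {})
        if "min" in constraints and value < constraints["min"]:
            problems.append(f"{name}: {value} is below min {constraints['min']}")
        if "max" in constraints and value > constraints["max"]:
            problems.append(f"{name}: {value} is above max {constraints['max']}")
    return problems


fields = [
    {"name": "product", "required": True},
    {"name": "rating", "constraints": {"min": 1, "max": 5}},
]
print(validate_record({"product": "Lamp", "rating": 7}, fields))
# ['rating: 7 is above max 5']
```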

Custom Endpoints

Use any OpenAI-compatible API:

generator = SyntheticDataGenerator(
    llm_provider="openai",
    base_url="https://your-api.com/v1",  # Your endpoint
    llm_model="your-model",
    api_key="your-key"
)

Examples:

  • Ollama: base_url="http://localhost:11434/v1"
  • vLLM: base_url="https://your-vllm.com/v1"
  • Enterprise: base_url="https://internal-ai.company.com/v1"
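
When switching between these endpoints often, the constructor arguments can be built from a small lookup table. The sketch below uses only the parameters shown above (`llm_provider`, `base_url`, `llm_model`, `api_key`); the preset names and the `endpoint_kwargs` helper are made up for illustration.

```python
# Base URLs copied from the examples above; the preset names are hypothetical.
PRESETS = {
    "ollama": "http://localhost:11434/v1",
    "vllm": "https://your-vllm.com/v1",
    "enterprise": "https://internal-ai.company.com/v1",
}


def endpoint_kwargs(preset, model, api_key="your-key"):
    """Build keyword arguments for SyntheticDataGenerator from a preset name."""
    if preset not in PRESETS:
        raise KeyError(f"unknown preset {preset!r}; choose from {sorted(PRESETS)}")
    return {
        "llm_provider": "openai",  # the provider value used above for custom endpoints
        "base_url": PRESETS[preset],
        "llm_model": model,
        "api_key": api_key,
    }


kwargs = endpoint_kwargs("ollama", model="your-model")
# generator = SyntheticDataGenerator(**kwargs)  # requires krita to be installed
```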

Schema Format

name: "user_profiles"
description: "User profile data"
num_samples: 500
fields:
  - name: "name"
    type: "name"
    required: true
  - name: "email"
    type: "email"
    required: true
  - name: "age"
    type: "number"
    constraints: {min: 18, max: 80}

Field Types

Built-in: text, name, email, phone, address, date, number, boolean, uuid, category, url, json, title, description, review

Custom: Define domain-specific types:

fields:
  - name: "diagnosis"
    type: "icd_code"  # Custom type
    custom_type_definition: "ICD-10 diagnosis with code and description"
    examples: ["E11.9 - Type 2 diabetes mellitus"]
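
A schema can be sanity-checked before spending LLM calls on it. The sketch below works on an already-parsed schema (a plain dict) and flags fields whose type is neither built-in nor accompanied by a `custom_type_definition`. Two assumptions are baked in and may not match Krita's actual rules: that `name`, `num_samples` and `fields` are mandatory, and that every custom type needs a definition. `check_schema` is a hypothetical helper.

```python
# Built-in type names as listed above.
BUILT_IN_TYPES = {
    "text", "name", "email", "phone", "address", "date", "number", "boolean",
    "uuid", "category", "url", "json", "title", "description", "review",
}


def check_schema(schema):
    """Return a list of problems in a parsed schema dict (empty if none found)."""
    problems = []
    for key in ("name", "num_samples", "fields"):  # assumed to be mandatory
        if key not in schema:
            problems.append(f"schema: missing '{key}'")
    for i, field in enumerate(schema.get("fields", [])):
        label = field.get("name", f"fields[{i}]")
        ftype = field.get("type")
        if ftype is None:
            problems.append(f"{label}: missing 'type'")
        elif ftype not in BUILT_IN_TYPES and "custom_type_definition" not in field:
            problems.append(
                f"{label}: type '{ftype}' is not built-in and has no custom_type_definition"
            )
    return problems


schema = {
    "name": "diagnoses",
    "num_samples": 10,
    "fields": [
        {"name": "patient", "type": "name"},
        {"name": "diagnosis", "type": "icd_code"},  # custom type without a definition
    ],
}
print(check_schema(schema))
# ["diagnosis: type 'icd_code' is not built-in and has no custom_type_definition"]
```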

CLI

krita init-schema schema.yaml        # Create template
krita generate schema.yaml           # Generate data
krita upload data.json user/dataset  # Upload to HF
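
The generate and upload steps can also be chained from a Python script. This sketch shells out to the `krita` commands shown in this README and uses no flags beyond `--output`; `pipeline_commands` and `run_pipeline` are hypothetical names, and the sketch assumes the CLI exits with a non-zero status on failure.

```python
import subprocess


def pipeline_commands(schema_path, output_path, repo_id):
    """Return the generate and upload commands as argument lists."""
    return [
        ["krita", "generate", schema_path, "--output", output_path],
        ["krita", "upload", output_path, repo_id],
    ]


def run_pipeline(schema_path, output_path, repo_id):
    """Run each command in order, stopping at the first failure."""
    for cmd in pipeline_commands(schema_path, output_path, repo_id):
        # check=True raises CalledProcessError on a non-zero exit status
        subprocess.run(cmd, check=True)
```

For example, `run_pipeline("schema.yaml", "dataset.json", "user/dataset")` would generate the data and upload it only if generation succeeded.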

Configuration

export OPENAI_API_KEY="your-key"
export ANTHROPIC_API_KEY="your-key"
export HF_TOKEN="your-token"
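
A script can fail early when a variable is missing instead of failing partway through a run. The sketch below is an illustration built on `os.environ`. The mapping from provider name to variable is an assumption, and the provider string "anthropic" in particular does not appear elsewhere in this README.

```python
import os

# Assumed mapping from provider name to the variable it needs.
REQUIRED_VARS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",  # provider string is an assumption
}


def missing_env_vars(provider, upload=False, env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    needed = [REQUIRED_VARS[provider]]
    if upload:
        needed.append("HF_TOKEN")  # only needed for the Hugging Face upload
    return [name for name in needed if not env.get(name)]


missing = missing_env_vars("openai", upload=True, env={"OPENAI_API_KEY": "your-key"})
print(missing)
# ['HF_TOKEN']
```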

License

MIT

Download files

Source Distribution

krita-0.1.6.tar.gz (16.3 kB)

Uploaded Source

Built Distribution

krita-0.1.6-py3-none-any.whl (13.6 kB)

Uploaded Python 3

File details

Details for the file krita-0.1.6.tar.gz.

File metadata

  • Download URL: krita-0.1.6.tar.gz
  • Upload date:
  • Size: 16.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for krita-0.1.6.tar.gz
Algorithm Hash digest
SHA256 a08fc9867249b98f5088b459bf8220f0f114425ebc45d10f5f4ccb440c8b7049
MD5 340b55a464f18c31ec981ff3faa3b2da
BLAKE2b-256 27ffa882cdf3c0057db02f4cc731104c4c712445068bda2ded357d063341c82a

File details

Details for the file krita-0.1.6-py3-none-any.whl.

File metadata

  • Download URL: krita-0.1.6-py3-none-any.whl
  • Upload date:
  • Size: 13.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for krita-0.1.6-py3-none-any.whl
Algorithm Hash digest
SHA256 843854bffe1d8cf5ce59cb6df21870c5ce93bed8ca2954a48c1665519ead9e86
MD5 a53da1d5b08d8c87180e179672d73aff
BLAKE2b-256 03a54abd739522cc53862329fac8d8fdcd74ce52869d72d1823ed0d4d291de2b
