Skip to main content

Krita (कृत): Create synthetic datasets using LLMs from schemas

Project description

Krita

Generate synthetic datasets using LLMs from schemas. Upload to Hugging Face.

Quick Start

pip install krita
krita generate schema.yaml --output dataset.json
from krita import SyntheticDataGenerator, DataSchema, FieldType

schema = DataSchema(
    name="reviews",
    num_samples=100,
    fields=[
        {"name": "product", "type": FieldType.TITLE, "required": True},
        {"name": "rating", "type": FieldType.NUMBER, "constraints": {"min": 1, "max": 5}},
        {"name": "review", "type": FieldType.REVIEW, "required": True}
    ]
)

generator = SyntheticDataGenerator(llm_provider="openai")
data = generator.generate(schema)

Features

  • Schema-driven: Define data structure with types, constraints, examples
  • Multiple LLMs: OpenAI, Anthropic, custom OpenAI-compatible endpoints
  • Custom endpoints: Ollama, vLLM, enterprise deployments
  • Validation: Ensures data matches schema
  • Hugging Face: Direct upload with metadata
  • Multiple formats: JSON, CSV, Parquet output

Custom Endpoints

Use any OpenAI-compatible API:

generator = SyntheticDataGenerator(
    llm_provider="openai",
    base_url="https://your-api.com/v1",  # Your endpoint
    llm_model="your-model",
    api_key="your-key"
)

Examples:

  • Ollama: base_url="http://localhost:11434/v1"
  • vLLM: base_url="https://your-vllm.com/v1"
  • Enterprise: base_url="https://internal-ai.company.com/v1"

Schema Format

name: "user_profiles"
description: "User profile data"
num_samples: 500
fields:
  - name: "name"
    type: "name"
    required: true
  - name: "email"
    type: "email"
    required: true
  - name: "age"
    type: "number"
    constraints: {min: 18, max: 80}

Field Types

Built-in: text, name, email, phone, address, date, number, boolean, uuid, category, url, json, title, description, review

Custom: Define domain-specific types:

fields:
  - name: "diagnosis"
    type: "icd_code"  # Custom type
    custom_type_definition: "ICD-10 diagnosis with code and description"
    examples: ["E11.9 - Type 2 diabetes mellitus"]

CLI

krita init-schema schema.yaml        # Create template
krita generate schema.yaml           # Generate data
krita upload data.json user/dataset  # Upload to HF

Configuration

export OPENAI_API_KEY="your-key"
export ANTHROPIC_API_KEY="your-key"
export HF_TOKEN="your-token"

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

krita-0.1.5.tar.gz (16.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

krita-0.1.5-py3-none-any.whl (13.5 kB view details)

Uploaded Python 3

File details

Details for the file krita-0.1.5.tar.gz.

File metadata

  • Download URL: krita-0.1.5.tar.gz
  • Upload date:
  • Size: 16.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for krita-0.1.5.tar.gz
Algorithm Hash digest
SHA256 e619a522d37cdbf87a5425d6c6974119f66e2cffb0525a66e68e9018118adf29
MD5 27a5a19c8d3a2e54c298756a3222d047
BLAKE2b-256 e88d32bf7aa240e88c755b6f1225bae7964db4968bc649314d10e3d68414057e

See more details on using hashes here.

File details

Details for the file krita-0.1.5-py3-none-any.whl.

File metadata

  • Download URL: krita-0.1.5-py3-none-any.whl
  • Upload date:
  • Size: 13.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for krita-0.1.5-py3-none-any.whl
Algorithm Hash digest
SHA256 85398611e4eb8667156587a198e6d7cff6feccb5d294c8305c2a3e8a2faa94ad
MD5 2fa8d11b0108184f2797764f52f258d1
BLAKE2b-256 c7e9b5ec658c7fa1ac75bd4a19405193adc799ee5a041d59b9871403c5e8d9f8

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page