Skip to main content

Krita (कृत): Create synthetic datasets using LLMs from schemas

Project description

Krita (कृत)

Sanskrit: "made, created, formed" - the root of "Sanskrit" itself

Generate synthetic datasets using LLMs from schemas and upload to Hugging Face.

Quick Start

pip install krita

# Create a schema
krita init-schema schema.yaml

# Generate data
krita generate schema.yaml --output dataset.json

# Upload to Hugging Face
krita upload dataset.json username/my-dataset

Features

  • Schema-driven generation: Define your data structure with field types, constraints, and examples
  • Multiple LLM providers: OpenAI GPT, Anthropic Claude, and custom OpenAI-compatible endpoints
  • Custom endpoint support: Use any OpenAI-compatible API endpoint
  • Automatic validation: Ensures generated data matches your schema
  • Hugging Face integration: Direct upload to Hugging Face Hub with metadata
  • Multiple formats: JSON, JSONL, CSV, Parquet output
  • CLI and Python API: Use from command line or integrate into your code

Installation

pip install krita

Python API

from krita import SyntheticDataGenerator, DataSchema, FieldType

# Define schema
schema = DataSchema(
    name="customer_reviews",
    description="Product reviews dataset",
    num_samples=1000,
    fields=[
        {"name": "product", "type": FieldType.TITLE, "required": True},
        {"name": "rating", "type": FieldType.NUMBER, "constraints": {"min": 1, "max": 5}},
        {"name": "review", "type": FieldType.REVIEW, "required": True},
        {"name": "reviewer", "type": FieldType.NAME, "required": True}
    ]
)

# Generate data
generator = SyntheticDataGenerator(llm_provider="openai")
data = generator.generate(schema)

# Upload to Hugging Face
from krita import HuggingFaceUploader
uploader = HuggingFaceUploader()
uploader.upload_dataset(data, "username/customer-reviews")

Custom AI Endpoints

Use any OpenAI-compatible endpoint (Ollama, vLLM, custom deployments):

from krita import SyntheticDataGenerator

# Use your custom endpoint directly
generator = SyntheticDataGenerator(
    llm_provider="openai",  # Use OpenAI-compatible interface
    base_url="https://your-api.com/v1",  # Your custom endpoint
    llm_model="your-model-name",
    api_key="your-api-key"  # Optional, if required
)

data = generator.generate(schema)

Using Custom Types

from krita import SyntheticDataGenerator, DataSchema, FieldSchema, FieldType

# Define schema with custom types
schema = DataSchema(
    name="healthcare_records",
    description="Patient healthcare records",
    num_samples=50,
    fields=[
        FieldSchema(name="patient_id", type=FieldType.UUID, required=True),
        FieldSchema(name="name", type=FieldType.NAME, required=True),
        FieldSchema(
            name="diagnosis",
            type="icd_diagnosis",  # Custom type
            description="Primary diagnosis",
            custom_type_definition="ICD-10 diagnosis with code and description",
            examples=["E11.9 - Type 2 diabetes mellitus"],
            required=True
        ),
        FieldSchema(
            name="medication",
            type=FieldType.CUSTOM,  # Using CUSTOM enum
            description="Current medication",
            custom_type_definition="Medication name, dosage, and frequency",
            examples=["Metformin 500mg twice daily"],
            required=False
        )
    ]
)

# Generate data with custom types
generator = SyntheticDataGenerator(llm_provider="openai")
data = generator.generate(schema)

Schema Format

name: "user_profiles"
description: "User profile data"
num_samples: 500
context: "Generate diverse, realistic user profiles"
fields:
  - name: "id"
    type: "uuid"
    required: true
  - name: "name"
    type: "name"
    required: true
    examples: ["John Doe", "Jane Smith"]
  - name: "email"
    type: "email"
    required: true
  - name: "age"
    type: "number"
    constraints:
      min: 18
      max: 80
  - name: "bio"
    type: "description"
    required: false

Supported Field Types

Built-in Types

  • text, name, email, phone, address
  • date, number, boolean, uuid
  • category, url, json
  • title, description, review

Custom Types

Define your own field types for specialized domains:

fields:
  - name: "medical_diagnosis"
    type: "icd_diagnosis"  # Custom type name
    description: "Medical diagnosis"
    custom_type_definition: "ICD-10 diagnosis code with description (e.g., 'E11.9 - Type 2 diabetes')"
    examples:
      - "I10 - Essential hypertension"
      - "E78.5 - Hyperlipidemia"

  - name: "certification"
    type: "custom"  # Use 'custom' enum value
    description: "Professional certification"
    custom_type_definition: "Professional certification with issuing body and expiration date"
    examples:
      - "AWS Solutions Architect - Valid until 2025-12-31"

CLI Commands

# Initialize schema
krita init-schema schema.yaml

# Generate data
krita generate schema.yaml --provider openai --output data.json

# Upload to Hugging Face
krita upload data.json username/dataset-name --description "My dataset"

# List providers
krita list-providers

Configuration

Set environment variables:

export OPENAI_API_KEY="your-key"
export ANTHROPIC_API_KEY="your-key"
export HF_TOKEN="your-token"

Custom Endpoint Examples

Ollama (local):

generator = SyntheticDataGenerator(
    llm_provider="openai",
    base_url="http://localhost:11434/v1",
    llm_model="llama3.1"
)

vLLM deployment:

generator = SyntheticDataGenerator(
    llm_provider="openai",
    base_url="https://your-vllm-server.com/v1",
    llm_model="meta-llama/Llama-3.1-8B-Instruct",
    api_key="your-api-key"
)

Internal enterprise endpoint:

generator = SyntheticDataGenerator(
    llm_provider="openai",
    base_url="https://internal-ai.company.com/v1",
    llm_model="company-model-v1",
    api_key="your-api-key"
)

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

krita-0.1.3.tar.gz (17.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

krita-0.1.3-py3-none-any.whl (14.4 kB view details)

Uploaded Python 3

File details

Details for the file krita-0.1.3.tar.gz.

File metadata

  • Download URL: krita-0.1.3.tar.gz
  • Upload date:
  • Size: 17.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for krita-0.1.3.tar.gz
Algorithm Hash digest
SHA256 1bdb0b398f93ae16647ff4736015d56cfa27629d106cd6adb8d53a72292b7761
MD5 d305a3c1f8e9c15567b2af9acafee957
BLAKE2b-256 34b35d8084d24f674e3589c4a56b9d65c39277ff3e3458c58dbc87aebbd73644

See more details on using hashes here.

File details

Details for the file krita-0.1.3-py3-none-any.whl.

File metadata

  • Download URL: krita-0.1.3-py3-none-any.whl
  • Upload date:
  • Size: 14.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for krita-0.1.3-py3-none-any.whl
Algorithm Hash digest
SHA256 d4dcadb0c225248167c11eeb399a8ecc6a52741fa39597d21e03a0eaf9fb511c
MD5 be70ecba6951d2a51ac26b1f2ab06f32
BLAKE2b-256 4b5df4f483f398dc6aaff8b11d7bebac6c62562be27e09e190bf9eaf5d264fbf

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page