Krita (कृत): Create synthetic datasets using LLMs from schemas

Sanskrit: "made, created, formed" - the root of "Sanskrit" itself

Generate synthetic datasets from schemas using LLMs, then upload them to Hugging Face.

Quick Start

pip install krita

# Create a schema
krita init-schema schema.yaml

# Generate data
krita generate schema.yaml --output dataset.json

# Upload to Hugging Face
krita upload dataset.json username/my-dataset

Features

  • Schema-driven generation: Define your data structure with field types, constraints, and examples
  • Multiple LLM providers: OpenAI GPT and Anthropic Claude
  • Custom endpoint support: Point the OpenAI-compatible interface at any compatible API endpoint (Ollama, vLLM, custom deployments)
  • Automatic validation: Ensures generated data matches your schema
  • Hugging Face integration: Direct upload to Hugging Face Hub with metadata
  • Multiple formats: JSON, JSONL, CSV, Parquet output
  • CLI and Python API: Use from command line or integrate into your code

Installation

pip install krita

Python API

Note: The package installs as krita but is imported as synthetica.

from synthetica import SyntheticDataGenerator, DataSchema, FieldType

# Define schema
schema = DataSchema(
    name="customer_reviews",
    description="Product reviews dataset",
    num_samples=1000,
    fields=[
        {"name": "product", "type": FieldType.TITLE, "required": True},
        {"name": "rating", "type": FieldType.NUMBER, "constraints": {"min": 1, "max": 5}},
        {"name": "review", "type": FieldType.REVIEW, "required": True},
        {"name": "reviewer", "type": FieldType.NAME, "required": True}
    ]
)

# Generate data
generator = SyntheticDataGenerator(llm_provider="openai")
data = generator.generate(schema)

# Upload to Hugging Face
from synthetica import HuggingFaceUploader
uploader = HuggingFaceUploader()
uploader.upload_dataset(data, "username/customer-reviews")

Custom AI Endpoints

Use any OpenAI-compatible endpoint (Ollama, vLLM, custom deployments):

from synthetica import SyntheticDataGenerator

# Use your custom endpoint directly
generator = SyntheticDataGenerator(
    llm_provider="openai",  # Use OpenAI-compatible interface
    base_url="https://your-api.com/v1",  # Your custom endpoint
    llm_model="your-model-name",
    api_key="your-api-key"  # Only needed if the endpoint requires a key
)

# `schema` is a DataSchema, e.g. the one defined in the Python API example above
data = generator.generate(schema)

Using Custom Types

from synthetica import SyntheticDataGenerator, DataSchema, FieldSchema, FieldType

# Define schema with custom types
schema = DataSchema(
    name="healthcare_records",
    description="Patient healthcare records",
    num_samples=50,
    fields=[
        FieldSchema(name="patient_id", type=FieldType.UUID, required=True),
        FieldSchema(name="name", type=FieldType.NAME, required=True),
        FieldSchema(
            name="diagnosis",
            type="icd_diagnosis",  # Custom type
            description="Primary diagnosis",
            custom_type_definition="ICD-10 diagnosis with code and description",
            examples=["E11.9 - Type 2 diabetes mellitus"],
            required=True
        ),
        FieldSchema(
            name="medication",
            type=FieldType.CUSTOM,  # Using CUSTOM enum
            description="Current medication",
            custom_type_definition="Medication name, dosage, and frequency",
            examples=["Metformin 500mg twice daily"],
            required=False
        )
    ]
)

# Generate data with custom types
generator = SyntheticDataGenerator(llm_provider="openai")
data = generator.generate(schema)

Schema Format

name: "user_profiles"
description: "User profile data"
num_samples: 500
context: "Generate diverse, realistic user profiles"
fields:
  - name: "id"
    type: "uuid"
    required: true
  - name: "name"
    type: "name"
    required: true
    examples: ["John Doe", "Jane Smith"]
  - name: "email"
    type: "email"
    required: true
  - name: "age"
    type: "number"
    constraints:
      min: 18
      max: 80
  - name: "bio"
    type: "description"
    required: false

Supported Field Types

Built-in Types

  • text, name, email, phone, address
  • date, number, boolean, uuid
  • category, url, json
  • title, description, review

Custom Types

Define your own field types for specialized domains:

fields:
  - name: "medical_diagnosis"
    type: "icd_diagnosis"  # Custom type name
    description: "Medical diagnosis"
    custom_type_definition: "ICD-10 diagnosis code with description (e.g., 'E11.9 - Type 2 diabetes')"
    examples:
      - "I10 - Essential hypertension"
      - "E78.5 - Hyperlipidemia"

  - name: "certification"
    type: "custom"  # Use 'custom' enum value
    description: "Professional certification"
    custom_type_definition: "Professional certification with issuing body and expiration date"
    examples:
      - "AWS Solutions Architect - Valid until 2025-12-31"
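Since a custom type is described to the LLM only through `custom_type_definition` and `examples`, it can be useful to lint a schema for custom-typed fields that lack a definition before spending tokens. The sketch below is an illustration under assumptions: `fields_missing_definition` is a hypothetical helper, the built-in set is copied from the list above, and whether krita itself rejects such fields is not stated here.

```python
# Built-in type names as listed under "Built-in Types".
BUILT_IN_TYPES = {
    "text", "name", "email", "phone", "address",
    "date", "number", "boolean", "uuid",
    "category", "url", "json",
    "title", "description", "review",
}


def fields_missing_definition(fields):
    """Names of fields whose type is not built in and that carry no custom_type_definition."""
    return [
        field["name"]
        for field in fields
        if field["type"] not in BUILT_IN_TYPES
        and not field.get("custom_type_definition")
    ]


fields = [
    {"name": "patient_id", "type": "uuid"},
    {
        "name": "medical_diagnosis",
        "type": "icd_diagnosis",
        "custom_type_definition": "ICD-10 diagnosis code with description",
    },
    {"name": "certification", "type": "custom"},  # definition left out on purpose
]
print(fields_missing_definition(fields))  # ['certification']
```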

CLI Commands

# Initialize schema
krita init-schema schema.yaml

# Generate data
krita generate schema.yaml --provider openai --output data.json

# Upload to Hugging Face
krita upload data.json username/dataset-name --description "My dataset"

# List providers
krita list-providers
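The commands above write JSON. If only that JSON file is at hand and a line-oriented or tabular copy is needed, the conversion can be sketched with the standard library. This assumes the output file holds a JSON array of flat objects, which is not confirmed here; `to_jsonl` and `to_csv` are hypothetical helpers, not krita functions.

```python
import csv
import io
import json


def to_jsonl(records):
    """One JSON object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


def to_csv(records):
    """CSV with a header row; columns appear in first-seen order."""
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)  # keys missing from a record become empty cells
    return buffer.getvalue()


# In practice: records = json.load(open("data.json", encoding="utf-8"))
records = [{"name": "Jane", "age": 30}, {"name": "John", "bio": "hi"}]
print(to_jsonl(records), end="")
print(to_csv(records), end="")
```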

Configuration

Set environment variables:

export OPENAI_API_KEY="your-key"
export ANTHROPIC_API_KEY="your-key"
export HF_TOKEN="your-token"
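A quick pre-flight check can catch a missing variable before a long generation run. The sketch below uses the variable names listed above; the provider-to-variable mapping, the provider string "anthropic", and the `missing_vars` helper are assumptions for illustration.

```python
import os

# Assumed mapping from provider name to the variable it reads.
REQUIRED_VARS = {
    "openai": ["OPENAI_API_KEY"],
    "anthropic": ["ANTHROPIC_API_KEY"],
}


def missing_vars(provider, upload=False, env=None):
    """Return the names of expected variables that are unset or empty."""
    env = os.environ if env is None else env
    needed = list(REQUIRED_VARS.get(provider, []))
    if upload:
        needed.append("HF_TOKEN")  # only relevant when uploading to Hugging Face
    return [name for name in needed if not env.get(name)]


# Passing a dict instead of os.environ keeps the example deterministic.
example_env = {"OPENAI_API_KEY": "sk-example"}
print(missing_vars("openai", upload=True, env=example_env))  # ['HF_TOKEN']
```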

Custom Endpoint Examples

Ollama (local):

from synthetica import SyntheticDataGenerator

generator = SyntheticDataGenerator(
    llm_provider="openai",
    base_url="http://localhost:11434/v1",
    llm_model="llama3.1"
)

vLLM deployment:

generator = SyntheticDataGenerator(
    llm_provider="openai",
    base_url="https://your-vllm-server.com/v1",
    llm_model="meta-llama/Llama-3.1-8B-Instruct",
    api_key="your-api-key"
)

Internal enterprise endpoint:

generator = SyntheticDataGenerator(
    llm_provider="openai",
    base_url="https://internal-ai.company.com/v1",
    llm_model="company-model-v1",
    api_key="your-api-key"
)
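Before pointing the generator at a custom endpoint, it can help to confirm that the server responds and to see which model names it offers. The sketch below assumes the endpoint implements the OpenAI-style `GET /models` route under its `/v1` base URL (Ollama and vLLM do in their OpenAI-compatible modes); `models_url` and `list_models` are hypothetical helpers and not part of krita.

```python
import json
import urllib.request


def models_url(base_url):
    """Build the model-listing URL from a base_url such as http://localhost:11434/v1."""
    return base_url.rstrip("/") + "/models"


def list_models(base_url, api_key=None, timeout=10):
    """Return the model ids reported by an OpenAI-compatible endpoint."""
    request = urllib.request.Request(models_url(base_url))
    if api_key:
        request.add_header("Authorization", f"Bearer {api_key}")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return [model["id"] for model in payload.get("data", [])]


print(models_url("http://localhost:11434/v1"))
# With a server running: print(list_models("http://localhost:11434/v1"))
```

A name returned by `list_models` is what would go into `llm_model`.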

License

MIT
