Skip to main content

Krita (कृत): Create synthetic datasets using LLMs from schemas

Project description

Krita (कृत)

Sanskrit: "made, created, formed" - the root of "Sanskrit" itself

Generate synthetic datasets using LLMs from schemas and upload to Hugging Face.

Quick Start

pip install krita

# Create a schema
krita init-schema schema.yaml

# Generate data
krita generate schema.yaml --output dataset.json

# Upload to Hugging Face
krita upload dataset.json username/my-dataset

Features

  • Schema-driven generation: Define your data structure with field types, constraints, and examples
  • Multiple LLM providers: OpenAI GPT, Anthropic Claude, and custom OpenAI-compatible endpoints
  • Custom endpoint support: Use any OpenAI-compatible API endpoint
  • Automatic validation: Ensures generated data matches your schema
  • Hugging Face integration: Direct upload to Hugging Face Hub with metadata
  • Multiple formats: JSON, JSONL, CSV, Parquet output
  • CLI and Python API: Use from command line or integrate into your code

Installation

pip install krita

Python API

Note: Install as krita, import as synthetica

from synthetica import SyntheticDataGenerator, DataSchema, FieldType

# Define schema
schema = DataSchema(
    name="customer_reviews",
    description="Product reviews dataset",
    num_samples=1000,
    fields=[
        {"name": "product", "type": FieldType.TITLE, "required": True},
        {"name": "rating", "type": FieldType.NUMBER, "constraints": {"min": 1, "max": 5}},
        {"name": "review", "type": FieldType.REVIEW, "required": True},
        {"name": "reviewer", "type": FieldType.NAME, "required": True}
    ]
)

# Generate data
generator = SyntheticDataGenerator(llm_provider="openai")
data = generator.generate(schema)

# Upload to Hugging Face
from synthetica import HuggingFaceUploader
uploader = HuggingFaceUploader()
uploader.upload_dataset(data, "username/customer-reviews")

Custom AI Endpoints

Use any OpenAI-compatible endpoint (Ollama, vLLM, custom deployments):

from synthetica.paypal_llm import CustomOpenAIProvider
from synthetica.generator import SyntheticDataGenerator

# Create custom provider
class CustomGenerator(SyntheticDataGenerator):
    def __init__(self, endpoint_url, model_name, **kwargs):
        self.llm = CustomOpenAIProvider(
            endpoint_url=endpoint_url,
            model=model_name,
            api_key=kwargs.get('api_key'),
            verify_ssl=kwargs.get('verify_ssl', True)
        )
        self.batch_size = kwargs.get('batch_size', 10)
        self.max_retries = kwargs.get('max_retries', 3)

# Use your custom endpoint
generator = CustomGenerator(
    endpoint_url="https://your-api.com/v1/chat/completions",
    model_name="your-model-name",
    verify_ssl=False  # For internal endpoints
)

data = generator.generate(schema)

Using Custom Types

from synthetica import SyntheticDataGenerator, DataSchema, FieldSchema, FieldType

# Define schema with custom types
schema = DataSchema(
    name="healthcare_records",
    description="Patient healthcare records",
    num_samples=50,
    fields=[
        FieldSchema(name="patient_id", type=FieldType.UUID, required=True),
        FieldSchema(name="name", type=FieldType.NAME, required=True),
        FieldSchema(
            name="diagnosis",
            type="icd_diagnosis",  # Custom type
            description="Primary diagnosis",
            custom_type_definition="ICD-10 diagnosis with code and description",
            examples=["E11.9 - Type 2 diabetes mellitus"],
            required=True
        ),
        FieldSchema(
            name="medication",
            type=FieldType.CUSTOM,  # Using CUSTOM enum
            description="Current medication",
            custom_type_definition="Medication name, dosage, and frequency",
            examples=["Metformin 500mg twice daily"],
            required=False
        )
    ]
)

# Generate data with custom types
generator = SyntheticDataGenerator(llm_provider="openai")
data = generator.generate(schema)

Schema Format

name: "user_profiles"
description: "User profile data"
num_samples: 500
context: "Generate diverse, realistic user profiles"
fields:
  - name: "id"
    type: "uuid"
    required: true
  - name: "name"
    type: "name"
    required: true
    examples: ["John Doe", "Jane Smith"]
  - name: "email"
    type: "email"
    required: true
  - name: "age"
    type: "number"
    constraints:
      min: 18
      max: 80
  - name: "bio"
    type: "description"
    required: false

Supported Field Types

Built-in Types

  • text, name, email, phone, address
  • date, number, boolean, uuid
  • category, url, json
  • title, description, review

Custom Types

Define your own field types for specialized domains:

fields:
  - name: "medical_diagnosis"
    type: "icd_diagnosis"  # Custom type name
    description: "Medical diagnosis"
    custom_type_definition: "ICD-10 diagnosis code with description (e.g., 'E11.9 - Type 2 diabetes')"
    examples:
      - "I10 - Essential hypertension"
      - "E78.5 - Hyperlipidemia"

  - name: "certification"
    type: "custom"  # Use 'custom' enum value
    description: "Professional certification"
    custom_type_definition: "Professional certification with issuing body and expiration date"
    examples:
      - "AWS Solutions Architect - Valid until 2025-12-31"

CLI Commands

# Initialize schema
krita init-schema schema.yaml

# Generate data
krita generate schema.yaml --provider openai --output data.json

# Upload to Hugging Face
krita upload data.json username/dataset-name --description "My dataset"

# List providers
krita list-providers

Configuration

Set environment variables:

export OPENAI_API_KEY="your-key"
export ANTHROPIC_API_KEY="your-key"
export HF_TOKEN="your-token"

Custom Endpoint Examples

Ollama (local):

generator = CustomGenerator(
    endpoint_url="http://localhost:11434/v1/chat/completions",
    model_name="llama3.1"
)

vLLM deployment:

generator = CustomGenerator(
    endpoint_url="https://your-vllm-server.com/v1/chat/completions",
    model_name="meta-llama/Llama-3.1-8B-Instruct",
    api_key="your-api-key"
)

Internal enterprise endpoint:

generator = CustomGenerator(
    endpoint_url="https://internal-ai.company.com/v1/chat/completions",
    model_name="company-model-v1",
    verify_ssl=False
)

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

krita-0.1.1.tar.gz (16.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

krita-0.1.1-py3-none-any.whl (15.9 kB view details)

Uploaded Python 3

File details

Details for the file krita-0.1.1.tar.gz.

File metadata

  • Download URL: krita-0.1.1.tar.gz
  • Upload date:
  • Size: 16.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for krita-0.1.1.tar.gz
Algorithm Hash digest
SHA256 d92590107b41227f9d9a981645baf6fbc71bfaf2d1bcf7ab621b927db1e8d59a
MD5 01928a6e599bf698ab3a7788f5537453
BLAKE2b-256 bdfbd28ee68bdeac87d5c61b2bf992eb2a2d39921f228ec0187f0e943ff3c68b

See more details on using hashes here.

File details

Details for the file krita-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: krita-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 15.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for krita-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 521efa540bbfe260b896267af4259447c6b1faadd846036e8d38d9cbaeae95b7
MD5 668a754c15924ccff2b48aab0cbfef83
BLAKE2b-256 9331c77731a76219c65dcb58b512e6c5e2d274483571646a27db3c0129b0bcb1

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page