Skip to main content

Krita (कृत): Create synthetic datasets using LLMs from schemas

Project description

Krita (कृत)

Sanskrit: "made, created, formed" - the root of "Sanskrit" itself

Generate synthetic datasets using LLMs from schemas and upload to Hugging Face.

Quick Start

pip install krita

# Create a schema
krita init-schema schema.yaml

# Generate data
krita generate schema.yaml --output dataset.json

# Upload to Hugging Face
krita upload dataset.json username/my-dataset

Features

  • Schema-driven generation: Define your data structure with field types, constraints, and examples
  • Multiple LLM providers: OpenAI GPT, Anthropic Claude, and custom OpenAI-compatible endpoints
  • Custom endpoint support: Use any OpenAI-compatible API endpoint
  • Automatic validation: Ensures generated data matches your schema
  • Hugging Face integration: Direct upload to Hugging Face Hub with metadata
  • Multiple formats: JSON, JSONL, CSV, Parquet output
  • CLI and Python API: Use from command line or integrate into your code

Installation

pip install synthetica

Python API

from synthetica import SyntheticDataGenerator, DataSchema, FieldType

# Define schema
schema = DataSchema(
    name="customer_reviews",
    description="Product reviews dataset",
    num_samples=1000,
    fields=[
        {"name": "product", "type": FieldType.TITLE, "required": True},
        {"name": "rating", "type": FieldType.NUMBER, "constraints": {"min": 1, "max": 5}},
        {"name": "review", "type": FieldType.REVIEW, "required": True},
        {"name": "reviewer", "type": FieldType.NAME, "required": True}
    ]
)

# Generate data
generator = SyntheticDataGenerator(llm_provider="openai")
data = generator.generate(schema)

# Upload to Hugging Face
from synthetica import HuggingFaceUploader
uploader = HuggingFaceUploader()
uploader.upload_dataset(data, "username/customer-reviews")

Custom AI Endpoints

Use any OpenAI-compatible endpoint (Ollama, vLLM, custom deployments):

from synthetica.paypal_llm import CustomOpenAIProvider
from synthetica.generator import SyntheticDataGenerator

# Create custom provider
class CustomGenerator(SyntheticDataGenerator):
    def __init__(self, endpoint_url, model_name, **kwargs):
        self.llm = CustomOpenAIProvider(
            endpoint_url=endpoint_url,
            model=model_name,
            api_key=kwargs.get('api_key'),
            verify_ssl=kwargs.get('verify_ssl', True)
        )
        self.batch_size = kwargs.get('batch_size', 10)
        self.max_retries = kwargs.get('max_retries', 3)

# Use your custom endpoint
generator = CustomGenerator(
    endpoint_url="https://your-api.com/v1/chat/completions",
    model_name="your-model-name",
    verify_ssl=False  # For internal endpoints
)

data = generator.generate(schema)

Using Custom Types

from synthetica import SyntheticDataGenerator, DataSchema, FieldSchema, FieldType

# Define schema with custom types
schema = DataSchema(
    name="healthcare_records",
    description="Patient healthcare records",
    num_samples=50,
    fields=[
        FieldSchema(name="patient_id", type=FieldType.UUID, required=True),
        FieldSchema(name="name", type=FieldType.NAME, required=True),
        FieldSchema(
            name="diagnosis",
            type="icd_diagnosis",  # Custom type
            description="Primary diagnosis",
            custom_type_definition="ICD-10 diagnosis with code and description",
            examples=["E11.9 - Type 2 diabetes mellitus"],
            required=True
        ),
        FieldSchema(
            name="medication",
            type=FieldType.CUSTOM,  # Using CUSTOM enum
            description="Current medication",
            custom_type_definition="Medication name, dosage, and frequency",
            examples=["Metformin 500mg twice daily"],
            required=False
        )
    ]
)

# Generate data with custom types
generator = SyntheticDataGenerator(llm_provider="openai")
data = generator.generate(schema)

Schema Format

name: "user_profiles"
description: "User profile data"
num_samples: 500
context: "Generate diverse, realistic user profiles"
fields:
  - name: "id"
    type: "uuid"
    required: true
  - name: "name"
    type: "name"
    required: true
    examples: ["John Doe", "Jane Smith"]
  - name: "email"
    type: "email"
    required: true
  - name: "age"
    type: "number"
    constraints:
      min: 18
      max: 80
  - name: "bio"
    type: "description"
    required: false

Supported Field Types

Built-in Types

  • text, name, email, phone, address
  • date, number, boolean, uuid
  • category, url, json
  • title, description, review

Custom Types

Define your own field types for specialized domains:

fields:
  - name: "medical_diagnosis"
    type: "icd_diagnosis"  # Custom type name
    description: "Medical diagnosis"
    custom_type_definition: "ICD-10 diagnosis code with description (e.g., 'E11.9 - Type 2 diabetes')"
    examples:
      - "I10 - Essential hypertension"
      - "E78.5 - Hyperlipidemia"

  - name: "certification"
    type: "custom"  # Use 'custom' enum value
    description: "Professional certification"
    custom_type_definition: "Professional certification with issuing body and expiration date"
    examples:
      - "AWS Solutions Architect - Valid until 2025-12-31"

CLI Commands

# Initialize schema
krita init-schema schema.yaml

# Generate data
krita generate schema.yaml --provider openai --output data.json

# Upload to Hugging Face
krita upload data.json username/dataset-name --description "My dataset"

# List providers
krita list-providers

Configuration

Set environment variables:

export OPENAI_API_KEY="your-key"
export ANTHROPIC_API_KEY="your-key"
export HF_TOKEN="your-token"

Custom Endpoint Examples

Ollama (local):

generator = CustomGenerator(
    endpoint_url="http://localhost:11434/v1/chat/completions",
    model_name="llama3.1"
)

vLLM deployment:

generator = CustomGenerator(
    endpoint_url="https://your-vllm-server.com/v1/chat/completions",
    model_name="meta-llama/Llama-3.1-8B-Instruct",
    api_key="your-api-key"
)

Internal enterprise endpoint:

generator = CustomGenerator(
    endpoint_url="https://internal-ai.company.com/v1/chat/completions",
    model_name="company-model-v1",
    verify_ssl=False
)

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

krita-0.1.0.tar.gz (16.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

krita-0.1.0-py3-none-any.whl (15.9 kB view details)

Uploaded Python 3

File details

Details for the file krita-0.1.0.tar.gz.

File metadata

  • Download URL: krita-0.1.0.tar.gz
  • Upload date:
  • Size: 16.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for krita-0.1.0.tar.gz
Algorithm Hash digest
SHA256 8320a8123147ccbe1adcb0c4fa888d87fe70871f01909e771e3af7682582cb53
MD5 185b014edfe6c4ea2696cbba11bb5f34
BLAKE2b-256 01d798e94c8666f26efc817978a0cbf037fba6c7723ba1ac56aa6de19c316e86

See more details on using hashes here.

File details

Details for the file krita-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: krita-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 15.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for krita-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 b42e3481c0a355c64b9b68a591006d535b1a02fc7707779386732e037e15999d
MD5 5831bd7a5ccd221e8d31f4e4b52e9efc
BLAKE2b-256 11223f210bffdb40b96a74e76102c186c4be1b64c57a34df36c90b21d383f5d4

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page