
A framework for building high-quality, verifiable evaluation datasets for LLMs.


Kushim: A Framework for Verifiable, Self-Optimizing LLM Evaluation Datasets

Kushim is a framework for generating high-quality, verifiable Question & Answer datasets. In an era of generative models, creating reliable evaluation data is one of the hardest problems in working with LLMs. Kushim addresses this with an end-to-end workflow built on two core principles: verifiability by design and self-optimizing quality.

It's not just about generating data; it's about generating trustworthy data that gets better on its own.

Kushim Illustration

The Kushim Philosophy: Core Concepts

1. Verifiable by Design

The biggest risk with synthetic data is factual inconsistency. Kushim is built to mitigate this risk. Every single question-answer pair generated by the pipeline is subjected to a strict validation step. An LLM-based validator checks if the generated answer is factually and unambiguously supported by the original source text. If a pair fails this check, it's discarded.

This ensures that your final dataset isn't just a collection of plausible-sounding questions, but a set of verifiable facts grounded in a source of truth.
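The validation gate can be pictured as a simple filter over candidate pairs. The sketch below is illustrative, not Kushim's actual implementation: `judge` stands in for the LLM-based validator, which in practice would be a model call returning whether the answer is unambiguously supported by the source chunk. The `naive_judge` shown here is a trivial stand-in.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class QAPair:
    question: str
    answer: str
    source_chunk: str

def filter_verifiable(pairs: list[QAPair],
                      judge: Callable[[QAPair], bool]) -> list[QAPair]:
    """Keep only pairs whose answer the judge deems supported by the source."""
    return [p for p in pairs if judge(p)]

# Trivial stand-in judge: require the answer to appear verbatim in the chunk.
# A real validator would ask an LLM for a supported/unsupported verdict.
def naive_judge(pair: QAPair) -> bool:
    return pair.answer.lower() in pair.source_chunk.lower()
```

The important design point is that the judge sees the original source chunk, not just the Q&A pair, so every kept pair is grounded in a specific span of the source text.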

2. Self-Optimizing Quality with DSPy

A static, one-size-fits-all prompt is rarely optimal: the best way to phrase a question depends on the source material. Kushim leverages the power of DSPy to create a self-improving pipeline.

Instead of just running a prompt, Kushim can "compile" it. It uses DSPy's optimizers (teleprompters) to:

  1. Generate a small training set from your source documents.
  2. Test multiple variations of prompts to see which ones produce the highest-quality, most verifiable Q&A pairs for your specific data.
  3. Save this "compiled" program, which contains the optimized, high-performance prompts.

This means Kushim learns from your data to improve its own performance, leading to a significantly higher-quality final dataset.
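The core idea behind this "compile" step can be sketched as a small search over candidate prompts, scored by how many of their generated pairs survive validation. This is a schematic analogue of what DSPy's optimizers do, not Kushim's or DSPy's actual code; `generate` and `validate` are stand-ins for model calls.

```python
def score_prompt(template: str, train_set: list[dict],
                 generate, validate) -> float:
    """Fraction of generated Q&A pairs that pass validation."""
    passed = 0
    for example in train_set:
        qa = generate(template, example["chunk"])
        if validate(qa, example["chunk"]):
            passed += 1
    return passed / len(train_set)

def compile_best_prompt(templates: list[str], train_set: list[dict],
                        generate, validate) -> str:
    """Return the template with the highest validation pass rate."""
    return max(templates,
               key=lambda t: score_prompt(t, train_set, generate, validate))
```

In the real pipeline, DSPy's teleprompters explore a richer space than a fixed template list (few-shot demonstrations, instruction rewrites), but the objective is the same: maximize the rate of verifiable pairs on your specific data.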

How It Works: The Kushim Pipeline Workflow

The Kushim pipeline integrates these concepts into an efficient, streaming workflow that proceeds in the following stages:

  1. Source & Fetch: The process begins by fetching raw documents from a designated Source, such as a Wikipedia article or a local file directory.

  2. Chunking: The fetched documents are broken down into smaller, manageable text chunks. This is a standard practice in RAG-style pipelines and prepares the data for the generation models.
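A minimal word-count chunker with overlap illustrates this stage; Kushim's actual chunking strategy and parameters may differ.

```python
def chunk_text(text: str, max_words: int = 200, overlap: int = 20) -> list[str]:
    """Split text into word-bounded chunks with a small overlap,
    so a fact spanning a boundary appears whole in at least one chunk."""
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + max_words])
        if chunk:
            chunks.append(chunk)
    return chunks
```

The overlap matters for the validation step later: a Q&A pair can only be verified against its own chunk, so chunks must be large enough to contain complete facts.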

  3. Self-Optimization (A One-Time "Compile" Step): This is the heart of Kushim's quality assurance process. Instead of using a static prompt, the pipeline:

    • Generates a Training Set: It takes a small sample of the chunks to create a temporary training set.
    • Optimizes Prompts: It uses DSPy to test multiple prompt variations, identifying the one that produces the highest-quality, most verifiable Q&A pairs for your specific data.
    • Saves the Compiled Program: The resulting "compiled" program, containing the optimized prompts, is saved and reused for the main generation task.
  4. Generation: Using the high-performance prompts from the compilation step, the pipeline generates question-answer pairs from all of the text chunks.

  5. Validation & Filtering: Each generated Q&A pair is rigorously validated. An LLM checks if the answer is factually supported by its original source chunk. Pairs that pass validation proceed to the final dataset; those that fail are discarded.

This multi-stage process ensures that the final output is not only relevant but also verifiable and of the highest possible quality.
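The five stages compose naturally. The sketch below wires them together schematically; `fetch`, `generate`, and `validate` are stubs standing in for the real source fetchers and model calls, and this is not Kushim's internal structure.

```python
def run_pipeline(fetch, chunk, generate, validate) -> list:
    """Schematic composition of the Kushim stages:
    fetch -> chunk -> generate -> validate/filter."""
    dataset = []
    for document in fetch():          # 1. Source & Fetch
        for piece in chunk(document): # 2. Chunking
            qa = generate(piece)      # 4. Generation (post-compilation)
            if validate(qa, piece):   # 5. Validation & Filtering
                dataset.append(qa)
    return dataset
```

The one-time compile step (stage 3) is absent here because it happens before generation: its output is the optimized `generate` function that this loop then applies to every chunk.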

Getting Started

After installing Kushim (uv add kushim or pip install kushim), you can use its core components directly. The key is to instantiate the pipeline and run it. The optimization is handled for you—the first run compiles and saves the best prompts, and subsequent runs are fast.

# A conceptual example of using the Kushim pipeline
from kushim import pipeline, config, source

# 1. Choose your source and model
data_source = source.WikipediaSource()
pipeline_config = config.KushimConfig(
    model_name="openrouter/openai/gpt-4.1",
    fetch_kwargs={"mode": "search", "query": "History of coffee"}
)

# 2. Instantiate the pipeline
kushim_pipeline = pipeline.KushimPipeline(
    source=data_source,
    config=pipeline_config
)

# 3. Run it!
# This will automatically handle compiling and saving the optimized
# generator to a .json file for you on the first run.
validated_dataset, _ = kushim_pipeline.run(
    optimize=True,
    compiled_generator_path="compiled_coffee_generator.json"
)

print(validated_dataset)

For complete, runnable scripts demonstrating the full dataset creation lifecycle (merging, encryption, and pushing to the Hugging Face Hub), please see the examples/ directory in the GitHub repository.
