High-quality dataset generation CLI for LLM training

KothaSet

KothaSet is a powerful CLI tool for generating high-quality datasets using Large Language Models (LLMs) as teacher models. Create diverse training data for fine-tuning smaller models.

Features

  • Multi-Provider: OpenAI and OpenAI-compatible APIs (DeepSeek, vLLM, Ollama)
  • Flexible Schemas: Instruction (Alpaca), Chat (ShareGPT), Preference (DPO), Classification
  • Streaming Output: Real-time generation with progress tracking
  • Resumable: Atomic checkpointing, so an interrupted run can continue from its last checkpoint
  • JSONL Output: Streaming writes in standard JSONL format
  • Reproducible: Optional fixed seed for more repeatable runs
  • Diversity Control: Input files for sequential topic coverage
  • Validation: Validate configs, schemas, datasets, and provider connectivity

Installation

pip (Python)

pip install kothaset

npm (Node.js)

npm install -g kothaset

Homebrew (macOS/Linux)

brew install shantoislamdev/tap/kothaset

Binary Download

Download from GitHub Releases.

From Source

go install github.com/shantoislamdev/kothaset/cmd/kothaset@latest

Quick Start

  1. Initialize configuration:

    kothaset init
    
  2. Set your API key:

    # Windows PowerShell
    $env:OPENAI_API_KEY = "sk-..."
    
    # Linux/macOS
    export OPENAI_API_KEY="sk-..."
    
  3. Generate a dataset:

    kothaset generate -n 100 -s instruction --seed 42 -i topics.txt -o dataset.jsonl
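
The command in step 3 reads topics from topics.txt. The file layout is not specified here, so the sketch below assumes one topic per line; `write_topics` is a hypothetical helper, not part of KothaSet. It strips whitespace, drops blanks and case-insensitive duplicates, and writes the rest in first-seen order.

```python
from pathlib import Path


def write_topics(topics, path="topics.txt"):
    """Write unique, non-empty topics to `path`, one per line (assumed format)."""
    seen = set()
    cleaned = []
    for topic in topics:
        topic = topic.strip()
        # Skip blanks and case-insensitive duplicates, keeping first-seen order.
        if topic and topic.lower() not in seen:
            seen.add(topic.lower())
            cleaned.append(topic)
    Path(path).write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return cleaned
```

Calling `write_topics(["Photosynthesis", "Binary search"])` would produce a two-line topics.txt that can then be passed with `-i topics.txt`.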
    

Configuration

KothaSet uses a two-file configuration system for better security and organization:

1. kothaset.yaml (Public)

Contains shared settings, context, and instructions. Safe to commit to git.

version: "1.0"
global:
  provider: openai
  schema: instruction
  model: gpt-5.2
  concurrency: 4
  output_dir: ./output
  checkpoint_every: 10  # Save checkpoint every N samples (default: 10)

# Context: Background info or persona injected into every prompt
context: |
  Generate high-quality training data for an AI assistant.
  The data should be helpful, accurate, and well-formatted.

# Instructions: Specific rules and guidelines for generation
instructions:
  - Be creative and diverse in topics and approaches
  - Vary the style and complexity of responses
  - Use clear and concise language
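
To illustrate what `checkpoint_every` combined with atomic checkpointing can look like, here is a minimal Python sketch. It is not KothaSet's implementation: the JSON checkpoint layout, the `completed` field, and the `save_checkpoint` and `run` helpers are all assumptions made for the example. The technique shown is the common write-to-temp-file-then-rename pattern, which leaves either the old or the new checkpoint on disk and never a half-written one.

```python
import json
import os
import tempfile


def save_checkpoint(state, path):
    """Atomically write `state` as JSON to `path` (hypothetical checkpoint format)."""
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    # The temp file lives in the same directory so the final rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())
        # os.replace swaps the new file in over any previous checkpoint.
        os.replace(tmp_path, path)
    except BaseException:
        # On failure, remove the temp file and leave the old checkpoint untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def run(total, checkpoint_every, path):
    """Simulate generating `total` samples, saving every `checkpoint_every` samples."""
    saves = 0
    for done in range(1, total + 1):
        # ... one sample would be generated here ...
        if done % checkpoint_every == 0 or done == total:
            save_checkpoint({"completed": done}, path)
            saves += 1
    return saves
```

With `total=25` and `checkpoint_every=10`, this sketch saves after samples 10, 20, and 25, so an interruption loses at most the samples generated since the last save.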

2. .secrets.yaml (Private)

Contains sensitive provider credentials. Add this to your .gitignore! kothaset init creates this file with owner-only permissions (0600 on Unix-like systems).

providers:
  - name: openai
    type: openai
    api_key: env.OPENAI_API_KEY  # Reads from environment variable
    # api_key: sk-...            # Or hardcode key directly
    timeout: 1m
    rate_limit:
      requests_per_minute: 60

  # Custom endpoint example (DeepSeek, vLLM)
  - name: local
    type: openai
    base_url: http://localhost:8000/v1
    api_key: not-needed
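
The `env.` prefix in `api_key` tells KothaSet to read the key from an environment variable. As a rough illustration of that convention, the Python sketch below resolves such a value; `resolve_secret` is a hypothetical name, and raising an error for an unset variable is an assumption about reasonable behavior, not a description of what KothaSet does.

```python
import os


def resolve_secret(value, environ=os.environ):
    """Return the literal value, or the environment variable named after 'env.'."""
    prefix = "env."
    if value.startswith(prefix):
        name = value[len(prefix):]
        if name not in environ:
            # Assumed behavior: fail loudly rather than send an empty key.
            raise KeyError(f"environment variable {name} is not set")
        return environ[name]
    # Anything without the prefix is treated as a hardcoded key.
    return value
```

Keeping the reference form (`env.OPENAI_API_KEY`) in the file means the actual key is never written to disk, even in the private file.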

rate_limit.requests_per_minute is actively enforced during generation. Lower values reduce request throughput.
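
One simple way to enforce a requests-per-minute limit is to space request starts at least 60 / requests_per_minute seconds apart. The Python sketch below shows that idea only; KothaSet's actual limiter may use a different algorithm (for example a token bucket), and the class and parameter names are hypothetical. The clock and sleep functions are injectable so the behavior can be checked without waiting.

```python
import time


class MinIntervalLimiter:
    """Space request starts at least 60 / requests_per_minute seconds apart."""

    def __init__(self, requests_per_minute, clock=time.monotonic, sleep=time.sleep):
        if requests_per_minute <= 0:
            raise ValueError("requests_per_minute must be positive")
        self.interval = 60.0 / requests_per_minute
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # no request has started yet

    def wait(self):
        """Block until the next request may start; return the seconds waited."""
        now = self._clock()
        waited = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            waited = self._next_allowed - now
            self._sleep(waited)
            now = self._next_allowed
        # Reserve the slot for the request that is about to start.
        self._next_allowed = now + self.interval
        return waited
```

With `requests_per_minute: 60`, this sketch allows one request start per second, so 1,000 samples would take at least about 1,000 seconds regardless of concurrency.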


Usage

Selecting a Schema

Schema          Description                                 Use Case
instruction     Alpaca-style {instruction, input, output}   SFT
chat            ShareGPT multi-turn conversations           Chat fine-tuning
preference      {prompt, chosen, rejected} pairs            DPO/RLHF
classification  {text, label} pairs                         Classifiers

# Instruction dataset
kothaset generate -n 1000 -s instruction --seed 42 -i topics.txt -o instructions.jsonl

# Chat conversations
kothaset generate -n 500 -s chat --seed 123 -i conversations.txt -o conversations.jsonl

# Preference pairs for DPO  
kothaset generate -n 500 -s preference --seed 456 -i pairs.txt -o dpo_data.jsonl
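
A quick way to sanity-check a generated file is to confirm that every record carries the fields listed for its schema. The Python sketch below does that for the three schemas whose fields are spelled out above; the chat schema is left out because its exact field names are not given here. `validate_jsonl` is a hypothetical helper, and it assumes each line is a JSON object that may also contain extra fields.

```python
import json

# Field names taken from the schema table; extra fields are tolerated.
REQUIRED_FIELDS = {
    "instruction": {"instruction", "input", "output"},
    "preference": {"prompt", "chosen", "rejected"},
    "classification": {"text", "label"},
}


def validate_jsonl(lines, schema):
    """Return (line_number, problem) tuples for records that fail the check."""
    required = REQUIRED_FIELDS[schema]
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "record is not a JSON object"))
            continue
        missing = sorted(required - set(record))
        if missing:
            problems.append((number, "missing fields: " + ", ".join(missing)))
    return problems
```

Passing an open file works directly, for example `validate_jsonl(open("instructions.jsonl", encoding="utf-8"), "instruction")`; an empty list means every record has the expected fields.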

Output Formats

# JSONL (default)
kothaset generate -n 100 --seed 42 -i topics.txt -f jsonl -o dataset.jsonl

kothaset generate automatically creates parent directories for --output paths (for example, -o output/data/dataset.jsonl).
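
For readers who post-process the output in Python, the sketch below shows the same two ideas on the consuming side: creating missing parent directories before writing, and handling JSONL one record per line. The helper names are hypothetical and the code is independent of KothaSet.

```python
import json
from pathlib import Path


def append_record(path, record):
    """Append one JSON object as a single line, creating parent directories."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_records(path):
    """Yield records from a JSONL file one at a time, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)
```

Because each record is a complete line, a file written this way stays readable up to the last finished record even if the writer stops partway through a run.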

Advanced Options

# Use custom provider
kothaset generate -n 100 --seed 42 -i topics.txt -p local -o dataset.jsonl

# Control diversity with input file
kothaset generate -n 1000 --seed 42 -i topics.txt -o diverse.jsonl

# Resume interrupted generation
# (use the exact checkpoint filename from `.kothaset/`)
kothaset generate --resume .kothaset/<checkpoint-file>.checkpoint

# Dry run (validate config)
kothaset generate --dry-run -n 100 --seed 42 -i topics.txt
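
Since `--resume` needs the exact checkpoint filename, a small helper can look it up. The Python sketch below picks the most recently modified `*.checkpoint` file in `.kothaset/`; the function name is hypothetical, and choosing by modification time is an assumption about which checkpoint you want.

```python
from pathlib import Path


def latest_checkpoint(directory=".kothaset"):
    """Return the most recently modified *.checkpoint file, or None if there is none."""
    root = Path(directory)
    if not root.is_dir():
        return None
    candidates = list(root.glob("*.checkpoint"))
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

The returned path can then be passed to `kothaset generate --resume`.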

Contributing

Contributions welcome! See CONTRIBUTING.md.

License

Apache 2.0 License. See LICENSE.
