Skip to main content

High-quality dataset generation CLI for LLM training

Project description

KothaSet

Go Version npm version PyPI version License

KothaSet is a powerful CLI tool for generating high-quality datasets using Large Language Models (LLMs) as teacher models. Create diverse training data for fine-tuning smaller models.

Features

  • Multi-Provider — OpenAI, and OpenAI-compatible APIs (DeepSeek, vLLM, Ollama)
  • Flexible Schemas — Instruction (Alpaca), Chat (ShareGPT), Preference (DPO), Classification
  • Streaming Output — Real-time generation with progress tracking
  • Resumable — Atomic checkpointing, never lose progress
  • Multiple Formats — JSONL, Native Parquet, HuggingFace datasets
  • Reproducible — Required seed for deterministic LLM generation
  • Diversity Control — Input files for sequential topic coverage
  • Validation — Validate configs, schemas, datasets, and provider connectivity

Installation

pip (Python)

pip install kothaset

npm (Node.js)

npm install -g kothaset

Homebrew (macOS/Linux)

brew install shantoislamdev/tap/kothaset

Binary Download

Download from GitHub Releases.

From Source

go install github.com/shantoislamdev/kothaset/cmd/kothaset@latest

Quick Start

  1. Initialize configuration:

    kothaset init
    
  2. Set your API key:

    # Windows PowerShell
    $env:OPENAI_API_KEY = "sk-..."
    
    # Linux/macOS
    export OPENAI_API_KEY="sk-..."
    
  3. Generate a dataset:

    kothaset generate -n 100 -s instruction --seed 42 -i topics.txt -o dataset.jsonl
    

Configuration

KothaSet uses a two-file configuration system for better security and organization:

1. kothaset.yaml (Public)

Contains shared settings, context, and instructions. Safe to commit to git.

version: "1.0"
global:
  provider: openai
  schema: instruction
  model: gpt-5.2
  concurrency: 4
  output_dir: ./output

# Context: Background info or persona injected into every prompt
context: |
  Generate high-quality training data for an AI assistant.
  The data should be helpful, accurate, and well-formatted.

# Instructions: Specific rules and guidelines for generation
instructions:
  - Be creative and diverse in topics and approaches
  - Vary the style and complexity of responses
  - Use clear and concise language

2. .secrets.yaml (Private)

Contains sensitive provider credentials. Add this to your .gitignore!

providers:
  - name: openai
    type: openai
    api_key: env.OPENAI_API_KEY  # Reads from environment variable
    # api_key: sk-...            # Or hardcode key directly
    timeout: 1m
    rate_limit:
      requests_per_minute: 60

  # Custom endpoint example (DeepSeek, vLLM)
  - name: local
    type: openai
    base_url: http://localhost:8000/v1
    api_key: not-needed

Usage

Selecting a Schema

Schema Description Use Case
instruction Alpaca-style {instruction, input, output} SFT
chat ShareGPT multi-turn conversations Chat fine-tuning
preference {prompt, chosen, rejected} pairs DPO/RLHF
classification {text, label} pairs Classifiers
# Instruction dataset
kothaset generate -n 1000 -s instruction --seed 42 -i topics.txt -o instructions.jsonl

# Chat conversations
kothaset generate -n 500 -s chat --seed 123 -i conversations.txt -o conversations.jsonl

# Preference pairs for DPO  
kothaset generate -n 500 -s preference --seed 456 -i pairs.txt -o dpo_data.jsonl

Output Formats

# JSONL (default)
kothaset generate -n 100 --seed 42 -i topics.txt -f jsonl -o dataset.jsonl

# Parquet
kothaset generate -n 100 --seed 42 -i topics.txt -f parquet -o dataset.parquet

# HuggingFace datasets format
kothaset generate -n 100 --seed 42 -i topics.txt -f hf -o ./my_dataset

Advanced Options

# Use custom provider
kothaset generate -n 100 --seed 42 -i topics.txt -p local -o dataset.jsonl

# Control diversity with input file
kothaset generate -n 1000 --seed 42 -i topics.txt -o diverse.jsonl

# Resume interrupted generation
kothaset generate --resume dataset.jsonl.checkpoint

# Dry run (validate config)
kothaset generate --dry-run -n 100 --seed 42 -i topics.txt

Documentation

Getting Started

Reference

Help


Contributing

Contributions welcome! See CONTRIBUTING.md.

License

Apache 2.0 License. See LICENSE.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

kothaset-1.0.2-py3-none-win_amd64.whl (4.3 MB view details)

Uploaded Python 3Windows x86-64

kothaset-1.0.2-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.2 MB view details)

Uploaded Python 3manylinux: glibc 2.17+ x86-64

kothaset-1.0.2-py3-none-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (3.8 MB view details)

Uploaded Python 3manylinux: glibc 2.17+ ARM64

kothaset-1.0.2-py3-none-macosx_11_0_arm64.whl (4.0 MB view details)

Uploaded Python 3macOS 11.0+ ARM64

kothaset-1.0.2-py3-none-macosx_10_12_x86_64.whl (4.3 MB view details)

Uploaded Python 3macOS 10.12+ x86-64

File details

Details for the file kothaset-1.0.2-py3-none-win_amd64.whl.

File metadata

  • Download URL: kothaset-1.0.2-py3-none-win_amd64.whl
  • Upload date:
  • Size: 4.3 MB
  • Tags: Python 3, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for kothaset-1.0.2-py3-none-win_amd64.whl
Algorithm Hash digest
SHA256 a0e1e8dbab96317900113b4f37efa1b81fbdc87922c0cce69b10d2d82c2d14b2
MD5 4ea606a97c6706cc14f15ce9eba6b6f5
BLAKE2b-256 21a43d7d27707e9974bca6c3dd33b7dd29d946bbdd34ea7be0442f84c72ad365

See more details on using hashes here.

Provenance

The following attestation bundles were made for kothaset-1.0.2-py3-none-win_amd64.whl:

Publisher: release.yml on shantoislamdev/kothaset

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file kothaset-1.0.2-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for kothaset-1.0.2-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 c6e31045f13d489caf04ee31f15278301670c63897d022965422855118156231
MD5 d4c304a8a112ad01a342c3defddf4efd
BLAKE2b-256 62b596342fabe4272b6eb14593031ee1c631d4ac8a2df7f9b8f7e783056360d1

See more details on using hashes here.

Provenance

The following attestation bundles were made for kothaset-1.0.2-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: release.yml on shantoislamdev/kothaset

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file kothaset-1.0.2-py3-none-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for kothaset-1.0.2-py3-none-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 b5332d2e3b22dcb4eff3136538e5b178acac66d0832514d0ba1f043751fcdae6
MD5 84521a923b7dddd47fd5bc1d20933e40
BLAKE2b-256 5640323dc72f6dc208fa1847d5be6782fdb7c10a0ff4c3a60ba17663552bb792

See more details on using hashes here.

Provenance

The following attestation bundles were made for kothaset-1.0.2-py3-none-manylinux_2_17_aarch64.manylinux2014_aarch64.whl:

Publisher: release.yml on shantoislamdev/kothaset

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file kothaset-1.0.2-py3-none-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for kothaset-1.0.2-py3-none-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 c4426ee507359faf9777256ab36e8e5c6445612ba1367c40e2cffd68819582ed
MD5 db137eff6b9215341c4b9788b405939e
BLAKE2b-256 4d6f7b7b4c24260cd6ce19a3ca4f0ef89f0e9834fc56b1a0c78b45476b7f14bb

See more details on using hashes here.

Provenance

The following attestation bundles were made for kothaset-1.0.2-py3-none-macosx_11_0_arm64.whl:

Publisher: release.yml on shantoislamdev/kothaset

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file kothaset-1.0.2-py3-none-macosx_10_12_x86_64.whl.

File metadata

File hashes

Hashes for kothaset-1.0.2-py3-none-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 b6ca8502bf537f4af2b9c003ca64dba45a48c89f22c97f9fd31888095c216fa2
MD5 ce7a0878eeb4445738e6b521d264dd26
BLAKE2b-256 75e20352fe13eaae4d70c4c85062634228392b3f6b97b4d11d9e650a29ed341b

See more details on using hashes here.

Provenance

The following attestation bundles were made for kothaset-1.0.2-py3-none-macosx_10_12_x86_64.whl:

Publisher: release.yml on shantoislamdev/kothaset

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page