High-quality dataset generation CLI for LLM training

KothaSet

KothaSet is a CLI tool that generates high-quality datasets using Large Language Models (LLMs) as teacher models, producing diverse training data for fine-tuning smaller models.

Features

  • Multi-Provider: OpenAI and OpenAI-compatible APIs (DeepSeek, vLLM, Ollama)
  • Flexible Schemas: Instruction (Alpaca), Chat (ShareGPT), Preference (DPO), Classification
  • Streaming Output: Real-time generation with progress tracking
  • Resumable: Atomic checkpointing, so an interrupted run does not lose progress
  • Multiple Formats: JSONL, native Parquet, HuggingFace datasets
  • Reproducible: A seed is required for deterministic LLM generation
  • Diversity Control: Input files for sequential topic coverage
  • Validation: Validate configs, schemas, datasets, and provider connectivity

Installation

pip (Python)

pip install kothaset

npm (Node.js)

npm install -g kothaset

Homebrew (macOS/Linux)

brew install shantoislamdev/tap/kothaset

Binary Download

Download from GitHub Releases.

From Source

go install github.com/shantoislamdev/kothaset/cmd/kothaset@latest

Quick Start

  1. Initialize configuration:

    kothaset init
    
  2. Set your API key:

    # Windows PowerShell
    $env:OPENAI_API_KEY = "sk-..."
    
    # Linux/macOS
    export OPENAI_API_KEY="sk-..."
    
  3. Generate a dataset:

    kothaset generate -n 100 -s instruction --seed 42 -i topics.txt -o dataset.jsonl
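
The `-i topics.txt` argument in step 3 points at an input file. The following is a minimal sketch of producing such a file from Python, assuming the file is plain text with one topic per line (the format is not spelled out here, so treat that as an assumption). The helper name `write_topics` and the topic strings are hypothetical.

```python
from pathlib import Path


def write_topics(path, topics):
    """Write non-empty, de-duplicated topics to `path`, one per line.

    The one-topic-per-line layout is an assumption about the input format.
    Returns the number of topics written.
    """
    seen = set()
    cleaned = []
    for topic in topics:
        topic = topic.strip()
        # Skip blank entries and repeats while keeping the original order.
        if topic and topic not in seen:
            seen.add(topic)
            cleaned.append(topic)
    Path(path).write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return len(cleaned)
```

Calling `write_topics("topics.txt", ["Sorting algorithms", "HTTP caching"])` would create a two-line file that can then be passed to `-i`.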
    

Configuration

KothaSet uses a two-file configuration system for better security and organization:

1. kothaset.yaml (Public)

Contains shared settings, context, and instructions. Safe to commit to git.

version: "1.0"
global:
  provider: openai
  schema: instruction
  model: gpt-5.2
  concurrency: 4
  output_dir: ./output

# Context: Background info or persona injected into every prompt
context: |
  Generate high-quality training data for an AI assistant.
  The data should be helpful, accurate, and well-formatted.

# Instructions: Specific rules and guidelines for generation
instructions:
  - Be creative and diverse in topics and approaches
  - Vary the style and complexity of responses
  - Use clear and concise language

2. .secrets.yaml (Private)

Contains sensitive provider credentials. Add this to your .gitignore!

providers:
  - name: openai
    type: openai
    api_key: env.OPENAI_API_KEY  # Reads from environment variable
    # api_key: sk-...            # Or hardcode key directly
    timeout: 1m
    rate_limit:
      requests_per_minute: 60

  # Custom endpoint example (DeepSeek, vLLM)
  - name: local
    type: openai
    base_url: http://localhost:8000/v1
    api_key: not-needed
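
The `env.OPENAI_API_KEY` value above is described as being read from an environment variable. As an illustration of that convention only (this is not KothaSet's actual implementation), a resolver might look like the sketch below. The function name `resolve_secret` and the rule that any value without the `env.` prefix is a literal key are assumptions.

```python
import os


def resolve_secret(value, environ=None):
    """Resolve an `env.NAME` reference to the value of environment variable NAME.

    Any other string is returned unchanged and treated as a literal key.
    `environ` defaults to the real process environment; pass a dict for testing.
    """
    environ = os.environ if environ is None else environ
    prefix = "env."
    if value.startswith(prefix):
        name = value[len(prefix):]
        if name not in environ:
            raise KeyError(f"environment variable {name} is not set")
        return environ[name]
    return value


if __name__ == "__main__":
    fake_env = {"OPENAI_API_KEY": "sk-example"}
    print(resolve_secret("env.OPENAI_API_KEY", fake_env))  # prints sk-example
    print(resolve_secret("not-needed", fake_env))          # prints not-needed
```

Failing loudly on a missing variable surfaces a misconfigured shell before any API request is made.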

Usage

Selecting a Schema

Schema          Description                                  Use Case
instruction     Alpaca-style {instruction, input, output}    SFT
chat            ShareGPT multi-turn conversations            Chat fine-tuning
preference      {prompt, chosen, rejected} pairs             DPO/RLHF
classification  {text, label} pairs                          Classifiers

# Instruction dataset
kothaset generate -n 1000 -s instruction --seed 42 -i topics.txt -o instructions.jsonl

# Chat conversations
kothaset generate -n 500 -s chat --seed 123 -i conversations.txt -o conversations.jsonl

# Preference pairs for DPO  
kothaset generate -n 500 -s preference --seed 456 -i pairs.txt -o dpo_data.jsonl
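
The field names listed for each schema can be used to sanity-check generated records. The sketch below assumes each record is a JSON object whose top-level keys match the fields named in the schema list. The `chat` schema is omitted because its exact field names are not given here. `REQUIRED_FIELDS` and `missing_fields` are hypothetical names, not part of KothaSet.

```python
# Required top-level keys per schema, taken from the schema descriptions above.
REQUIRED_FIELDS = {
    "instruction": {"instruction", "input", "output"},
    "preference": {"prompt", "chosen", "rejected"},
    "classification": {"text", "label"},
}


def missing_fields(record, schema):
    """Return the sorted list of required fields absent from `record`."""
    return sorted(REQUIRED_FIELDS[schema] - set(record))


if __name__ == "__main__":
    record = {"instruction": "Explain binary search.", "output": "..."}
    print(missing_fields(record, "instruction"))  # prints ['input']
```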

Output Formats

# JSONL (default)
kothaset generate -n 100 --seed 42 -i topics.txt -f jsonl -o dataset.jsonl

# Parquet
kothaset generate -n 100 --seed 42 -i topics.txt -f parquet -o dataset.parquet

# HuggingFace datasets format
kothaset generate -n 100 --seed 42 -i topics.txt -f hf -o ./my_dataset
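
A JSONL file holds one JSON object per line, so it can be read with the Python standard library alone. The following is a minimal reader sketch. The name `read_jsonl` is hypothetical, and tolerating blank lines is a design choice of this example rather than documented KothaSet behavior.

```python
import json


def read_jsonl(path):
    """Yield one parsed JSON object per non-blank line of a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                # Report the file and line so a bad record is easy to locate.
                raise ValueError(f"{path}:{line_number}: invalid JSON: {exc}") from exc
```

For example, `sum(1 for _ in read_jsonl("dataset.jsonl"))` counts the records in a generated file without loading them all into memory.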

Advanced Options

# Use custom provider
kothaset generate -n 100 --seed 42 -i topics.txt -p local -o dataset.jsonl

# Control diversity with input file
kothaset generate -n 1000 --seed 42 -i topics.txt -o diverse.jsonl

# Resume interrupted generation
kothaset generate --resume dataset.jsonl.checkpoint

# Dry run (validate config)
kothaset generate --dry-run -n 100 --seed 42 -i topics.txt
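
Resuming depends on a checkpoint that is never left half-written. KothaSet's checkpoint format is not described here, so the sketch below only illustrates the general write-to-temp-then-rename technique commonly used for atomic checkpointing. The function names and the JSON layout of `state` are hypothetical.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Atomically replace `path` with the JSON encoding of `state`."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".checkpoint-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(state, handle)
            handle.flush()
            os.fsync(handle.fileno())  # push the data to disk before renaming
        os.replace(tmp_path, path)  # readers see the old file or the new one, never a partial write
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    try:
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    except FileNotFoundError:
        return None
```

A generation loop could call `save_checkpoint` after each batch with, say, the number of completed samples, and call `load_checkpoint` on startup to decide where to continue.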

Contributing

Contributions welcome! See CONTRIBUTING.md.

License

Apache 2.0 License. See LICENSE.
