High-quality dataset generation CLI for LLM training

KothaSet

KothaSet is a CLI tool for generating high-quality datasets with Large Language Models (LLMs) acting as teacher models. Use it to create diverse training data for fine-tuning smaller models.

Features

  • Multi-Provider: OpenAI and OpenAI-compatible APIs (DeepSeek, vLLM, Ollama)
  • Flexible Schemas: Instruction (Alpaca), Chat (ShareGPT), Preference (DPO), Classification
  • Streaming Output: Real-time generation with progress tracking
  • Resumable: Atomic checkpointing, so an interrupted run can resume without losing progress
  • Multiple Formats: JSONL, native Parquet, HuggingFace datasets
  • Reproducible: A required seed for deterministic LLM generation
  • Diversity Control: Input files for sequential topic coverage
  • Validation: Validate configs, schemas, datasets, and provider connectivity

Installation

pip (Python)

pip install kothaset

npm (Node.js)

npm install -g kothaset

Homebrew (macOS/Linux)

brew install shantoislamdev/tap/kothaset

Binary Download

Download from GitHub Releases.

From Source

go install github.com/shantoislamdev/kothaset/cmd/kothaset@latest

Quick Start

  1. Initialize configuration:

    kothaset init
    
  2. Set your API key:

    # Windows PowerShell
    $env:OPENAI_API_KEY = "sk-..."
    
    # Linux/macOS
    export OPENAI_API_KEY="sk-..."
    
  3. Generate a dataset:

    kothaset generate -n 100 -s instruction --seed 42 -i topics.txt -o dataset.jsonl
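
The `-i topics.txt` argument in step 3 points at an input file whose format is not shown on this page. As a minimal sketch, assuming the file is plain text with one topic per line (an assumption, not something the source confirms), the Python helpers below write such a file and assemble the same command from the flags used above. The names `write_topics` and `build_generate_command` are hypothetical and not part of KothaSet.

```python
import shlex
from pathlib import Path


def write_topics(path, topics):
    """Write one topic per line (assumed input-file format), skipping blanks."""
    cleaned = [t.strip() for t in topics if t.strip()]
    Path(path).write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return len(cleaned)


def build_generate_command(n, schema, seed, input_path, output_path):
    """Assemble the Quick Start command using only the flags shown above."""
    return [
        "kothaset", "generate",
        "-n", str(n),
        "-s", schema,
        "--seed", str(seed),
        "-i", str(input_path),
        "-o", str(output_path),
    ]


def format_command(argv):
    """Render the argument list as a shell-quoted string for review."""
    return shlex.join(argv)
```

Calling `format_command(build_generate_command(100, "instruction", 42, "topics.txt", "dataset.jsonl"))` returns the command from step 3 as a single string. The argument list itself can be passed to `subprocess.run` once the command has been reviewed.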
    

Configuration

KothaSet uses a two-file configuration system for better security and organization:

1. kothaset.yaml (Public)

Contains shared settings, context, and instructions. Safe to commit to git.

version: "1.0"
global:
  provider: openai
  schema: instruction
  model: gpt-5.2
  concurrency: 4
  output_dir: ./output

# Context: Background info or persona injected into every prompt
context: |
  Generate high-quality training data for an AI assistant.
  The data should be helpful, accurate, and well-formatted.

# Instructions: Specific rules and guidelines for generation
instructions:
  - Be creative and diverse in topics and approaches
  - Vary the style and complexity of responses
  - Use clear and concise language

2. .secrets.yaml (Private)

Contains sensitive provider credentials. Add this to your .gitignore!

providers:
  - name: openai
    type: openai
    api_key: env.OPENAI_API_KEY  # Reads from environment variable
    # api_key: sk-...            # Or hardcode key directly
    timeout: 1m
    rate_limit:
      requests_per_minute: 60

  # Custom endpoint example (DeepSeek, vLLM)
  - name: local
    type: openai
    base_url: http://localhost:8000/v1
    api_key: not-needed
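
The `env.OPENAI_API_KEY` value above is described as reading the key from an environment variable. How KothaSet resolves it internally is not shown here. The following Python sketch only illustrates the convention as the comment describes it: a value starting with `env.` is looked up in the environment, and anything else is treated as a literal key. The function name `resolve_secret` is hypothetical.

```python
import os


def resolve_secret(value, environ=None):
    """Resolve an `env.NAME` reference to the value of environment variable NAME.

    Values without the `env.` prefix are returned unchanged (a literal key).
    This mirrors the convention described in the config comment; it is an
    illustration, not KothaSet's actual implementation.
    """
    environ = os.environ if environ is None else environ
    prefix = "env."
    if not value.startswith(prefix):
        return value
    name = value[len(prefix):]
    if not environ.get(name):
        raise KeyError(f"environment variable {name} is not set")
    return environ[name]
```

Failing loudly on a missing variable is a design choice in this sketch: an empty key would otherwise surface later as an opaque authentication error from the provider.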

Usage

Selecting a Schema

Schema          Description                                 Use Case
instruction     Alpaca-style {instruction, input, output}   SFT
chat            ShareGPT multi-turn conversations           Chat fine-tuning
preference      {prompt, chosen, rejected} pairs            DPO/RLHF
classification  {text, label} pairs                         Classifiers

# Instruction dataset
kothaset generate -n 1000 -s instruction --seed 42 -i topics.txt -o instructions.jsonl

# Chat conversations
kothaset generate -n 500 -s chat --seed 123 -i conversations.txt -o conversations.jsonl

# Preference pairs for DPO
kothaset generate -n 500 -s preference --seed 456 -i pairs.txt -o dpo_data.jsonl
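
The schema table lists the fields each record carries. As a rough sketch, a generated record can be checked against those field sets before training. This assumes the output files use exactly the key names shown in the table, which this page does not confirm. The chat schema is left out because its field names are not listed here. `REQUIRED_FIELDS` and `missing_fields` are hypothetical names.

```python
# Field sets copied from the schema table; the chat schema is omitted
# because its exact keys are not listed in this README.
REQUIRED_FIELDS = {
    "instruction": {"instruction", "input", "output"},
    "preference": {"prompt", "chosen", "rejected"},
    "classification": {"text", "label"},
}


def missing_fields(schema, record):
    """Return the sorted list of required fields absent from `record`."""
    return sorted(REQUIRED_FIELDS[schema] - set(record))
```

An empty list means the record has every field the table names for that schema. Extra keys are ignored, so the check stays usable if the tool adds metadata fields.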

Output Formats

# JSONL (default)
kothaset generate -n 100 --seed 42 -i topics.txt -f jsonl -o dataset.jsonl

# Parquet
kothaset generate -n 100 --seed 42 -i topics.txt -f parquet -o dataset.parquet

# HuggingFace datasets format
kothaset generate -n 100 --seed 42 -i topics.txt -f hf -o ./my_dataset
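
JSONL output stores one JSON object per line, so it can be consumed with the Python standard library alone. The reader below is a minimal sketch rather than part of KothaSet. It skips blank lines and reports the line number of any malformed record. The function name `read_jsonl` is hypothetical.

```python
import json


def read_jsonl(path):
    """Yield one parsed object per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                # Include the location so a bad record is easy to find.
                raise ValueError(
                    f"{path}:{line_number}: invalid JSON ({exc.msg})"
                ) from exc
```

Because the function is a generator, `sum(1 for _ in read_jsonl("dataset.jsonl"))` counts records without loading the whole file into memory.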

Advanced Options

# Use custom provider
kothaset generate -n 100 --seed 42 -i topics.txt -p local -o dataset.jsonl

# Control diversity with input file
kothaset generate -n 1000 --seed 42 -i topics.txt -o diverse.jsonl

# Resume interrupted generation
kothaset generate --resume dataset.jsonl.checkpoint

# Dry run (validate config)
kothaset generate --dry-run -n 100 --seed 42 -i topics.txt
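
The `--resume` option relies on a checkpoint file, and the feature list describes checkpointing as atomic. KothaSet's checkpoint format and implementation are not shown on this page. The Python sketch below only illustrates the general technique of writing to a temporary file and then renaming it over the target, so a crash never leaves a half-written checkpoint. The function names and the state keys used in the usage note are hypothetical.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write `state` as JSON so that readers never see a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(state, handle)
            handle.flush()
            os.fsync(handle.fileno())  # push the bytes to disk before renaming
        # os.replace swaps the file in one step when both paths share a filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path, default=None):
    """Return the saved state, or `default` when no checkpoint exists yet."""
    try:
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    except FileNotFoundError:
        return default
```

A generation loop could call `save_checkpoint("run.checkpoint", {"completed": 250, "seed": 42})` after each batch and start from `load_checkpoint("run.checkpoint", {"completed": 0})` on the next run. The temporary file is created in the target's directory because a rename across filesystems would not be atomic.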

Contributing

Contributions welcome! See CONTRIBUTING.md.

License

Apache 2.0 License. See LICENSE.

