High-quality dataset generation CLI for LLM training

KothaSet

KothaSet is a CLI tool that generates datasets using Large Language Models (LLMs) as teacher models. Use it to create diverse training data for fine-tuning smaller models.

Features

  • Multi-Provider: OpenAI and OpenAI-compatible APIs (DeepSeek, vLLM, Ollama)
  • Flexible Schemas: Instruction (Alpaca), Chat (ShareGPT), Preference (DPO), Classification
  • Streaming Output: real-time generation with progress tracking
  • Resumable: atomic checkpointing, so interrupted runs keep their progress
  • Multiple Formats: JSONL, native Parquet, HuggingFace datasets
  • Reproducible: a seed is required, for deterministic LLM generation
  • Diversity Control: input files for sequential topic coverage
  • Validation: validates configs, schemas, datasets, and provider connectivity

Installation

pip (Python)

pip install kothaset

npm (Node.js)

npm install -g kothaset

Homebrew (macOS/Linux)

brew install shantoislamdev/tap/kothaset

Binary Download

Download from GitHub Releases.

From Source

go install github.com/shantoislamdev/kothaset/cmd/kothaset@latest

Quick Start

  1. Initialize configuration:

    kothaset init
    
  2. Set your API key:

    # Windows PowerShell
    $env:OPENAI_API_KEY = "sk-..."
    
    # Linux/macOS
    export OPENAI_API_KEY="sk-..."
    
  3. Generate a dataset:

    kothaset generate -n 100 -s instruction --seed 42 -i topics.txt -o dataset.jsonl
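
The -i flag in step 3 points at an input file of topics. The file format is not spelled out here, so the sketch below assumes plain text with one topic per line, and shows one way sequential coverage could cycle through the topics when -n is larger than the topic count. The function names are hypothetical and this is not KothaSet's own implementation.

```python
from pathlib import Path


def load_topics(path):
    """Read one topic per line, skipping blank lines (assumed file format)."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def topic_for_sample(topics, index):
    """Walk the topics in file order, wrapping around when samples outnumber topics."""
    return topics[index % len(topics)]


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "topics.txt"
        path.write_text("Python decorators\n\nSQL joins\nHTTP caching\n", encoding="utf-8")
        topics = load_topics(path)
        for i in range(5):
            print(i, topic_for_sample(topics, i))
```

Under this assumption, a three-topic file with -n 5 would visit the topics in order and then start again from the first one.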
    

Configuration

KothaSet uses a two-file configuration system for better security and organization:

1. kothaset.yaml (Public)

Contains shared settings, context, and instructions. Safe to commit to git.

version: "1.0"
global:
  provider: openai
  schema: instruction
  model: gpt-5.2
  concurrency: 4
  output_dir: ./output
  checkpoint_every: 10  # Save checkpoint every N samples (default: 10)

# Context: Background info or persona injected into every prompt
context: |
  Generate high-quality training data for an AI assistant.
  The data should be helpful, accurate, and well-formatted.

# Instructions: Specific rules and guidelines for generation
instructions:
  - Be creative and diverse in topics and approaches
  - Vary the style and complexity of responses
  - Use clear and concise language
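
To make the roles of context and instructions concrete, here is a minimal Python sketch of how the two fields might be combined with a topic into a single prompt. The build_prompt function and the prompt layout are assumptions for illustration only; the actual template KothaSet sends to the provider may differ.

```python
def build_prompt(context, instructions, topic):
    """Combine the shared context, numbered rules, and one topic into a prompt string."""
    rules = "\n".join(f"{i}. {rule}" for i, rule in enumerate(instructions, start=1))
    return f"{context.strip()}\n\nGuidelines:\n{rules}\n\nTopic: {topic}"


if __name__ == "__main__":
    context = (
        "Generate high-quality training data for an AI assistant.\n"
        "The data should be helpful, accurate, and well-formatted.\n"
    )
    instructions = [
        "Be creative and diverse in topics and approaches",
        "Vary the style and complexity of responses",
        "Use clear and concise language",
    ]
    print(build_prompt(context, instructions, "SQL joins"))
```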

2. .secrets.yaml (Private)

Contains sensitive provider credentials. Add this to your .gitignore!

providers:
  - name: openai
    type: openai
    api_key: env.OPENAI_API_KEY  # Reads from environment variable
    # api_key: sk-...            # Or hardcode key directly
    timeout: 1m
    rate_limit:
      requests_per_minute: 60

  # Custom endpoint example (DeepSeek, vLLM)
  - name: local
    type: openai
    base_url: http://localhost:8000/v1
    api_key: not-needed
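
The api_key field accepts either a literal key or an env.NAME reference, and rate_limit caps requests per minute. The sketch below shows one plausible way to interpret both values in Python. The helper names and the error behavior are assumptions, not KothaSet's actual code.

```python
import os


def resolve_api_key(value, environ=os.environ):
    """Return a literal key as-is, or look it up when written as env.VARIABLE_NAME."""
    prefix = "env."
    if value.startswith(prefix):
        name = value[len(prefix):]
        if name not in environ:
            raise KeyError(f"environment variable {name} is not set")
        return environ[name]
    return value


def min_interval_seconds(requests_per_minute):
    """Smallest gap between requests that stays within the per-minute limit."""
    return 60.0 / requests_per_minute


if __name__ == "__main__":
    fake_env = {"OPENAI_API_KEY": "sk-example"}
    print(resolve_api_key("env.OPENAI_API_KEY", fake_env))
    print(resolve_api_key("not-needed", fake_env))
    print(min_interval_seconds(60))
```

Keeping the env. form in .secrets.yaml means the file never contains the key itself, which limits the damage if it is committed by mistake.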

Usage

Selecting a Schema

| Schema | Description | Use Case |
|---|---|---|
| instruction | Alpaca-style {instruction, input, output} | SFT |
| chat | ShareGPT multi-turn conversations | Chat fine-tuning |
| preference | {prompt, chosen, rejected} pairs | DPO/RLHF |
| classification | {text, label} pairs | Classifiers |

# Instruction dataset
kothaset generate -n 1000 -s instruction --seed 42 -i topics.txt -o instructions.jsonl

# Chat conversations
kothaset generate -n 500 -s chat --seed 123 -i conversations.txt -o conversations.jsonl

# Preference pairs for DPO  
kothaset generate -n 500 -s preference --seed 456 -i pairs.txt -o dpo_data.jsonl
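
The field sets listed for the instruction, preference, and classification schemas can be turned into a quick sanity check on generated records. This is a minimal sketch in Python: the REQUIRED_FIELDS mapping copies the field names from the schema descriptions above, while missing_fields is a hypothetical helper. The chat schema is left out because its exact field names are not given here.

```python
REQUIRED_FIELDS = {
    "instruction": {"instruction", "input", "output"},
    "preference": {"prompt", "chosen", "rejected"},
    "classification": {"text", "label"},
}


def missing_fields(schema, record):
    """Return the required field names that are absent from a record, sorted."""
    return sorted(REQUIRED_FIELDS[schema] - set(record))


if __name__ == "__main__":
    complete = {"prompt": "Explain recursion", "chosen": "A clear answer", "rejected": "A vague answer"}
    incomplete = {"instruction": "Summarize the text", "output": "A short summary"}
    print(missing_fields("preference", complete))
    print(missing_fields("instruction", incomplete))
```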

Output Formats

# JSONL (default)
kothaset generate -n 100 --seed 42 -i topics.txt -f jsonl -o dataset.jsonl

# Parquet (native binary parquet output)
kothaset generate -n 100 --seed 42 -i topics.txt -f parquet -o dataset.parquet

# HuggingFace datasets format
kothaset generate -n 100 --seed 42 -i topics.txt -f hf -o ./my_dataset
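
JSONL stores one JSON object per line, so the default output can be read back with Python's standard library alone. The sketch below assumes instruction-schema records and uses a throwaway file in place of real KothaSet output. Reading the Parquet and HuggingFace outputs needs third-party libraries and is not shown.

```python
import json


def read_jsonl(path):
    """Yield one parsed record per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


if __name__ == "__main__":
    import os
    import tempfile

    sample = [
        {"instruction": "Define recursion", "input": "", "output": "A function calling itself."},
        {"instruction": "Name a sorting algorithm", "input": "", "output": "Merge sort."},
    ]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "dataset.jsonl")
        with open(path, "w", encoding="utf-8") as handle:
            for record in sample:
                handle.write(json.dumps(record) + "\n")
        for record in read_jsonl(path):
            print(record["instruction"])
```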

Advanced Options

# Use custom provider
kothaset generate -n 100 --seed 42 -i topics.txt -p local -o dataset.jsonl

# Control diversity with input file
kothaset generate -n 1000 --seed 42 -i topics.txt -o diverse.jsonl

# Resume interrupted generation
kothaset generate --resume .kothaset/dataset.jsonl.checkpoint

# Dry run (validate config)
kothaset generate --dry-run -n 100 --seed 42 -i topics.txt
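
Resuming depends on checkpoints that are never left half-written. A common way to get that property is to write to a temporary file and then rename it over the old checkpoint, sketched below in Python. The checkpoint contents (a "completed" counter) and both function names are hypothetical; KothaSet's real checkpoint format is not documented in this section.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file in the same directory, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(state, handle)
            handle.flush()
            os.fsync(handle.fileno())
        # Readers see either the old file or the new one (atomic on POSIX).
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise


def load_checkpoint(path):
    """Return the saved state, or a fresh state when no checkpoint exists yet."""
    if not os.path.exists(path):
        return {"completed": 0}
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "dataset.jsonl.checkpoint")
        print(load_checkpoint(path))
        save_checkpoint(path, {"completed": 10})
        print(load_checkpoint(path))
```

With checkpoint_every: 10, a crash would cost at most the samples generated since the last successful rename.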

Contributing

Contributions welcome! See CONTRIBUTING.md.

License

Apache 2.0 License. See LICENSE.

Download files

Download the file for your platform.

Source Distributions

No source distribution files are available for this release.

Built Distributions

  • kothaset-1.1.1-py3-none-win_amd64.whl (4.4 MB): Python 3, Windows x86-64
  • kothaset-1.1.1-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.3 MB): Python 3, manylinux: glibc 2.17+ x86-64
  • kothaset-1.1.1-py3-none-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (3.9 MB): Python 3, manylinux: glibc 2.17+ ARM64
  • kothaset-1.1.1-py3-none-macosx_11_0_arm64.whl (4.0 MB): Python 3, macOS 11.0+ ARM64
  • kothaset-1.1.1-py3-none-macosx_10_12_x86_64.whl (4.4 MB): Python 3, macOS 10.12+ x86-64

File details

kothaset-1.1.1-py3-none-win_amd64.whl

  • Size: 4.4 MB
  • Tags: Python 3, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7
  • SHA256: 66a25635302a672f5dae9d4ced9f5530ea31edd43803487be3813875ca1e9b75
  • MD5: e13940089661fa3b6d18e6cda620c785
  • BLAKE2b-256: 413dffbeca0e01323516860adb87f49b8387f9c7589051dc7477f2e73946afcf

kothaset-1.1.1-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

  • SHA256: 407a1e5713b7df1c2291ad3fe8aac896d061c008b08b3d252a6cc2979fb4e1f4
  • MD5: 04f13be887293243a233d53cc9075b93
  • BLAKE2b-256: ad1b313d3ecc0328b83a0cce700d6a92ebd236e17a6e13e0d6c9a7fceb5442d7

kothaset-1.1.1-py3-none-manylinux_2_17_aarch64.manylinux2014_aarch64.whl

  • SHA256: 7fa9bfa4156da79a52c55ece8510eb6c4d9431c37c323f0008b6d0a1855b7aa9
  • MD5: 22da1a8c72d3e97c37340c419be012b9
  • BLAKE2b-256: 9e234254afb73b29a60c0378b5a3e62c7c3a716a1349bb9b08dbb779d33bae63

kothaset-1.1.1-py3-none-macosx_11_0_arm64.whl

  • SHA256: a80f3c31ec1834e773fbe8e4b83c1f290d50e90fb222a5650b3ee810c11f9bdf
  • MD5: f94917965c0b8524cda5c1482da966bf
  • BLAKE2b-256: a9ac1e30e7111f6038b02d709df309de83220ef28afb169f2ae8575a4fb5b3b4

kothaset-1.1.1-py3-none-macosx_10_12_x86_64.whl

  • SHA256: f2bb4e777f252bed42a3fedb8cfbd222fad747983ff4e8519b6cb57458756816
  • MD5: cbc3a6f8b0abf434b3a98843104b92a4
  • BLAKE2b-256: 82b905c3e969ab472ae7117baceefa73331fcbd192406be713cb64bea10bcc5f

Provenance

Attestation bundles were made for each of the five files above. Publisher: release.yml on shantoislamdev/kothaset.

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
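
The SHA256 digests listed above can be compared against a downloaded wheel before installing it. The following is a minimal sketch using Python's hashlib. The function names are illustrative, and the demo hashes a small temporary file instead of a real wheel.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large downloads need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    import os
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.bin")
        with open(path, "wb") as handle:
            handle.write(b"abc")
        print(sha256_of(path))
        print(matches(path, hashlib.sha256(b"abc").hexdigest()))
```

For a real check, pass the path of the downloaded .whl file and the SHA256 value published for that exact file name.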
