High-quality dataset generation CLI for LLM training

KothaSet

KothaSet is a CLI tool that generates training datasets by using Large Language Models (LLMs) as teacher models. Use it to create diverse data for fine-tuning smaller models.

Features

  • Multi-Provider: OpenAI and OpenAI-compatible APIs (DeepSeek, vLLM, Ollama)
  • Flexible Schemas: Instruction (Alpaca), Chat (ShareGPT), Preference (DPO), Classification
  • Streaming Output: real-time generation with progress tracking
  • Resumable: atomic checkpointing, so an interrupted run can continue where it stopped
  • Multiple Formats: JSONL, native Parquet, HuggingFace datasets
  • Reproducible: a seed is required on every run so that generation can be repeated
  • Diversity Control: input files whose topics are covered sequentially
  • Validation: checks configs, schemas, datasets, and provider connectivity

Installation

pip (Python)

pip install kothaset

npm (Node.js)

npm install -g kothaset

Homebrew (macOS/Linux)

brew install shantoislamdev/tap/kothaset

Binary Download

Download from GitHub Releases.

From Source

go install github.com/shantoislamdev/kothaset/cmd/kothaset@latest

Quick Start

  1. Initialize configuration:

    kothaset init
    
  2. Set your API key:

    # Windows PowerShell
    $env:OPENAI_API_KEY = "sk-..."
    
    # Linux/macOS
    export OPENAI_API_KEY="sk-..."
    
  3. Generate a dataset:

    kothaset generate -n 100 -s instruction --seed 42 -i topics.txt -o dataset.jsonl
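
Step 3 expects a topics.txt file to exist already. The sketch below shows one way to prepare it and assemble the same command from Python. It assumes the input file holds one topic per line, which this README does not state explicitly; `write_topics` and `build_generate_command` are hypothetical helper names, and only the flags shown in step 3 are used.

```python
import shlex
from pathlib import Path


def write_topics(path, topics):
    """Write one topic per line (assumed input format), skipping blank entries."""
    cleaned = [t.strip() for t in topics if t.strip()]
    Path(path).write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return len(cleaned)


def build_generate_command(count, schema, seed, topics_path, output_path):
    """Assemble the generate command using only the flags shown in step 3."""
    return [
        "kothaset", "generate",
        "-n", str(count),
        "-s", schema,
        "--seed", str(seed),
        "-i", str(topics_path),
        "-o", str(output_path),
    ]


if __name__ == "__main__":
    n = write_topics("topics.txt", ["binary search", "  ", "HTTP caching", "unit testing"])
    cmd = build_generate_command(100, "instruction", 42, "topics.txt", "dataset.jsonl")
    print(f"wrote {n} topics")
    # Print the command instead of running it; pass cmd to subprocess.run to execute.
    print(shlex.join(cmd))
```

Building the command as a list keeps each argument separate, so topic file paths with spaces do not need manual quoting.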
    

Configuration

KothaSet splits its configuration across two files, keeping credentials separate from the settings that are safe to share:

1. kothaset.yaml (Public)

Contains shared settings, context, and instructions. Safe to commit to git.

version: "1.0"
global:
  provider: openai
  schema: instruction
  model: gpt-5.2
  concurrency: 4
  output_dir: ./output

# Context: Background info or persona injected into every prompt
context: |
  Generate high-quality training data for an AI assistant.
  The data should be helpful, accurate, and well-formatted.

# Instructions: Specific rules and guidelines for generation
instructions:
  - Be creative and diverse in topics and approaches
  - Vary the style and complexity of responses
  - Use clear and concise language

2. .secrets.yaml (Private)

Contains sensitive provider credentials. Add this to your .gitignore!

providers:
  - name: openai
    type: openai
    api_key: env.OPENAI_API_KEY  # Reads from environment variable
    # api_key: sk-...            # Or hardcode key directly
    timeout: 1m
    rate_limit:
      requests_per_minute: 60

  # Custom endpoint example (DeepSeek, vLLM)
  - name: local
    type: openai
    base_url: http://localhost:8000/v1
    api_key: not-needed
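
The `env.OPENAI_API_KEY` value above points at an environment variable rather than holding the key itself. The following is a minimal sketch of how such a reference could be resolved; it illustrates the convention only and is not KothaSet's actual implementation. `resolve_secret` is a hypothetical name, and treating an empty variable as missing is an assumption.

```python
import os

ENV_PREFIX = "env."  # prefix used in the example config: env.OPENAI_API_KEY


def resolve_secret(value, environ=None):
    """Return a literal value as-is, or look it up when it carries the env. prefix."""
    environ = os.environ if environ is None else environ
    if not value.startswith(ENV_PREFIX):
        return value  # hardcoded key or placeholder such as "not-needed"
    name = value[len(ENV_PREFIX):]
    if not environ.get(name):
        # Assumption: an unset or empty variable is an error, not an empty key.
        raise KeyError(f"environment variable {name} is not set")
    return environ[name]


if __name__ == "__main__":
    fake_env = {"OPENAI_API_KEY": "sk-example"}
    print(resolve_secret("env.OPENAI_API_KEY", fake_env))
    print(resolve_secret("not-needed", fake_env))
```

Failing loudly on a missing variable surfaces a misconfigured shell before any request is sent to the provider.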

Usage

Selecting a Schema

Schema         | Description                               | Use Case
---------------|-------------------------------------------|-----------------
instruction    | Alpaca-style {instruction, input, output} | SFT
chat           | ShareGPT multi-turn conversations         | Chat fine-tuning
preference     | {prompt, chosen, rejected} pairs          | DPO/RLHF
classification | {text, label} pairs                       | Classifiers

# Instruction dataset
kothaset generate -n 1000 -s instruction --seed 42 -i topics.txt -o instructions.jsonl

# Chat conversations
kothaset generate -n 500 -s chat --seed 123 -i conversations.txt -o conversations.jsonl

# Preference pairs for DPO  
kothaset generate -n 500 -s preference --seed 456 -i pairs.txt -o dpo_data.jsonl
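
To make the schema shapes concrete, here is a small sketch that checks JSONL records for the fields listed in the schema table. The field names for instruction, preference, and classification come from that table; the `conversations` key for chat is an assumption based on the common ShareGPT layout and may not match KothaSet's real output. `REQUIRED_FIELDS` and both functions are hypothetical names.

```python
import json

# Field names from the schema table. The chat entry is an assumption
# (common ShareGPT convention), not confirmed by this README.
REQUIRED_FIELDS = {
    "instruction": {"instruction", "input", "output"},
    "preference": {"prompt", "chosen", "rejected"},
    "classification": {"text", "label"},
    "chat": {"conversations"},
}


def missing_fields(record, schema):
    """Return the required field names absent from one record, sorted."""
    return sorted(REQUIRED_FIELDS[schema] - set(record))


def check_jsonl(lines, schema):
    """Return (line_number, missing_fields) for every incomplete record."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        missing = missing_fields(json.loads(line), schema)
        if missing:
            problems.append((number, missing))
    return problems


if __name__ == "__main__":
    sample = [
        '{"prompt": "Explain recursion", "chosen": "A clear answer", "rejected": "A vague answer"}',
        '{"prompt": "Explain recursion", "chosen": "A clear answer"}',
    ]
    print(check_jsonl(sample, "preference"))
```

A check like this can run on a generated file before training starts, so that an incomplete record is caught early.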

Output Formats

# JSONL (default)
kothaset generate -n 100 --seed 42 -i topics.txt -f jsonl -o dataset.jsonl

# Parquet
kothaset generate -n 100 --seed 42 -i topics.txt -f parquet -o dataset.parquet

# HuggingFace datasets format
kothaset generate -n 100 --seed 42 -i topics.txt -f hf -o ./my_dataset
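
JSONL stores one JSON object per line, so the default output can be read with the Python standard library alone. The sketch below streams such a file and compares the record count with the number requested via `-n`. It assumes UTF-8 encoding and one record per line; `iter_jsonl` and `count_matches` are hypothetical helpers.

```python
import json
import os
import tempfile


def iter_jsonl(path):
    """Yield one parsed record per non-empty line, without loading the whole file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


def count_matches(path, expected):
    """Return (actual_count, actual_count == expected)."""
    actual = sum(1 for _ in iter_jsonl(path))
    return actual, actual == expected


if __name__ == "__main__":
    # Write a tiny stand-in file; a real run would point at dataset.jsonl.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "dataset.jsonl")
        with open(path, "w", encoding="utf-8") as handle:
            for i in range(3):
                handle.write(json.dumps({"instruction": f"task {i}", "input": "", "output": "..."}) + "\n")
        print(count_matches(path, 3))
```

Streaming line by line keeps memory use flat even for large datasets.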

Advanced Options

# Use custom provider
kothaset generate -n 100 --seed 42 -i topics.txt -p local -o dataset.jsonl

# Control diversity with input file
kothaset generate -n 1000 --seed 42 -i topics.txt -o diverse.jsonl

# Resume interrupted generation
kothaset generate --resume dataset.jsonl.checkpoint

# Dry run (validate config)
kothaset generate --dry-run -n 100 --seed 42 -i topics.txt
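
Resuming relies on a checkpoint that is never left half-written. A common way to get that property is to write to a temporary file and then rename it over the real one. The sketch below shows that general technique; it is not KothaSet's implementation, and the checkpoint contents (a `completed` counter) are an assumption made for illustration.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file in the same directory, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(state, handle)
            handle.flush()
            os.fsync(handle.fileno())  # push the bytes to disk before the rename
        # Rename within one filesystem: readers see the old or the new file, never a mix.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path):
    """Return the saved state, or a fresh state when no checkpoint exists."""
    if not os.path.exists(path):
        return {"completed": 0}  # hypothetical state layout
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        checkpoint = os.path.join(tmp, "dataset.jsonl.checkpoint")
        print(load_checkpoint(checkpoint))
        save_checkpoint(checkpoint, {"completed": 40})
        print(load_checkpoint(checkpoint))
```

Creating the temporary file in the same directory as the target matters, because a rename across filesystems is not atomic.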

Documentation

Getting Started

Reference

Help


Contributing

Contributions welcome! See CONTRIBUTING.md.

License

Apache 2.0 License. See LICENSE.
