Skip to main content

Generate high-quality LLM training datasets from documents with distillation and augmentation.

Project description

FastDatasets

Generate high-quality LLM training datasets from documents. Distillation, augmentation, multi-format export.

Install

pip install fastdatasets
# Optional extras:
# pip install 'fastdatasets[web]'   # Web UI / API
# pip install 'fastdatasets[doc]'   # Better doc parsing (textract)
# pip install 'fastdatasets[all]'   # Everything

Configure LLM

Use environment variables or pass parameters directly (function args override env):

export LLM_API_KEY="sk-..."
export LLM_API_BASE="https://api.example.com/v1"
export LLM_MODEL="your-model"

Quick Start (Python)

from fastdatasets import generate_dataset_to_dir

dataset = generate_dataset_to_dir(
  inputs=["./docs", "./data/sample.txt"],
  output_dir="./output",
  formats=["alpaca", "sharegpt"],
  file_format="jsonl",
  chunk_size=1000,
  chunk_overlap=200,
  enable_cot=False,
  max_llm_concurrency=5,
  # api_key="sk-...", api_base="https://api.example.com/v1", model_name="your-model",
)
print(len(dataset))

CLI

# Core usage
fastdatasets generate ./data -o ./output -f alpaca,sharegpt --file-format jsonl

# Override LLM just for this command
LLM_API_KEY=sk-xxx LLM_API_BASE=https://api.example.com/v1 LLM_MODEL=your-model \
  fastdatasets generate ./docs -o ./out

Optional Features

  • Web/API: pip install 'fastdatasets[web]' then run your web/app code
  • Better doc parsing (PDF/DOCX): pip install 'fastdatasets[doc]'

Links

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

fastdatasets_llm-0.1.3.tar.gz (53.4 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

fastdatasets_llm-0.1.3-py3-none-any.whl (57.2 kB view details)

Uploaded Python 3

File details

Details for the file fastdatasets_llm-0.1.3.tar.gz.

File metadata

  • Download URL: fastdatasets_llm-0.1.3.tar.gz
  • Upload date:
  • Size: 53.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.10.12

File hashes

Hashes for fastdatasets_llm-0.1.3.tar.gz
Algorithm Hash digest
SHA256 3763c13ddf63d6a47fd5a6b2509257c9b9659ce48a5714f4acc4aeb1535ed5a9
MD5 d6eec0146af98ce3420ed8bdb48ae6de
BLAKE2b-256 b3d625da548ac63d34dce1a4cc790ab9e5b15bb207db1623ec49015b8e83865d

See more details on using hashes here.

File details

Details for the file fastdatasets_llm-0.1.3-py3-none-any.whl.

File metadata

File hashes

Hashes for fastdatasets_llm-0.1.3-py3-none-any.whl
Algorithm Hash digest
SHA256 0551789c7a6501cbf3b7d2bf6de7737004bdc6280e01f8e7d936f9c4e6caafc2
MD5 ed8c844b89182d1e879c2fc0bd579443
BLAKE2b-256 9c0a48203bf916a267d0ea59e0a26d68a34d38b4704dd34263fefa097ec56cc0

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page