Generate high-quality LLM training datasets from documents with distillation and augmentation.
Project description
FastDatasets
Generate high-quality LLM training datasets from documents. Distillation, augmentation, multi-format export.
Install
pip install fastdatasets
# Optional extras:
# pip install 'fastdatasets[web]' # Web UI / API
# pip install 'fastdatasets[doc]' # Better doc parsing (textract)
# pip install 'fastdatasets[all]' # Everything
Configure LLM
Use environment variables or pass parameters directly (function args override env):
export LLM_API_KEY="sk-..."
export LLM_API_BASE="https://api.example.com/v1"
export LLM_MODEL="your-model"
Quick Start (Python)
from fastdatasets import generate_dataset_to_dir
dataset = generate_dataset_to_dir(
inputs=["./docs", "./data/sample.txt"],
output_dir="./output",
formats=["alpaca", "sharegpt"],
file_format="jsonl",
chunk_size=1000,
chunk_overlap=200,
enable_cot=False,
max_llm_concurrency=5,
# api_key="sk-...", api_base="https://api.example.com/v1", model_name="your-model",
)
print(len(dataset))
CLI
# Core usage
fastdatasets generate ./data -o ./output -f alpaca,sharegpt --file-format jsonl
# Override LLM just for this command
LLM_API_KEY=sk-xxx LLM_API_BASE=https://api.example.com/v1 LLM_MODEL=your-model \
fastdatasets generate ./docs -o ./out
Optional Features
- Web/API:
pip install 'fastdatasets[web]'then run your web/app code - Better doc parsing (PDF/DOCX):
pip install 'fastdatasets[doc]'
Links
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
fastdatasets_llm-0.1.3.tar.gz
(53.4 kB
view details)
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file fastdatasets_llm-0.1.3.tar.gz.
File metadata
- Download URL: fastdatasets_llm-0.1.3.tar.gz
- Upload date:
- Size: 53.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
3763c13ddf63d6a47fd5a6b2509257c9b9659ce48a5714f4acc4aeb1535ed5a9
|
|
| MD5 |
d6eec0146af98ce3420ed8bdb48ae6de
|
|
| BLAKE2b-256 |
b3d625da548ac63d34dce1a4cc790ab9e5b15bb207db1623ec49015b8e83865d
|
File details
Details for the file fastdatasets_llm-0.1.3-py3-none-any.whl.
File metadata
- Download URL: fastdatasets_llm-0.1.3-py3-none-any.whl
- Upload date:
- Size: 57.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
0551789c7a6501cbf3b7d2bf6de7737004bdc6280e01f8e7d936f9c4e6caafc2
|
|
| MD5 |
ed8c844b89182d1e879c2fc0bd579443
|
|
| BLAKE2b-256 |
9c0a48203bf916a267d0ea59e0a26d68a34d38b4704dd34263fefa097ec56cc0
|