A Python package for synthesizing datasets for training and fine-tuning AI models.
tweaktune
tweaktune is a Rust-powered, Python-facing library designed to synthesize datasets for training and fine-tuning AI models, especially LMs (Language Models).
It allows you to easily build data pipelines, generate new examples using LLM APIs, and create structured datasets from a variety of sources.
Features
- Flexible Data Sources: Supports datasets from:
  - Parquet files
  - CSV files
  - JSONL files
  - Arrow datasets
  - OpenAPI specifications (for function calling datasets)
  - Lists of tools (Python functions for function calling datasets)
  - Pydantic models (for structured output datasets)
- LLM Integration: Connects to any LLM API to generate synthetic text or structured JSON.
- Dynamic Prompting: Supports Jinja templates for highly customizable prompts.
- Parallel Processing: Configure multiple workers to run your pipeline steps in parallel.
- Easy Pipeline Building: Compose steps like sampling, generating, writing, or debugging into a seamless pipeline.
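One of the data sources listed above is a list of Python functions used as tools for function calling datasets. The sketch below illustrates the general idea of turning a function into a tool description using only the standard library. It is not tweaktune's implementation: the `describe_tool` helper, the `get_weather` function, and the layout of the resulting dictionary are all hypothetical, and the format tweaktune actually produces may differ.

```python
import inspect
import json


def describe_tool(func):
    """Build a minimal, hypothetical tool description from a Python function."""
    signature = inspect.signature(func)
    parameters = {}
    for name, param in signature.parameters.items():
        annotation = param.annotation
        # Use the class name for plain type annotations, "any" otherwise.
        type_name = annotation.__name__ if isinstance(annotation, type) else "any"
        parameters[name] = {
            "type": type_name,
            # A parameter without a default value must be supplied by the caller.
            "required": param.default is inspect.Parameter.empty,
        }
    return {
        "name": func.__name__,
        "description": inspect.getdoc(func) or "",
        "parameters": parameters,
    }


def get_weather(city: str, days: int = 1) -> str:
    """Return a short weather forecast for a city."""
    return f"{days}-day forecast for {city}"


tool = describe_tool(get_weather)
print(json.dumps(tool, indent=2))
```

A description like this carries the function name, its docstring, and its typed parameters, which is the kind of information an LLM needs to produce a function call.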
Quick Example
Here's how you can build a dataset from a Parquet file and synthesize new data using an LLM API:
from tweaktune import Pipeline
import os
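# English gloss of the Polish prompt below: "Based on the text fragment below, describe
# a persona associated with it. Invent a fictional first and last name for the person.
# Write two sentences about this person; return the description in JSON format and add
# nothing else." ("FRAGMENT TEKSTU" means "TEXT FRAGMENT".)
persona_template = """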
Na podstawie poniższego fragmentu tekstu opisz personę która jest z nim związana.
Dla opisywanej osoby wymyśl fikcyjne imię i nazwisko.
Napisz dwa zdania na temat tej osoby, opis zwróć w formacie json, nie dodawaj nic więcej:
{"persona":"opis osoby"}
---
FRAGMENT TEKSTU:
{{article[0].text}}
"""
url = "http://localhost:8000/"
api_key = os.environ["API_KEY"]
model = "model"
p = Pipeline()\
.with_workers(5)\
.with_parquet_dataset("web_articles", "../../datasets/articles.pq")\
.with_openai_llm("bielik", url, api_key, model)\
.with_template("persona", persona_template)\
.with_template("output", """{"persona": {{persona|jstr}} }""")\
.iter_range(10000)\
.sample(dataset="web_articles", size=1, output="article")\
.generate_json(template="persona", llm="bielik", output="persona", json_path="persona")\
.write_jsonl(path="../../datasets/personas.jsonl", template="output")\
.run()
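The pipeline above writes one JSON object per line, shaped by the "output" template as `{"persona": ...}`. A minimal sketch for reading such a file back with the standard library follows. The `read_personas` helper is hypothetical, and the two sample lines are made-up stand-ins for real pipeline output.

```python
import json


def read_personas(lines):
    """Parse JSONL lines and collect the "persona" field of each record."""
    personas = []
    for line in lines:
        line = line.strip()
        if not line:
            # Skip blank lines, for example a trailing newline at end of file.
            continue
        record = json.loads(line)
        personas.append(record["persona"])
    return personas


# Made-up sample lines in the shape produced by the "output" template.
sample_lines = [
    '{"persona": "Anna Kowalska is a fictional local journalist."}',
    "",
    '{"persona": "Jan Nowak is a fictional history teacher."}',
]
personas = read_personas(sample_lines)
print(len(personas))
```

For a real run, pass an open file object instead of the sample list, for example `read_personas(open("../../datasets/personas.jsonl", encoding="utf-8"))`.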
Pipeline Steps
You can easily chain together multiple steps:
- sample() – sample items from a dataset
- read() – read entire dataset
- generate_text() – generate text using an LLM
- generate_json() – generate JSON output and extract a specific field
- write_jsonl() – write output to a JSONL file
- write_csv() – write output to a CSV file
- print() – print outputs
- debug() – enable detailed debugging
- log() – set log level
- python step – add custom Python-defined step classes
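To illustrate how chained steps can share data, here is a toy builder written in plain Python. It is a conceptual sketch only, not tweaktune's implementation or API: `MiniPipeline`, `step`, `pick_word`, and `shout` are hypothetical names. It mirrors the pattern in the Quick Example, where each step stores a named output that later steps can read.

```python
class MiniPipeline:
    """Toy builder: each method records configuration and returns self for chaining."""

    def __init__(self):
        self._steps = []
        self._iterations = 0

    def iter_range(self, n):
        # Number of times the whole chain of steps is executed.
        self._iterations = n
        return self

    def step(self, func):
        # Register a callable that reads from and writes to the shared context.
        self._steps.append(func)
        return self

    def run(self):
        results = []
        for i in range(self._iterations):
            context = {"index": i}
            for func in self._steps:
                func(context)
            results.append(context)
        return results


def pick_word(context):
    # Stand-in for a sampling step: stores its result under the "word" key.
    words = ["alpha", "beta", "gamma"]
    context["word"] = words[context["index"] % len(words)]


def shout(context):
    # Stand-in for a generation step: reads "word" and writes "text".
    context["text"] = context["word"].upper()


rows = (
    MiniPipeline()
    .iter_range(3)
    .step(pick_word)
    .step(shout)
    .run()
)
print(rows)
```

Returning `self` from every method is what makes the fluent, one-expression style possible; the shared context dictionary plays the role of the named outputs such as `output="article"`.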
Why tweaktune?
- Build synthetic datasets faster for fine-tuning models.
- Automate text, JSON, or structured data generation.
- Stay flexible: plug your own LLM API or use existing OpenAI-compatible ones.
- Rust speed, Python usability.
📦 Installation
pip install tweaktune
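After installing, one way to confirm that the package is visible to your interpreter is to query its installed version through the standard library. This is a small sketch; the `installed_version` helper is a hypothetical name, while `importlib.metadata` is part of Python 3.8 and later.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version string if tweaktune is installed, otherwise None.
print(installed_version("tweaktune"))
```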
File details
Details for the file tweaktune-0.0.1a1-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
File metadata
- Download URL: tweaktune-0.0.1a1-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Upload date:
- Size: 8.3 MB
- Tags: CPython 3.8+, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3d73f55220dfdb3c5f9aaf97270b0ac7136aa04ed982b721e203377462d9481d |
| MD5 | 367a2713a55b2b84409232a267815191 |
| BLAKE2b-256 | 554dd461b3027f17638d992064295fca37906e6260c58401f07f3a899fd4f52e |