A Python package for synthesizing datasets for training and fine-tuning AI models.

These details have not been verified by PyPI

Project links

repository

Project description

tweaktune

tweaktune is a Rust-powered, Python-facing library designed to synthesize datasets for training and fine-tuning AI models, especially language models (LMs).
It allows you to easily build data pipelines, generate new examples using LLM APIs, and create structured datasets from a variety of sources.

Features

- Flexible Data Sources: supports datasets from:
  - Parquet files
  - CSV files
  - JSONL files
  - Arrow datasets
  - OpenAPI specifications (for function calling datasets)
  - Lists of tools (Python functions for function calling datasets)
  - Pydantic models (for structured output datasets)
- LLM Integration: connects to any LLM API to generate synthetic text or structured JSON.
- Dynamic Prompting: supports Jinja templates for highly customizable prompts.
- Parallel Processing: configure multiple workers to run your pipeline steps in parallel.
- Easy Pipeline Building: compose steps like sampling, generating, writing, or debugging into a seamless pipeline.

Quick Example

Here's how you can build a dataset from a Parquet file and synthesize new data using an LLM API:

from tweaktune import Pipeline
import os

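# The prompt is written in Polish. In English it reads: "Based on the text fragment
# below, describe a persona associated with it. Invent a fictional first and last name
# for that person. Write two sentences about them and return the description as JSON,
# adding nothing else." The final marker line means "TEXT FRAGMENT:".
persona_template = """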
Na podstawie poniższego fragmentu tekstu opisz personę która jest z nim związana.
Dla opisywanej osoby wymyśl fikcyjne imię i nazwisko.
Napisz dwa zdania na temat tej osoby, opis zwróć w formacie json, nie dodawaj nic więcej:
{"persona":"opis osoby"}

---
FRAGMENT TEKSTU:

{{article[0].text}}
"""

url = "http://localhost:8000/"
api_key = os.environ["API_KEY"]
model = "model"

p = Pipeline()\
    .with_workers(5)\
    .with_parquet_dataset("web_articles", "../../datasets/articles.pq")\
    .with_llm_api("bielik", url, api_key, model)\
    .with_template("persona", persona_template)\
    .with_template("output", """{"persona": {{persona|jstr}} }""")\
    .iter_range(10000)\
        .sample(dataset="web_articles", size=1, output="article")\
        .generate_json(template="persona", llm="bielik", output="persona", json_path="persona")\
        .write_jsonl(path="../../datasets/personas.jsonl", template="output")\
    .run()
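
The `output` template above passes the generated persona through a `jstr` filter before each JSONL line is written. Assuming that `jstr` serializes the value as a JSON string literal (its exact behavior is not documented here), the standard-library sketch below shows why that escaping matters and how the resulting file could be read back. `render_line` and `read_personas` are hypothetical helpers for illustration, not part of tweaktune.

```python
import json
from pathlib import Path


def render_line(persona: str) -> str:
    # json.dumps adds the surrounding quotes and escapes quotes and newlines,
    # which is the role the `jstr` filter is assumed to play in the template.
    return '{"persona": ' + json.dumps(persona, ensure_ascii=False) + " }"


def read_personas(path):
    """Return the "persona" field from every non-empty line of a JSONL file."""
    personas = []
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                personas.append(json.loads(line)["persona"])
    return personas
```

Without escaping, a persona that contains a double quote or a line break would produce a line that is not valid JSON, so a reader like `read_personas` would fail on it.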

Pipeline Steps

You can easily chain together multiple steps:

sample() – sample items from a dataset
read() – read entire dataset
generate_text() – generate text using an LLM
generate_json() – generate JSON output and extract a specific field
write_jsonl() – write output to a JSONL file
write_csv() – write output to a CSV file
print() – print outputs
debug() – enable detailed debugging
log() – set log level
python step – add custom Python-defined step classes
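
In the quick example, `sample(..., size=1, output="article")` is followed by a template that reads `article[0].text`. This suggests that each step stores its result under a named key and that a sample is a list even when its size is 1. The plain-Python sketch below mimics that data flow with a dict passed from step to step. It only illustrates the idea and is not tweaktune's implementation: the function names and the context dict are hypothetical, and the LLM call is replaced by a stub.

```python
import random


def sample_step(ctx, dataset, size, output, rng):
    # Store `size` randomly chosen rows under ctx[output]; the value is always a list.
    ctx[output] = rng.sample(dataset, size)
    return ctx


def generate_step(ctx, output, fn):
    # Stand-in for an LLM call: fn receives the context and returns text.
    ctx[output] = fn(ctx)
    return ctx


def run_once(dataset, rng):
    ctx = {}
    sample_step(ctx, dataset, size=1, output="article", rng=rng)
    # The sampled value is a one-element list, hence the [0] index.
    generate_step(ctx, "persona", lambda c: "Persona for: " + c["article"][0]["text"])
    return ctx


articles = [{"text": "Article about beekeeping."}, {"text": "Article about sailing."}]
result = run_once(articles, random.Random(0))
print(result["persona"])
```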

Why tweaktune?

Build synthetic datasets faster for fine-tuning models.
Automate text, JSON, or structured data generation.
Stay flexible: plug in your own LLM API or use existing OpenAI-compatible ones.
Rust speed, Python usability.

📦 Installation

pip install tweaktune
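
The only file listed for release 0.0.1a8 on this page is a CPython 3.8+ (abi3) wheel for manylinux x86-64 with glibc 2.17+, and no source distribution is available, so `pip` may not find a matching file on other platforms. A small check along these lines can confirm the basics before installing. `wheel_probably_compatible` is a hypothetical helper; it does not check the glibc version, and other releases may ship different wheels.

```python
import platform
import sys


def wheel_probably_compatible(impl, version, system, machine):
    """Rough check against the cp38-abi3 manylinux x86_64 wheel tags."""
    return (
        impl == "CPython"
        and tuple(version[:2]) >= (3, 8)
        and system == "Linux"
        and machine == "x86_64"
    )


if __name__ == "__main__":
    ok = wheel_probably_compatible(
        platform.python_implementation(),
        sys.version_info,
        platform.system(),
        platform.machine(),
    )
    print("wheel looks installable" if ok else "no matching wheel listed for this platform")
```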

🤝 Contributing

We welcome contributions! Feel free to open issues, suggest features, or create pull requests.

Please note that by contributing to this project, you agree to the terms of the Contributor License Agreement (CLA).

Project details

Release history

0.0.1a12 (pre-release), Oct 17, 2025
0.0.1a11 (pre-release), Oct 17, 2025
0.0.1a10 (pre-release), Aug 15, 2025
0.0.1a9 (pre-release), Jun 16, 2025
0.0.1a8 (pre-release), Jun 15, 2025 (this version)
0.0.1a7 (pre-release), Jun 14, 2025
0.0.1a6 (pre-release), May 1, 2025
0.0.1a5 (pre-release), Apr 29, 2025
0.0.1a4 (pre-release), Apr 23, 2025
0.0.1a3 (pre-release), Apr 21, 2025
0.0.1a2 (pre-release), Apr 21, 2025
0.0.1a1 (pre-release), Apr 21, 2025

Download files

Source Distributions

No source distribution files are available for this release.

Built Distribution

tweaktune-0.0.1a8-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (32.5 MB)

Uploaded Jun 15, 2025. CPython 3.8+, manylinux: glibc 2.17+ x86-64.

File details

Details for the file tweaktune-0.0.1a8-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

Download URL: tweaktune-0.0.1a8-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Upload date: Jun 15, 2025
Size: 32.5 MB
Tags: CPython 3.8+, manylinux: glibc 2.17+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.11.12

File hashes

Hashes for tweaktune-0.0.1a8-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm	Hash digest
SHA256	`af33874f23cc1ad543eec1a9c119c08267c0585117dc07148bc734512dcc7f70`
MD5	`5ab35c221adfc17b005f7232ad67a9bb`
BLAKE2b-256	`6cf1c3de0a380b8753ca115889c152ae6f5ae87dd3b4c5c48b190de5cae2d00e`
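
To check a downloaded wheel against the SHA256 digest listed above, the file can be hashed locally with Python's `hashlib`. This is a minimal sketch: `sha256_of` is a hypothetical helper, and the wheel path assumes the file sits in the current directory.

```python
import hashlib
from pathlib import Path

EXPECTED_SHA256 = "af33874f23cc1ad543eec1a9c119c08267c0585117dc07148bc734512dcc7f70"
WHEEL = Path("tweaktune-0.0.1a8-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl")


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so a large wheel need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Only compare when the wheel has actually been downloaded (assumed location).
if WHEEL.exists():
    print("hash matches" if sha256_of(WHEEL) == EXPECTED_SHA256 else "hash MISMATCH")
```

`pip` can also enforce hashes itself through its hash-checking mode when the digest is pinned in a requirements file.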

tweaktune 0.0.1a8
