A Python package for synthesizing datasets for training and fine-tuning AI models.

tweaktune

tweaktune is a Rust-powered, Python-facing library for synthesizing datasets used to train and fine-tune AI models, especially language models (LMs).
It allows you to easily build data pipelines, generate new examples using LLM APIs, and create structured datasets from a variety of sources.


Features

  • Flexible Data Sources:
    Supports datasets from:

    • Parquet files
    • CSV files
    • JSONL files
    • Arrow datasets
    • OpenAPI specifications (for function calling datasets)
    • Lists of tools (Python functions for function calling datasets)
    • Pydantic models (for structured output datasets)
  • LLM Integration:
    Connects to any LLM API to generate synthetic text or structured JSON.

  • Dynamic Prompting:
    Supports Jinja templates for highly customizable prompts.

  • Parallel Processing:
    Configure multiple workers to run your pipeline steps in parallel.

  • Easy Pipeline Building:
    Compose steps like sampling, generating, writing, or debugging into a seamless pipeline.
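
To make the "Python functions for function calling datasets" source more concrete, here is a minimal sketch of how a plain Python function can be turned into a JSON tool description using only the standard library. This is not tweaktune's implementation: `describe_tool`, `JSON_TYPES`, and `get_weather` are hypothetical names introduced for illustration, and the output layout is just one common function-calling shape.

```python
import inspect
import json
from typing import get_type_hints

# Hypothetical mapping from Python annotations to JSON Schema type names.
JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def describe_tool(func):
    """Build a function-calling style description of `func` (illustrative only)."""
    hints = get_type_hints(func)
    properties = {}
    required = []
    for name, param in inspect.signature(func).parameters.items():
        # Unannotated or unknown types fall back to "string" in this sketch.
        properties[name] = {"type": JSON_TYPES.get(hints.get(name), "string")}
        # Parameters without a default value are treated as required.
        if param.default is inspect.Parameter.empty:
            required.append(name)
    return {
        "name": func.__name__,
        "description": inspect.getdoc(func) or "",
        "parameters": {"type": "object", "properties": properties, "required": required},
    }


def get_weather(city: str, days: int = 1) -> str:
    """Return a short weather forecast for a city."""
    return f"Forecast for {city} for the next {days} day(s)"


tool = describe_tool(get_weather)
print(json.dumps(tool, indent=2))
```

A description like this is the kind of structured input a function-calling dataset is built around: the model sees the tool's name, docstring, and parameters, and is asked to produce a matching call.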


Quick Example

Here's how you can build a dataset from a Parquet file and synthesize new data using an LLM API:

from tweaktune import Pipeline
import os

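# The prompt is written in Polish. In English: "Based on the text fragment below,
# describe a persona associated with it. Invent a fictional first and last name for
# that person. Write two sentences about the person; return the description as JSON
# and add nothing else." The marker "FRAGMENT TEKSTU" means "TEXT FRAGMENT".
persona_template = """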
Na podstawie poniższego fragmentu tekstu opisz personę która jest z nim związana.
Dla opisywanej osoby wymyśl fikcyjne imię i nazwisko.
Napisz dwa zdania na temat tej osoby, opis zwróć w formacie json, nie dodawaj nic więcej:
{"persona":"opis osoby"}

---
FRAGMENT TEKSTU:

{{article[0].text}}
"""

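url = "http://localhost:8000/"  # base URL of the LLM API server
api_key = os.environ["API_KEY"]  # read from the environment; raises KeyError if unset
model = "model"  # model name as exposed by the server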

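p = (
    Pipeline()
    .with_workers(5)  # run the steps with 5 parallel workers
    .with_parquet_dataset("web_articles", "../../datasets/articles.pq")
    .with_openai_llm("bielik", url, api_key, model)
    .with_template("persona", persona_template)
    # "output" renders each JSONL line from the generated persona
    .with_template("output", """{"persona": {{persona|jstr}} }""")
    .iter_range(10000)  # 10,000 iterations of the steps below
    # draw one article per iteration; the prompt reads it as article[0]
    .sample(dataset="web_articles", size=1, output="article")
    # call the LLM and keep the "persona" field of its JSON reply
    .generate_json(template="persona", llm="bielik", output="persona", json_path="persona")
    .write_jsonl(path="../../datasets/personas.jsonl", template="output")
    .run()
)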
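
For readers who want a mental model of the pipeline above, the following sketch mimics the sample, generate, and write loop in plain Python. It is a conceptual illustration under stated assumptions, not how tweaktune works internally: the in-memory `articles` list stands in for the Parquet dataset, `fake_llm` is a hypothetical stub replacing the real API call, and the loop is sequential rather than parallel.

```python
import io
import json
import random

# Stand-in for the Parquet dataset; the rows are made up for illustration.
articles = [{"text": "Article about bees."}, {"text": "Article about trains."}]


def fake_llm(prompt):
    """Hypothetical stub that replaces the real LLM call and returns JSON text."""
    return json.dumps({"persona": f"Persona derived from a prompt of {len(prompt)} characters."})


def run_pipeline(n_iterations, out, seed=0):
    rng = random.Random(seed)
    for _ in range(n_iterations):
        # Sampling step: draw a list holding one row from the dataset.
        article = rng.sample(articles, 1)
        # Generation step: build the prompt, call the model, keep one field.
        prompt = "TEXT FRAGMENT:\n\n" + article[0]["text"]
        persona = json.loads(fake_llm(prompt))["persona"]
        # Writing step: emit one JSON object per line (JSONL).
        out.write(json.dumps({"persona": persona}, ensure_ascii=False) + "\n")


buffer = io.StringIO()
run_pipeline(3, buffer)
lines = buffer.getvalue().splitlines()
print(lines[0])
```

Writing to an in-memory buffer keeps the sketch free of side effects; swapping it for an open file handle would produce an actual JSONL file.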

Pipeline Steps

You can easily chain together multiple steps:

  • sample() – sample items from a dataset
  • read() – read entire dataset
  • generate_text() – generate text using an LLM
  • generate_json() – generate JSON output and extract a specific field
  • write_jsonl() – write output to a JSONL file
  • write_csv() – write output to a CSV file
  • print() – print outputs
  • debug() – enable detailed debugging
  • log() – set log level
  • python step – add custom Python-defined step classes
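
The "generate JSON output and extract a specific field" behaviour of generate_json() can be pictured with the small sketch below. It assumes the model's reply is a JSON object and returns None when it is not; how tweaktune itself treats malformed replies is not stated here, so that fallback is an assumption, and `extract_field` is a hypothetical helper rather than part of the library.

```python
import json


def extract_field(reply, field):
    """Return reply[field] if `reply` is a JSON object containing it, else None."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        # The model answered with something that is not valid JSON.
        return None
    if isinstance(data, dict):
        return data.get(field)
    return None


good = extract_field('{"persona": "Jan Kowalski is a beekeeper."}', "persona")
bad = extract_field("Sorry, I cannot answer that.", "persona")
print(good)
print(bad)
```

Checking for invalid replies matters in practice because LLMs occasionally wrap JSON in extra text, even when the prompt asks them not to.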

Why tweaktune?

  • Build synthetic datasets faster for fine-tuning models.
  • Automate text, JSON, or structured data generation.
  • Stay flexible: plug in your own LLM API or use an existing OpenAI-compatible one.
  • Rust speed, Python usability.

📦 Installation

pip install tweaktune

🤝 Contributing

We welcome contributions! Feel free to open issues, suggest features, or create pull requests.

Please note that by contributing to this project, you agree to the terms of the Contributor License Agreement (CLA).

Download files

Source Distributions

No source distribution files are available for this release.

Built Distribution

tweaktune-0.0.1a3-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (8.3 MB)

Uploaded for: CPython 3.8+, manylinux (glibc 2.17+), x86-64

Hashes for tweaktune-0.0.1a3-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

Algorithm    Hash digest
SHA256       c2ac15a24f956c93491686a1b31170f8e009500efd2572eb49fe2732d8bfb5b5
MD5          55565e9e89bd943b456d6a849f4978cc
BLAKE2b-256  199d371696b71c7355ad933683d3a28528bd3a3a78b65d1489c9bba854181707
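
If you download the wheel manually, the SHA256 digest listed above can be checked with a few lines of standard-library Python. This is a minimal sketch: `sha256_of_stream` is a hypothetical helper, and the file check assumes the wheel sits in the current directory under its published name.

```python
import hashlib
import io
import os

EXPECTED_SHA256 = "c2ac15a24f956c93491686a1b31170f8e009500efd2572eb49fe2732d8bfb5b5"
WHEEL = "tweaktune-0.0.1a3-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl"


def sha256_of_stream(stream, chunk_size=1 << 20):
    """Hash a binary stream in chunks so large files are not loaded at once."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


# Sanity check of the helper on the well-known test input b"abc".
demo = sha256_of_stream(io.BytesIO(b"abc"))

# Compare against the published digest only if the wheel is actually present.
if os.path.exists(WHEEL):
    with open(WHEEL, "rb") as handle:
        print("match" if sha256_of_stream(handle) == EXPECTED_SHA256 else "MISMATCH")
```

pip can perform the same check automatically when a requirements file pins hashes, so the manual route is mainly useful for one-off downloads.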
