A Python package for synthesizing datasets for training and fine-tuning AI models.

tweaktune

tweaktune is a Rust-powered, Python-facing library designed to synthesize datasets for training and fine-tuning AI models, especially LMs (Language Models).
It allows you to easily build data pipelines, generate new examples using LLM APIs, and create structured datasets from a variety of sources.


Features

  • Flexible Data Sources:
    Supports datasets from:

    • Parquet files
    • CSV files
    • JSONL files
    • Arrow datasets
    • OpenAPI specifications (for function calling datasets)
    • Lists of tools (Python functions for function calling datasets)
    • Pydantic models (for structured output datasets)
  • LLM Integration:
    Connects to any LLM API to generate synthetic text or structured JSON.

  • Dynamic Prompting:
    Supports Jinja templates for highly customizable prompts.

  • Parallel Processing:
    Configure multiple workers to run your pipeline steps in parallel.

  • Easy Pipeline Building:
    Compose steps like sampling, generating, writing, or debugging into a seamless pipeline.
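
To make the "lists of tools" data source more concrete, here is a minimal sketch of how a plain Python function could be described as a function-calling tool definition. It is not tweaktune's internal conversion: the helper name `function_to_tool_schema`, the example function `get_weather`, and the OpenAI-style schema layout are all assumptions made for illustration.

```python
import inspect
import json

# Hypothetical mapping from Python annotations to JSON Schema type names.
_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def function_to_tool_schema(fn):
    """Describe a Python function as an OpenAI-style tool definition (illustrative only)."""
    signature = inspect.signature(fn)
    properties = {}
    required = []
    for name, param in signature.parameters.items():
        # Fall back to "string" when the annotation is missing or not in the mapping.
        properties[name] = {"type": _JSON_TYPES.get(param.annotation, "string")}
        # Parameters without a default value are treated as required.
        if param.default is inspect.Parameter.empty:
            required.append(name)
    return {
        "type": "function",
        "function": {
            "name": fn.__name__,
            "description": inspect.getdoc(fn) or "",
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }


def get_weather(city: str, days: int = 1):
    """Return a weather forecast for a city."""
    return {"city": city, "days": days}


schema = function_to_tool_schema(get_weather)
print(json.dumps(schema, indent=2))
```

A schema like this is the kind of structure a function-calling dataset pairs with user requests and expected tool calls; the exact format tweaktune emits may differ.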


Quick Example

Here's how you can build a dataset from a Parquet file and synthesize new data using an LLM API:

from tweaktune import Pipeline
import os

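# The prompt is written in Polish. In English it reads: "Based on the text fragment
# below, describe a persona associated with it. Invent a fictional first and last name
# for that person. Write two sentences about them and return the description as JSON,
# adding nothing else." The section marker "FRAGMENT TEKSTU" means "TEXT FRAGMENT".
persona_template = """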
Na podstawie poniższego fragmentu tekstu opisz personę która jest z nim związana.
Dla opisywanej osoby wymyśl fikcyjne imię i nazwisko.
Napisz dwa zdania na temat tej osoby, opis zwróć w formacie json, nie dodawaj nic więcej:
{"persona":"opis osoby"}

---
FRAGMENT TEKSTU:

{{article[0].text}}
"""

url = "http://localhost:8000/"
api_key = os.environ["API_KEY"]
model = "model"

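# Register resources (workers, dataset, LLM, templates), then chain the per-iteration steps.
p = (
    Pipeline()
    .with_workers(5)  # number of parallel workers
    .with_parquet_dataset("web_articles", "../../datasets/articles.pq")
    .with_llm_api("bielik", url, api_key, model)
    .with_template("persona", persona_template)
    .with_template("output", """{"persona": {{persona|jstr}} }""")
    .iter_range(10000)  # run the steps below for 10,000 iterations
        # Draw one article and expose it to templates as "article".
        .sample(dataset="web_articles", size=1, output="article")
        # Ask the LLM for JSON and keep only the "persona" field.
        .generate_json(template="persona", llm="bielik", output="persona", json_path="persona")
        # Append one rendered "output" line per iteration.
        .write_jsonl(path="../../datasets/personas.jsonl", template="output")
    .run()
)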
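
To clarify what one iteration of the pipeline above does conceptually, here is a plain-Python sketch of the sample, generate, extract, and write steps. It does not use tweaktune: `fake_llm` and `run_iteration` are hypothetical stand-ins, the prompt is simplified to string concatenation instead of a Jinja template, and `json.dumps` plays the role assumed here for the `jstr` filter (JSON-encoding a string).

```python
import json
import random


def fake_llm(prompt):
    """Stand-in for a real LLM call; returns the JSON reply the prompt asks for."""
    return json.dumps(
        {"persona": "Jan Kowalski is a fictional journalist. He covers local news."}
    )


def run_iteration(articles, llm, rng):
    # Like sample(dataset=..., size=1, output="article"): pick one row at random.
    article = rng.sample(articles, 1)
    # Like rendering the "persona" template, which reads {{article[0].text}}.
    prompt = "Describe a persona related to this text:\n" + article[0]["text"]
    # Like generate_json(..., json_path="persona"): parse the reply, keep one field.
    persona = json.loads(llm(prompt))["persona"]
    # Like write_jsonl(template="output"): build one JSON line for the output file.
    return '{"persona": ' + json.dumps(persona, ensure_ascii=False) + "}"


articles = [
    {"text": "The city council approved a new bike lane."},
    {"text": "A local bakery won a regional award."},
]
line = run_iteration(articles, fake_llm, random.Random(0))
print(line)
```

In the real pipeline each such line would be appended to `personas.jsonl`, and the iterations may be spread across the configured workers.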

Pipeline Steps

You can easily chain together multiple steps:

  • sample() – sample items from a dataset
  • read() – read entire dataset
  • generate_text() – generate text using an LLM
  • generate_json() – generate JSON output and extract a specific field
  • write_jsonl() – write output to a JSONL file
  • write_csv() – write output to a CSV file
  • print() – print outputs
  • debug() – enable detailed debugging
  • log() – set log level
  • python step – add custom Python-defined step classes
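
Because `write_jsonl()` produces a JSONL file (one JSON object per line), its output can be inspected with the standard library alone. The sketch below assumes records shaped like the Quick Example's output template; the helper `read_jsonl` and the sample file contents are hypothetical.

```python
import json
import os
import tempfile


def read_jsonl(path):
    """Load a JSONL file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Write a tiny sample file so the sketch runs without a real pipeline output.
sample_path = os.path.join(tempfile.mkdtemp(), "personas.jsonl")
with open(sample_path, "w", encoding="utf-8") as handle:
    handle.write('{"persona": "Anna Nowak is a fictional teacher."}\n')
    handle.write('{"persona": "Piotr Zielinski is a fictional engineer."}\n')

records = read_jsonl(sample_path)
print(len(records), "records; first persona:", records[0]["persona"])
```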

Why tweaktune?

  • Build synthetic datasets faster for fine-tuning models.
  • Automate text, JSON, or structured data generation.
  • Stay flexible: plug in your own LLM API or use existing OpenAI-compatible ones.
  • Rust speed, Python usability.

📦 Installation

pip install tweaktune
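
After installing, a quick check like the following can confirm that the package is visible to the current interpreter. This is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical. It may be useful because this release ships a single wheel for CPython 3.8+ on x86-64 Linux (manylinux, glibc 2.17+) with no source distribution, so installation on other platforms may not succeed.

```python
import platform
import sys
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


print("Python:", sys.version_info[:2])
print("Platform:", platform.system(), platform.machine())
print("tweaktune:", installed_version("tweaktune") or "not installed")
```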

🤝 Contributing

We welcome contributions! Feel free to open issues, suggest features, or create pull requests.

Please note that by contributing to this project, you agree to the terms of the Contributor License Agreement (CLA).

Download files

Source Distributions

No source distribution files are available for this release.

Built Distribution

tweaktune-0.0.1a9-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (32.4 MB)

Uploaded for: CPython 3.8+, manylinux (glibc 2.17+), x86-64

File details

Details for the file tweaktune-0.0.1a9-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for tweaktune-0.0.1a9-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 e4b8d5bbe447b5b604745b773165c316b527b8325f9a3c203c1c922784b28500
MD5 fb1b5a47ca87a2828c42f7053f781049
BLAKE2b-256 8ac71a4e45a31e39a69a78600579830611bac51eab6b5a05decbf005864aa466
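
The SHA256 digest listed above can be compared against a downloaded wheel before installing it. The following is a minimal sketch using `hashlib`; the helper names `sha256_of_file` and `verify_wheel` are hypothetical, and the path passed to `verify_wheel` would be wherever the wheel was saved locally.

```python
import hashlib
import os
import tempfile

# SHA256 digest published for the 0.0.1a9 manylinux wheel (copied from the table above).
EXPECTED_SHA256 = "e4b8d5bbe447b5b604745b773165c316b527b8325f9a3c203c1c922784b28500"


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large wheels are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_wheel(path, expected=EXPECTED_SHA256):
    """Return True when the file's SHA256 digest matches the expected value."""
    return sha256_of_file(path) == expected


# Self-check on a throwaway file, since the real wheel is not bundled with this sketch.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.bin")
with open(demo_path, "wb") as handle:
    handle.write(b"tweaktune")
demo_ok = sha256_of_file(demo_path) == hashlib.sha256(b"tweaktune").hexdigest()
print("self-check passed:", demo_ok)
```

With a real download, `verify_wheel("path/to/downloaded.whl")` should return True only if the file is byte-for-byte the published wheel.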
