
A Python package for synthesizing datasets for training and fine-tuning AI models.


tweaktune

tweaktune is a Rust-powered, Python-facing library designed to synthesize datasets for training and fine-tuning AI models, especially LMs (Language Models).
It allows you to easily build data pipelines, generate new examples using LLM APIs, and create structured datasets from a variety of sources.


Features

  • Flexible Data Sources:
    Supports datasets from:

    • Parquet files
    • CSV files
    • JSONL files
    • Arrow datasets
    • OpenAPI specifications (for function calling datasets)
    • Lists of tools (Python functions for function calling datasets)
    • Pydantic models (for structured output datasets)
  • LLM Integration:
    Connects to any LLM API to generate synthetic text or structured JSON.

  • Dynamic Prompting:
    Supports Jinja templates for highly customizable prompts.

  • Parallel Processing:
    Configure multiple workers to run your pipeline steps in parallel.

  • Easy Pipeline Building:
    Compose steps like sampling, generating, writing, or debugging into a seamless pipeline.
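
To make the function-calling sources listed above more concrete, here is a minimal sketch of how a plain Python function can be turned into a tool description of the kind used in function-calling datasets. It uses only the standard library. The `describe_tool` helper, the type mapping, and the output layout are assumptions made for illustration; they are not tweaktune's actual conversion.

```python
import inspect
import json

# Hypothetical mapping from Python annotations to JSON Schema type names.
_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def describe_tool(func):
    """Build a JSON-schema-like description of `func` (illustrative helper)."""
    properties = {}
    required = []
    for name, param in inspect.signature(func).parameters.items():
        # Fall back to "string" for missing or unknown annotations.
        properties[name] = {"type": _JSON_TYPES.get(param.annotation, "string")}
        # Parameters without a default value are treated as required.
        if param.default is inspect.Parameter.empty:
            required.append(name)
    return {
        "name": func.__name__,
        "description": inspect.getdoc(func) or "",
        "parameters": {
            "type": "object",
            "properties": properties,
            "required": required,
        },
    }


def get_weather(city: str, days: int = 1) -> str:
    """Return a short weather forecast for a city."""
    return f"Forecast for {city} for {days} day(s)"


if __name__ == "__main__":
    print(json.dumps(describe_tool(get_weather), indent=2))
```

The docstring becomes the tool description and the signature becomes the parameter schema, so a well-annotated function needs no extra metadata.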


Quick Example

Here's how you can build a dataset from a Parquet file and synthesize new data using an LLM API:

from tweaktune import Pipeline
import os

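# The prompt is written in Polish. English translation:
#   "Based on the text fragment below, describe a persona associated with it.
#   Invent a fictional first and last name for the described person.
#   Write two sentences about this person; return the description in JSON
#   format and add nothing else."
#   "FRAGMENT TEKSTU" means "TEXT FRAGMENT".
persona_template = """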
Na podstawie poniższego fragmentu tekstu opisz personę która jest z nim związana.
Dla opisywanej osoby wymyśl fikcyjne imię i nazwisko.
Napisz dwa zdania na temat tej osoby, opis zwróć w formacie json, nie dodawaj nic więcej:
{"persona":"opis osoby"}

---
FRAGMENT TEKSTU:

{{article[0].text}}
"""

url = "http://localhost:8000/"
api_key = os.environ["API_KEY"]
model = "model"

p = Pipeline()\
    .with_workers(5)\
    .with_parquet_dataset("web_articles", "../../datasets/articles.pq")\
    .with_llm_api("bielik", url, api_key, model)\
    .with_template("persona", persona_template)\
    .with_template("output", """{"persona": {{persona|jstr}} }""")\
    .iter_range(10000)\
        .sample(dataset="web_articles", size=1, output="article")\
        .generate_json(template="persona", llm="bielik", output="persona", json_path="persona")\
        .write_jsonl(path="../../datasets/personas.jsonl", template="output")\
    .run()
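
The `output` template above pipes `persona` through a `jstr` filter before placing it inside a JSON object. Assuming that filter serializes the value as a JSON string literal (an assumption; its exact behavior is not described here), the sketch below uses the standard library's `json.dumps` to show why such escaping matters when generated text contains quotes or newlines. The function names are hypothetical.

```python
import json


def naive_line(persona: str) -> str:
    # Plain string interpolation: breaks if the text contains quotes or newlines.
    return '{"persona": "' + persona + '"}'


def escaped_line(persona: str) -> str:
    # json.dumps emits a quoted, escaped JSON string literal.
    return '{"persona": ' + json.dumps(persona, ensure_ascii=False) + " }"


if __name__ == "__main__":
    text = 'Jan Kowalski, known as "Janek",\nwrites about local history.'
    try:
        json.loads(naive_line(text))
    except json.JSONDecodeError as exc:
        print("naive interpolation produced invalid JSON:", exc)
    # The escaped variant parses and round-trips the original text.
    print(json.loads(escaped_line(text))["persona"] == text)
```

Without escaping, a single stray quote in a generated persona would leave an unparseable line in the JSONL file.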

Pipeline Steps

You can easily chain together multiple steps:

  • sample() – sample items from a dataset
  • read() – read an entire dataset
  • generate_text() – generate text using an LLM
  • generate_json() – generate JSON output and extract a specific field
  • write_jsonl() – write output to a JSONL file
  • write_csv() – write output to a CSV file
  • print() – print outputs
  • debug() – enable detailed debugging
  • log() – set log level
  • python step – add custom Python-defined step classes
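
After a pipeline ends with `write_jsonl()`, it can help to check the output file before using it for training. The sketch below is independent of tweaktune and uses only the standard library. The `check_jsonl` helper and the default `persona` key (taken from the Quick Example) are illustrative assumptions.

```python
import json
from pathlib import Path


def check_jsonl(path, required_key="persona"):
    """Return (valid_count, problems) for a JSONL file."""
    valid = 0
    problems = []
    with Path(path).open(encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # Ignore blank lines.
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((line_no, f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(record, dict) or required_key not in record:
                problems.append((line_no, f"missing key {required_key!r}"))
                continue
            valid += 1
    return valid, problems


if __name__ == "__main__":
    import sys

    count, issues = check_jsonl(sys.argv[1])
    print(f"{count} valid records, {len(issues)} problems")
    for line_no, message in issues:
        print(f"  line {line_no}: {message}")
```

Run it with the path of the generated file as the only argument, for example the `personas.jsonl` file from the Quick Example.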

Why tweaktune?

  • Build synthetic datasets faster for fine-tuning models.
  • Automate text, JSON, or structured data generation.
  • Stay flexible: plug in your own LLM API or use an existing OpenAI-compatible one.
  • Rust speed, Python usability.

📦 Installation

pip install tweaktune
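
The only file published for release 0.0.1a8 is a manylinux x86-64 wheel for CPython 3.8+, with no source distribution, so `pip install` may not find a matching file on other platforms. A minimal sketch for checking the current interpreter against those tags follows. The `wheel_should_match` helper is hypothetical, and the check is a rough approximation: it does not verify the glibc version or the Python implementation.

```python
import platform
import sys


def wheel_should_match(version_info, system, machine):
    """Rough check against the cp38-abi3 manylinux x86_64 wheel tags."""
    return (
        tuple(version_info[:2]) >= (3, 8)
        and system == "Linux"
        and machine == "x86_64"
    )


if __name__ == "__main__":
    ok = wheel_should_match(sys.version_info, platform.system(), platform.machine())
    print("wheel likely installable" if ok else "no matching wheel for this platform")
```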

🤝 Contributing

We welcome contributions! Feel free to open issues, suggest features, or create pull requests.

Please note that by contributing to this project, you agree to the terms of the Contributor License Agreement (CLA).


Built Distribution


tweaktune-0.0.1a8-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (32.5 MB)

Uploaded for CPython 3.8+, manylinux (glibc 2.17+), x86-64
