A Python package for synthesizing datasets for training and fine-tuning AI models.

Project description

tweaktune

tweaktune is a Rust-powered, Python-facing library for synthesizing datasets used to train and fine-tune AI models, especially language models (LMs).
It allows you to easily build data pipelines, generate new examples using LLM APIs, and create structured datasets from a variety of sources.


Features

  • Flexible Data Sources:
    Supports datasets from:

    • Parquet files
    • CSV files
    • JSONL files
    • Arrow datasets
    • OpenAPI specifications (for function calling datasets)
    • Lists of tools (Python functions for function calling datasets)
    • Pydantic models (for structured output datasets)
  • LLM Integration:
    Connects to any LLM API to generate synthetic text or structured JSON.

  • Dynamic Prompting:
    Supports Jinja templates for highly customizable prompts.

  • Parallel Processing:
    Configure multiple workers to run your pipeline steps in parallel.

  • Easy Pipeline Building:
    Compose steps like sampling, generating, writing, or debugging into a seamless pipeline.


Quick Example

Here's how you can build a dataset from a Parquet file and synthesize new data using an LLM API:

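import os

from tweaktune import Pipeline

# Prompt template (kept in Polish for the "bielik" model). English translation:
#   "Based on the text fragment below, describe a persona associated with it.
#    Invent a fictional first and last name for the person described.
#    Write two sentences about this person; return the description in JSON
#    format and add nothing else: {"persona": "description of the person"}"
# "FRAGMENT TEKSTU" means "TEXT FRAGMENT".
persona_template = """
Na podstawie poniższego fragmentu tekstu opisz personę która jest z nim związana.
Dla opisywanej osoby wymyśl fikcyjne imię i nazwisko.
Napisz dwa zdania na temat tej osoby, opis zwróć w formacie json, nie dodawaj nic więcej:
{"persona":"opis osoby"}

---
FRAGMENT TEKSTU:

{{article[0].text}}
"""

# Endpoint of an OpenAI-compatible API, the API key (read from the
# environment), and the model name to request.
url = "http://localhost:8000/"
api_key = os.environ["API_KEY"]
model = "model"

# Parentheses replace the backslash continuations so steps can carry comments.
p = (
    Pipeline()
    .with_workers(5)  # number of parallel workers
    .with_parquet_dataset("web_articles", "../../datasets/articles.pq")
    .with_openai_llm("bielik", url, api_key, model)
    .with_template("persona", persona_template)
    .with_template("output", """{"persona": {{persona|jstr}} }""")
    .iter_range(10000)  # number of iterations of the steps below
        # Sample one article; the template reads it as article[0].
        .sample(dataset="web_articles", size=1, output="article")
        # Ask the LLM for JSON and keep the "persona" field.
        .generate_json(template="persona", llm="bielik", output="persona", json_path="persona")
        # Write one line per iteration, rendered with the "output" template.
        .write_jsonl(path="../../datasets/personas.jsonl", template="output")
    .run()
)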
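
The "output" template in the example passes the generated text through the jstr filter before placing it in a JSON object. The sketch below illustrates why that kind of escaping matters, using only the standard library. It assumes jstr behaves like JSON string escaping (json.dumps stands in for it here); render_output_line, read_personas, and the sample text are hypothetical and are not part of tweaktune.

```python
import json


def render_output_line(persona: str) -> str:
    # Mirrors the shape of the "output" template: {"persona": <escaped string> }
    # json.dumps is a stand-in for the jstr filter (assumed behaviour).
    return '{"persona": ' + json.dumps(persona, ensure_ascii=False) + " }"


def read_personas(lines):
    # Parse each non-empty JSONL line and collect its "persona" field.
    return [json.loads(line)["persona"] for line in lines if line.strip()]


# Hypothetical model output containing quotes and a line break.
sample = 'Jan "Janek" Kowalski is a journalist.\nHe writes about local politics.'

# Escaped rendering stays on one line and parses back to the original text.
line = render_output_line(sample)
personas = read_personas([line])

# Naive string interpolation produces invalid JSON for the same input.
naive = '{"persona": "' + sample + '"}'
try:
    json.loads(naive)
    naive_is_valid = True
except json.JSONDecodeError:
    naive_is_valid = False
```

The same read_personas helper can be pointed at the lines of a generated JSONL file to spot-check the output after a run.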

Pipeline Steps

You can easily chain together multiple steps:

  • sample() – sample items from a dataset
  • read() – read entire dataset
  • generate_text() – generate text using an LLM
  • generate_json() – generate JSON output and extract a specific field
  • write_jsonl() – write output to a JSONL file
  • write_csv() – write output to a CSV file
  • print() – print outputs
  • debug() – enable detailed debugging
  • log() – set log level
  • python step – add custom Python-defined step classes
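
To make the flow of data between steps concrete, here is a conceptual sketch in plain Python of a sample, generate, write loop. It is not tweaktune's API or implementation: every function below is a hypothetical stand-in, and fake_llm replaces a real LLM call so the sketch runs offline.

```python
import io
import json
import random


def sample(dataset, size, rng):
    # Pick `size` random items, like a sampling step would.
    return rng.sample(dataset, size)


def fake_llm(prompt):
    # Stand-in for a real LLM call; returns a JSON string.
    return json.dumps({"persona": "A person connected with: " + prompt[:20]})


def generate_json(prompt, field):
    # Parse the model response and keep only the requested field.
    return json.loads(fake_llm(prompt))[field]


def write_jsonl(sink, record):
    # Append one JSON object per line.
    sink.write(json.dumps(record, ensure_ascii=False) + "\n")


def run(dataset, iterations, seed=0):
    rng = random.Random(seed)
    sink = io.StringIO()  # in-memory stand-in for an output file
    for _ in range(iterations):
        article = sample(dataset, 1, rng)
        persona = generate_json(article[0]["text"], "persona")
        write_jsonl(sink, {"persona": persona})
    return sink.getvalue()


# Made-up input records for illustration only.
articles = [
    {"text": "City council approves new tram line"},
    {"text": "Local bakery wins national award"},
]
output = run(articles, iterations=3)
```

Each iteration passes the sampled item to the generation step and the extracted field to the writer, which is the same hand-off the named outputs express in the chained pipeline.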

Why tweaktune?

  • Build synthetic datasets faster for fine-tuning models.
  • Automate text, JSON, or structured data generation.
  • Stay flexible: plug your own LLM API or use existing OpenAI-compatible ones.
  • Rust speed, Python usability.

📦 Installation

pip install tweaktune
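
The only file listed for release 0.0.1a1 is a manylinux x86-64 wheel for CPython 3.8+, with no source distribution, so pip may not find an installable file on other platforms. The check below is a rough sketch of that constraint using the standard library. wheel_is_compatible is a hypothetical helper, and it does not verify the glibc 2.17+ requirement or the Python implementation.

```python
import platform
import sys


def wheel_is_compatible(version_info=None, system=None, machine=None) -> bool:
    # Default to the running interpreter and host when no values are given.
    version_info = version_info or sys.version_info[:2]
    system = system or platform.system()
    machine = machine or platform.machine()
    # Python 3.8 or newer, on Linux, on x86-64.
    return (
        tuple(version_info[:2]) >= (3, 8)
        and system == "Linux"
        and machine == "x86_64"
    )


if not wheel_is_compatible():
    print("The published tweaktune wheel may not match this platform.")
```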

Download files

Source Distributions

No source distribution files available for this release.

Built Distribution

tweaktune-0.0.1a1-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (8.3 MB view details)

Uploaded: CPython 3.8+, manylinux: glibc 2.17+ x86-64

File details

Details for the file tweaktune-0.0.1a1-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for tweaktune-0.0.1a1-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

Algorithm    Hash digest
SHA256       3d73f55220dfdb3c5f9aaf97270b0ac7136aa04ed982b721e203377462d9481d
MD5          367a2713a55b2b84409232a267815191
BLAKE2b-256  554dd461b3027f17638d992064295fca37906e6260c58401f07f3a899fd4f52e
