Utilities for building code and document fine-tuning datasets.

Project description

fine_tuning_dataset_preparation

Comprehensive toolkit for turning codebases and Markdown documentation into fine-tuning datasets for LLMs. It ships a code pipeline (tree-sitter extraction, instruction generation, optional paraphrasing), a document pipeline (Markdown segmentation, exhaustive Q/A generation, multi-file support), shared exporters for Gemini/OpenAI/Q/A JSONL, and configurable prompts. Tests run on pytest to keep changes safe.

Installation

pip install -e .

Python 3.10+ is recommended. Before running pipelines, set your LLM credentials:

GOOGLE_API_KEY for Gemini (default)
or OPENAI_API_KEY / ANTHROPIC_API_KEY if you switch providers

Code pipeline

Create instruction datasets from a repository (single or multiple projects).

CLI (see examples/run_code_pipeline.py):

python examples/run_code_pipeline.py

Programmatic use:

from fine_tuning_dataset_preparation.code_dataset import PromptConfig
from fine_tuning_dataset_preparation.code_dataset.pipeline import code_pipeline

code_pipeline(
    project_path="path/to/repo",
    multi_project=True,                # treat subfolders as projects
    dataset_path="dataset.jsonl",
    llm_provider="gemini",
    model_name="gemini-2.5-flash",
    instruction_concurrency=8,
    instruction_temperature=0.7,
    prompt_config=PromptConfig(
        instruction_hint="Keep instructions concise and actionable.",
        paraphrase_hint="Return one alternative phrasing.",
    ),
    paraphrase_variations=1,           # optional paraphrasing
    paraphrase_temperature=0.9,
    exports=[
        {"target": "gemini", "output_path": "gemini_dataset.jsonl", "options": {"jsonl": True}},
        {"target": "openai", "output_path": "openai_dataset.jsonl", "options": {"jsonl": True}},
    ],
)

Key arguments: project_path, multi_project, instruction_concurrency, instruction_temperature, prompt_config, optional paraphrase_*, and exports with targets gemini or openai.

Document pipeline

Generate Q/A datasets from Markdown. You can point to a single file, a directory, or a list of files.

CLI (see examples/run_document_pipeline.py):

python examples/run_document_pipeline.py

Programmatic use:

from fine_tuning_dataset_preparation.document_dataset import run_document_pipeline
from fine_tuning_dataset_preparation.document_dataset.dataset import DocumentPromptConfig

run_document_pipeline(
    markdown_dir="docs",               # or markdown_path="file.md" or markdown_paths=["a.md", "b.md"]
    output_path="document_dataset.json",
    min_total_pairs=1,
    llm_provider="gemini",
    model_name="gemini-2.5-pro",
    prompt_config=DocumentPromptConfig(
        system_message="Use only the provided docs; end with attribution.",
        instructions=[
            "Use only supplied documentation fragments.",
            "If missing, say it is not specified.",
            "End every answer with the attribution line.",
        ],
        attribution="Information sourced from ACME Docs © 2025.",
    ),
    exports=[
        {"target": "qa_openai", "output_path": "doc_openai.jsonl", "options": {"jsonl": True}},
        {"target": "qa_gemini", "output_path": "doc_gemini.jsonl", "options": {"jsonl": True}},
        {"target": "qa_jsonl", "output_path": "doc_pairs.jsonl"},
    ],
)

Key arguments: markdown_path or markdown_dir or markdown_paths, min_total_pairs, prompt_config for tone/attribution, and exports with Q/A targets qa_openai, qa_gemini, qa_jsonl.

Exporters

All exporters live in fine_tuning_dataset_preparation/common/exporters. Use them directly or via the pipeline exports argument.

from fine_tuning_dataset_preparation.common.exporters import export_dataset

export_dataset(
    target="qa_openai",                 # gemini | openai | qa_openai | qa_gemini | qa_jsonl
    output_path="out.jsonl",
    pairs=[{"question": "...", "answer": "..."}],  # or instruction records when using instruction targets
    options={"jsonl": True},
)

Instruction targets: gemini, openai. Q/A targets: qa_openai, qa_gemini, qa_jsonl.

Project structure

fine_tuning_dataset_preparation/code_dataset: tree-sitter extraction, instruction generation, paraphrasing
fine_tuning_dataset_preparation/document_dataset: Markdown ingestion, Q/A generation, prompt helpers
fine_tuning_dataset_preparation/common: LLM utilities, text helpers, exporters
examples/: runnable scripts for code and document pipelines, plus export helper
tests/: pytest suite organized by domain

Testing

pytest
pytest --cov=. --cov-report=term-missing

Tips

Match provider extras to the model you choose; the pinned requirements already bring tree-sitter grammars.
Export your API key(s) before running examples to avoid partial or templated outputs.
Pick models by task: fast models (Gemini 2.5 Flash, small OpenAI tiers) for bulk coverage/paraphrases; higher-fidelity models (Gemini 2.5 Pro, GPT-4.1) for final passes or sensitive Q/A. Raise temperature (0.7–0.9) for paraphrasing; keep it lower (0.2–0.5) for deterministic instruction/Q/A generation.

Project details

Release history Release notifications | RSS feed

This version

0.1.0

Nov 18, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

fine_tuning_dataset_preparation-0.1.0.tar.gz (26.1 kB view details)

Uploaded Nov 18, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

fine_tuning_dataset_preparation-0.1.0-py3-none-any.whl (36.3 kB view details)

Uploaded Nov 18, 2025 Python 3

File details

Details for the file fine_tuning_dataset_preparation-0.1.0.tar.gz.

File metadata

Download URL: fine_tuning_dataset_preparation-0.1.0.tar.gz
Upload date: Nov 18, 2025
Size: 26.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for fine_tuning_dataset_preparation-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`6769ebb312887caa18aaf802a4dda83351974f287056c9805c1cb3d712a774fb`
MD5	`0a43e01ebeab250d82966f6ea3efe301`
BLAKE2b-256	`8309b9320b6a205d4ab44bd8160fdd3264d09703603b4dff5964d3b3a574876a`

See more details on using hashes here.

File details

Details for the file fine_tuning_dataset_preparation-0.1.0-py3-none-any.whl.

File metadata

Download URL: fine_tuning_dataset_preparation-0.1.0-py3-none-any.whl
Upload date: Nov 18, 2025
Size: 36.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for fine_tuning_dataset_preparation-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`3a07ff2355a2b943a41daadc2ea5b43e37ad4a3ba62978dc000bb162f17df8d2`
MD5	`4b0d80f92b4dd9d192a980aa324c68e5`
BLAKE2b-256	`9e470d146e4929332bfa72c4cecae81d01bb2563069245becb8218b8ec2e51d8`

See more details on using hashes here.

fine-tuning-dataset-preparation 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Project description

fine_tuning_dataset_preparation

Installation

Code pipeline

Document pipeline

Exporters

Project structure

Testing

Tips

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes